feat: introduce hadoop mini cluster to test native scan on hdfs #1556
base: main
Conversation
@@ -68,7 +68,7 @@ datafusion-comet-proto = { workspace = true }
 object_store = { workspace = true }
 url = { workspace = true }
 parking_lot = "0.12.3"
-datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
+datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }
This disables `try_spawn_blocking` to avoid native threads hanging.
import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys
import org.apache.spark.internal.Logging

trait WithHdfsCluster extends Logging {
Mostly copied from Kyuubi.
Let's leave a comment noting that this was taken from Kyuubi.
Thanks, added
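For reference, a minimal sketch of what such a trait could look like when built directly on Hadoop's `MiniDFSCluster`. The actual trait in this PR is adapted from Kyuubi; the member names below (`startHdfsCluster`, `stopHdfsCluster`, `hdfsFileSystem`) are illustrative, not the PR's real API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.hdfs.MiniDFSCluster
import org.apache.spark.internal.Logging

trait WithHdfsCluster extends Logging {
  // Illustrative member; the real trait is adapted from Kyuubi.
  protected var hdfsCluster: MiniDFSCluster = _

  protected def startHdfsCluster(): Unit = {
    val conf = new Configuration()
    // A single data node is enough for local scan tests.
    hdfsCluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
    hdfsCluster.waitClusterUp()
    logInfo(s"Started mini DFS cluster at ${hdfsCluster.getURI}")
  }

  protected def stopHdfsCluster(): Unit = {
    if (hdfsCluster != null) hdfsCluster.shutdown()
  }

  protected def hdfsFileSystem: FileSystem = hdfsCluster.getFileSystem
}
```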
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@             Coverage Diff              @@
##               main    #1556      +/-   ##
============================================
+ Coverage     56.12%   58.39%    +2.26%
- Complexity      976      977        +1
============================================
  Files           119      122        +3
  Lines         11743    12217      +474
  Branches       2251     2280       +29
============================================
+ Hits           6591     7134      +543
+ Misses         4012     3951       -61
+ Partials       1140     1132        -8
@@ -447,6 +448,13 @@ under the License.
       <version>5.1.0</version>
     </dependency>

+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client-minicluster</artifactId>
Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
Not sure what the difference is, as long as both allow us to spin up a MiniDFSCluster.
My understanding is that hadoop-client-minicluster has fewer dependencies, and it depends on hadoop-client-runtime, which is a shaded Hadoop client (to avoid introducing conflicts).
https://github.com/apache/hadoop/blob/trunk/hadoop-client-modules/hadoop-client-minicluster/pom.xml
hadoop-client-minicluster seems to be a fat jar of the Hadoop mini cluster, so is it more suitable as a test dependency?
Ah, I did not know that. It doesn't matter which one we use then.
@@ -63,6 +65,7 @@ object CometReadBenchmark extends CometBenchmarkBase {
     sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
       withSQLConf(
         CometConf.COMET_ENABLED.key -> "true",
+        CometConf.COMET_EXEC_ENABLED.key -> "true",
Why is this needed?
CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while native scan requires COMET_EXEC_ENABLED=true. See spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala, line 213 at commit 5224108:
s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled"
Line 68 is indeed a redundant configuration; I will remove it.
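For context, the resulting benchmark case looks roughly like the following sketch. The surrounding scaffolding (`sqlBenchmark`, `withSQLConf`, `spark`, `CometConf`) is assumed from the file being patched, and the query body is illustrative:

```scala
sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
  withSQLConf(
    CometConf.COMET_ENABLED.key -> "true",
    // Must be set explicitly: CometBenchmarkBase defaults
    // COMET_EXEC_ENABLED to false, and full native scan is
    // disabled when exec is disabled.
    CometConf.COMET_EXEC_ENABLED.key -> "true") {
    // Illustrative query; the real benchmark runs its own workload.
    spark.sql("SELECT sum(id) FROM parquetTable").collect()
  }
}
```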
@@ -58,6 +58,7 @@ under the License.
     <protobuf.version>3.25.5</protobuf.version>
     <parquet.version>1.13.1</parquet.version>
     <parquet.maven.scope>provided</parquet.maven.scope>
+    <hadoop.version>3.3.4</hadoop.version>
Does it make sense to use the same Hadoop version as Spark uses? I.e., should we use a Maven profile to change versions?
Which issue does this PR close?
Closes #1515.
Rationale for this change
Test native scan on HDFS.
What changes are included in this PR?
Introduce a Hadoop mini cluster to test native scan on HDFS.
How are these changes tested?
Successfully ran CometReadHdfsBenchmark locally (tip: build the native library with HDFS enabled: cd native && cargo build --features hdfs).
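As a rough sketch, an end-to-end run against the mini cluster could look like the following, reusing the illustrative `WithHdfsCluster` members sketched earlier. The object name, paths, and row count are hypothetical and do not mirror the PR's exact code:

```scala
import org.apache.spark.sql.SparkSession

object CometReadHdfsExample extends WithHdfsCluster {
  def main(args: Array[String]): Unit = {
    startHdfsCluster()
    try {
      val spark = SparkSession.builder()
        .master("local[2]")
        .config("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
        // String keys shown for self-containment; CometConf constants could be used instead.
        .config("spark.comet.enabled", "true")
        .config("spark.comet.exec.enabled", "true")
        .getOrCreate()

      // Write a small Parquet dataset into the mini cluster, then scan it back.
      val path = s"${hdfsCluster.getURI}/test/data.parquet"
      spark.range(1000).write.parquet(path)
      assert(spark.read.parquet(path).count() == 1000)

      spark.stop()
    } finally {
      stopHdfsCluster()
    }
  }
}
```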