Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce hadoop mini cluster to test native scan on hdfs #1556

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

wForget
Copy link
Member

@wForget wForget commented Mar 19, 2025

Which issue does this PR close?

Closes #1515.

Rationale for this change

test native scan on hdfs

What changes are included in this PR?

introduce hadoop mini cluster to test native scan on hdfs

How are these changes tested?

Successfully run CometReadHdfsBenchmark locally (tips: build native enable hdfs: cd native && cargo build --features hdfs)

@@ -68,7 +68,7 @@ datafusion-comet-proto = { workspace = true }
object_store = { workspace = true }
url = { workspace = true }
parking_lot = "0.12.3"
datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disable try_spawn_blocking to avoid native thread hanging

import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys
import org.apache.spark.internal.Logging

trait WithHdfsCluster extends Logging {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most copy from kyuubi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave a comment that this was taken from kyuubi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added

@codecov-commenter
Copy link

codecov-commenter commented Mar 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.39%. Comparing base (f09f8af) to head (cb7d0b2).
Report is 96 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1556      +/-   ##
============================================
+ Coverage     56.12%   58.39%   +2.26%     
- Complexity      976      977       +1     
============================================
  Files           119      122       +3     
  Lines         11743    12217     +474     
  Branches       2251     2280      +29     
============================================
+ Hits           6591     7134     +543     
+ Misses         4012     3951      -61     
+ Partials       1140     1132       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -447,6 +448,13 @@ under the License.
<version>5.1.0</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-minicluster</artifactId>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
Not sure what the difference is as long as both allow us to spin up a miniDFSCluster

Copy link
Member Author

@wForget wForget Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that hadoop-client-minicluster has fewer dependencies, and it depends on hadoop-client-runtime which is a shaded hadoop client (to avoid introducing conflicts)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/hadoop/blob/trunk/hadoop-client-modules/hadoop-client-minicluster/pom.xml

hadoop-client-minicluster seems to be a fat jar of hadoop mini cluster, so is it more suitable as a dependency for testing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I did not know that. It doesn't matter which one we use then.

@@ -63,6 +65,7 @@ object CometReadBenchmark extends CometBenchmarkBase {
sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
withSQLConf(
CometConf.COMET_ENABLED.key -> "true",
CometConf.COMET_EXEC_ENABLED.key -> "true",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while native scan requires COMET_EXEC_ENABLED=true

https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/benchmark/CometBenchmarkBase.scala#L53-L54

s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 68 is indeed redundant configuration, I will remove it.

@@ -58,6 +58,7 @@ under the License.
<protobuf.version>3.25.5</protobuf.version>
<parquet.version>1.13.1</parquet.version>
<parquet.maven.scope>provided</parquet.maven.scope>
<hadoop.version>3.3.4</hadoop.version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use the same hadoop version as what Spark uses? I.e. should we use maven profile to change versions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable hdfs test(s) in ci
4 participants