feat: introduce hadoop mini cluster to test native scan on hdfs #1556


Merged — 3 commits into apache:main on Mar 25, 2025

Conversation

@wForget (Member) commented Mar 19, 2025

Which issue does this PR close?

Closes #1515.

Rationale for this change

Test the native scan on HDFS.

What changes are included in this PR?

Introduce a Hadoop mini cluster to test the native scan on HDFS.

How are these changes tested?

Successfully ran CometReadHdfsBenchmark locally (tip: build the native library with HDFS support enabled: `cd native && cargo build --features hdfs`).

@@ -68,7 +68,7 @@ datafusion-comet-proto = { workspace = true }
object_store = { workspace = true }
url = { workspace = true }
parking_lot = "0.12.3"
-datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
+datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }
@wForget (Member, Author):
Disable try_spawn_blocking to avoid hanging native threads.
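For context, `default-features = false` in a Cargo dependency declaration opts out of the crate's default feature set, so only the features listed explicitly are enabled. A sketch of the effect (assuming `try_spawn_blocking` is among the crate's default features, as the comment above suggests):

```toml
# Sketch of the dependency line in the relevant Cargo.toml.
# With default-features = false, only the explicitly listed "hdfs" feature
# is enabled; a default feature such as try_spawn_blocking (assumed here)
# is no longer pulled in.
datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }
```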

import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys
import org.apache.spark.internal.Logging

trait WithHdfsCluster extends Logging {
@wForget (Member, Author):
Mostly copied from Kyuubi.
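For readers who have not seen the Kyuubi helper, a minimal trait that starts and stops a MiniDFSCluster might look roughly like this (an illustrative sketch using the public Hadoop test API, not the code added by this PR):

```scala
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.MiniDFSCluster

// Illustrative sketch of a test mixin that manages a single-node HDFS cluster.
trait WithMiniDfsCluster {
  private var cluster: MiniDFSCluster = _

  def startHdfsCluster(): Unit = {
    val conf = new Configuration()
    // Keep the cluster's data in a throwaway temp directory.
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR,
      Files.createTempDirectory("minidfs").toString)
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
    cluster.waitClusterUp()
  }

  def stopHdfsCluster(): Unit = if (cluster != null) cluster.shutdown()

  // e.g. hdfs://localhost:<port>, usable as a Spark read/write path
  def hdfsUri: String = cluster.getURI.toString
}
```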

Contributor:
Let's leave a comment noting that this was taken from Kyuubi.

@wForget (Member, Author):
Thanks, added

@codecov-commenter commented Mar 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.39%. Comparing base (f09f8af) to head (cb7d0b2).
Report is 96 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1556      +/-   ##
============================================
+ Coverage     56.12%   58.39%   +2.26%     
- Complexity      976      977       +1     
============================================
  Files           119      122       +3     
  Lines         11743    12217     +474     
  Branches       2251     2280      +29     
============================================
+ Hits           6591     7134     +543     
+ Misses         4012     3951      -61     
+ Partials       1140     1132       -8     


@@ -447,6 +448,13 @@ under the License.
<version>5.1.0</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-minicluster</artifactId>
Contributor:
Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
Not sure what the difference is, as long as both allow us to spin up a MiniDFSCluster.

@wForget (Member, Author) commented Mar 21, 2025:
My understanding is that hadoop-client-minicluster has fewer dependencies; it depends on hadoop-client-runtime, which is a shaded Hadoop client (avoiding dependency conflicts).

@wForget (Member, Author):

https://github.com/apache/hadoop/blob/trunk/hadoop-client-modules/hadoop-client-minicluster/pom.xml

hadoop-client-minicluster seems to be a fat jar of the Hadoop mini cluster, so perhaps it is more suitable as a test dependency?

Contributor:
Ah, I did not know that. It doesn't matter which one we use then.
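For reference, pulling the shaded mini cluster in as a test-only dependency would look something like this (the `${hadoop.version}` property is illustrative; the exact version and scope in the PR may differ):

```xml
<!-- Shaded Hadoop mini cluster, test scope only; version property is illustrative -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-minicluster</artifactId>
  <version>${hadoop.version}</version>
  <scope>test</scope>
</dependency>
```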

@@ -63,6 +65,7 @@ object CometReadBenchmark extends CometBenchmarkBase {
sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
withSQLConf(
CometConf.COMET_ENABLED.key -> "true",
CometConf.COMET_EXEC_ENABLED.key -> "true",
Contributor:

Why is this needed?

@wForget (Member, Author):
CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while the native scan requires COMET_EXEC_ENABLED=true:

https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/benchmark/CometBenchmarkBase.scala#L53-L54

`s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled"`

@wForget (Member, Author):

The configuration on line 68 is indeed redundant; I will remove it.

@kazuyukitanimura (Contributor) left a comment:

My only remaining question is whether we should enable native exec for the microbenchmarks.

@@ -71,6 +73,7 @@ object CometReadBenchmark extends CometBenchmarkBase {
sqlBenchmark.addCase("SQL Parquet - Comet Native DataFusion") { _ =>
withSQLConf(
CometConf.COMET_ENABLED.key -> "true",
CometConf.COMET_EXEC_ENABLED.key -> "true",
Contributor:

I am not sure if we should enable COMET_EXEC_ENABLED as it will mix the scan benchmark and exec benchmark

@wForget (Member, Author):

> I am not sure if we should enable COMET_EXEC_ENABLED as it will mix the scan benchmark and exec benchmark

It seems difficult to benchmark only the scan anyway. If we disable exec conversion, it may introduce the ColumnarToRow performance loss.

@kazuyukitanimura kazuyukitanimura merged commit 49fa287 into apache:main Mar 25, 2025
68 checks passed
@kazuyukitanimura (Contributor):

Merged, thanks @wForget

Successfully merging this pull request may close these issues.

- Enable hdfs test(s) in ci