forked from apache/spark
[HWORKS-2203] Spark 3.5 #43
Open: vatj wants to merge 87 commits into branch-3.5 from HWORKS-2203-vatj
…lid_return_type compatible with Python only client
### What changes were proposed in this pull request?

This is a back-port of apache#51043.

This PR changes `InMemoryFileIndex#equals` to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, `InMemoryFileIndex#equals` considers the following two collections of root paths to be equal, even though they represent a different number of rows:

```
["/tmp/test", "/tmp/test"]
["/tmp/test", "/tmp/test", "/tmp/test"]
```

### Why are the changes needed?

The bug can cause correctness issues, e.g.

```
// create test data
val data = Seq((1, 2), (2, 3)).toDF("a", "b")
data.write.mode("overwrite").csv("/tmp/test")

val fileList1 = List.fill(2)("/tmp/test")
val fileList2 = List.fill(3)("/tmp/test")

val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // correctly returns 6

// the following is the same as above, except df1 is persisted
val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // incorrectly returns 4!!
```

In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is the same (e.g., `Set("/tmp/test") == Set("/tmp/test")`), the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan.

The same bug also causes inappropriate exchange reuse.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51256 from bersprockets/multi_path_issue_br35.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
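The difference between the two comparisons in the commit message above can be reproduced with plain Scala collections, without Spark. This is only a sketch: `RootPathEquality`, `equalAsSet`, and `equalAsMultiset` are hypothetical names, and the code is not the actual `InMemoryFileIndex` implementation.

```scala
// Hypothetical stand-in for the root-path comparison; not Spark's actual code.
object RootPathEquality {
  // Behaviour described as buggy: duplicates collapse when paths go into a Set.
  def equalAsSet(a: Seq[String], b: Seq[String]): Boolean =
    a.toSet == b.toSet

  // Behaviour described by the fix: duplicates count; order is ignored here
  // by sorting both sides (an assumption made for this sketch).
  def equalAsMultiset(a: Seq[String], b: Seq[String]): Boolean =
    a.sorted == b.sorted

  def main(args: Array[String]): Unit = {
    val two = List.fill(2)("/tmp/test")
    val three = List.fill(3)("/tmp/test")
    println(equalAsSet(two, three))      // the two lists look equal
    println(equalAsMultiset(two, three)) // the two lists differ
  }
}
```

With the set-based comparison the two path lists are indistinguishable, which matches the cached-plan reuse described in the commit message.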
…nd push

### What changes were proposed in this pull request?

This PR proposes to add the automatic release note process. Here is what it does:

1. Add download link to docs
   Inserts the new release version link into `documentation.md`, keeping versions sorted by recency.
2. Add download link to spark-website
   Updates `js/downloads.js` with the new version's metadata for the downloads page. Replaces the existing entry if it's a patch release; inserts a new entry otherwise. Uses different package lists for Spark 3 vs. Spark 4.
3. Generate news & release notes
   Creates a news post and release notes file as below. Note that I skipped the short link generation step here.
   - For minor/major releases (x.y.0), describes new features.
   - For patch/maintenance releases (x.y.z, z > 0), mentions stability fixes and encourages upgrades.
4. Build the website
   Runs Jekyll to generate updated HTML files for the website.
5. Update latest symlink (only for major/minor)
   Updates the `site/docs/latest` symlink to point to the new version only if it's a major or minor release (x.y.0), so maintenance releases don't affect the default documentation version.

If the release manager needs better release notes, they can create a separate PR to update them.

### Why are the changes needed?

To make the release process easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually tested them on my Mac for now, and checked that they are compatible with Ubuntu. They have to be tested again later in the official release.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51260 from HyukjinKwon/SPARK-52562.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 82ab680)
Signed-off-by: Hyukjin Kwon <[email protected]>
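Steps 3 and 5 above both depend on telling a major/minor release (x.y.0) apart from a maintenance release (x.y.z, z > 0). A minimal sketch of that check follows; `ReleaseKind` and `isMajorOrMinor` are hypothetical names, the release script itself is not shown in this PR page, and only plain three-part versions are handled (preview suffixes are out of scope for this sketch).

```scala
// Hypothetical helper; the actual release script is not reproduced here.
object ReleaseKind {
  // True for plain x.y.0 versions, false for x.y.z with z > 0
  // and for anything that is not exactly three dot-separated parts.
  def isMajorOrMinor(version: String): Boolean =
    version.split('.') match {
      case Array(_, _, patch) => patch == "0"
      case _                  => false
    }
}
```

Under the rule in step 5, only a version for which this returns true would move the `site/docs/latest` symlink.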
…ly when size matches

### What changes were proposed in this pull request?

A follow-up for apache#51043 that sorts paths in `InMemoryFileIndex#equals` only when the sizes match.

### Why are the changes needed?

Avoid a potential perf regression.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test from apache#51043

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51263 from yaooqinn/SPARK-52339.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 1cfe07c)
Signed-off-by: Kent Yao <[email protected]>
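The idea in the follow-up above, comparing sizes before paying for a sort, can be sketched with plain Scala collections. `SizeCheckedEquality` and `samePaths` are hypothetical names used for illustration; this is not the code in `InMemoryFileIndex`.

```scala
// Hypothetical sketch of "sort only when size matches"; not Spark's actual code.
object SizeCheckedEquality {
  // && short-circuits, so the two sorts run only when the sizes are equal.
  def samePaths(a: Seq[String], b: Seq[String]): Boolean =
    a.size == b.size && a.sorted == b.sorted
}
```

When the path lists differ in length, the comparison returns without sorting either side, which is the cost the commit message aims to avoid.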
… finalize step

### What changes were proposed in this pull request?

This PR proposes to make the release script support preview releases as well.

### Why are the changes needed?

To make the release easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested against spark-website.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51291 from HyukjinKwon/SPARK-52584.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5432402)
Signed-off-by: Hyukjin Kwon <[email protected]>
…v/test-dependencies.sh`

Cherry-pick apache#51273 to branch 3.5

### What changes were proposed in this pull request?

Fix `exec-maven-plugin` version used by `dev/test-dependencies.sh` to use the `exec-maven-plugin.version` defined in `pom.xml`, instead of the hardcoded old version (which does not work with Maven 4).

### Why are the changes needed?

Keep toolchain version consistency, and prepare for Maven 4 support.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `./dev/test-dependencies.sh`

```
...
++ build/mvn help:evaluate -Dexpression=exec-maven-plugin.version -q -DforceStdout
++ grep -E '[0-9]+\.[0-9]+\.[0-9]+'
Using `mvn` from path: /Users/chengpan/Projects/apache-spark-3.5/build/apache-maven-3.9.6/bin/mvn
+ MVN_EXEC_PLUGIN_VERSION=3.1.0
+ set +e
++ build/mvn -q -Dexec.executable=echo '-Dexec.args=${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:3.1.0:exec
...
```

And pass GHA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51288 from pan3793/SPARK-52568-3.5.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…oveRedundantAliases… … configuration

### What changes were proposed in this pull request?

This PR corrects the "added in" version of `spark.sql.optimizer.excludeSubqueryRefsFromRemoveRedundantAliases.enabled` to 3.5.1 (also in [SPARK-52611]).

### Why are the changes needed?

To show the correct version in which the configuration was added.

### Does this PR introduce _any_ user-facing change?

Yes, but only in the unreleased branches. It will change the version shown in the SQL documentation.

### How was this patch tested?

Not tested. Jenkins will test it out.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51318 from atongpu/SPARK-52611.

Authored-by: Dongpu Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 9fcacba)
Signed-off-by: Hyukjin Kwon <[email protected]>
…kListenerEvent

### What changes were proposed in this pull request?

JsonProtocol tidy up. Only parse JSON relating to Spark events.
https://issues.apache.org/jira/browse/SPARK-52381

### Why are the changes needed?

Tidier code and https://lists.apache.org/thread/9zwkdo85wcdfppgqvbhjly8wdgf595yp

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51323 from pjfanning/SPARK-52381-br3.5.

Authored-by: PJ Fanning <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…uct] from udaf

### What changes were proposed in this pull request?

This fix makes a udaf that returns an `Option[Product]` produce correct results. With the current behavior, it throws an exception, segfaults, or produces incorrect results.

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Fixes a correctness issue.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#50827 from eejbyfeldt/SPARK-52023.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
(cherry picked from commit 5e6e8f1)
Signed-off-by: Herman van Hovell <[email protected]>
…ion[Product] from udaf"

This reverts commit 3f2a3ba.
…[Product] from udaf

### What changes were proposed in this pull request?

This fix makes a udaf that returns an `Option[Product]` produce correct results. With the current behavior, it throws an exception, segfaults, or produces incorrect results.

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Fixes a correctness issue.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51347 from eejbyfeldt/3.5-SPARK-52023.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 1.9.7.

### Why are the changes needed?

To bring the latest bug fixes.
- apache/orc#2226

Here is the full release note.
- https://github.com/apache/orc/releases/tag/v1.9.7
- https://orc.apache.org/news/2025/07/04/ORC-1.9.7/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51336 from dongjoon-hyun/orc197.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…g execution errors

### What changes were proposed in this pull request?

This PR makes CACHE TABLE commands atomic when they encounter execution errors.

### Why are the changes needed?

For now, when an AnalysisException occurs, no cache or view is created, but when an execution error occurs, a view or an erroneous 'cache' is created.

### Does this PR introduce _any_ user-facing change?

Yes, but it's a bugfix. It only affects the rare corner case in which a user relies on this bug to create an erroneous 'cache'/view for some particular purpose.

### How was this patch tested?

New tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51386 from yaooqinn/SPARK-52684-35.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…versions

### What changes were proposed in this pull request?

This PR proposes to remove the preview postfix when looking up the JIRA versions.

### Why are the changes needed?

Otherwise, preview builds fail.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51399 from HyukjinKwon/SPARK-52707.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 51bbae0)
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

Backport of apache@871fe3d

There is a race condition between `CachedRDDBuilder.cachedColumnBuffers` and `CachedRDDBuilder.clearCache`: when the two calls interleave, `cachedColumnBuffers` might return `null`.

This looks like a day-1 bug introduced in apache@20ca208#diff-4068fce361a50e3d32af2ba2d4231905f500e7b2da9f46d5ddd99b758c30fd43

### Why are the changes needed?

The race condition might lead to an NPE from [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L303), which is basically a null `RDD` returned from `CachedRDDBuilder.cachedColumnBuffers`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Theoretically, this race condition can be triggered whenever cache materialization and unpersistence happen on different threads, but there is no reliable way to construct a unit test for it.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52199 from liuzqt/SPARK-53435-3.5.

Authored-by: ziqi liu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
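The commit message above does not show how the race was fixed. As an illustration only, one common way to keep a lazily built value from being observed as `null` while another thread clears it is to guard both operations with the same lock. `CachedBuffers`, `getOrBuild`, and `clear` are hypothetical names, and this sketch makes no claim about what `CachedRDDBuilder` actually does.

```scala
// Hypothetical sketch of a get-or-build / clear pair guarded by one lock.
// Not the CachedRDDBuilder implementation.
final class CachedBuffers[T](build: () => T) {
  private var cached: Option[T] = None

  // Returns the cached value, building it first if needed. Because the
  // whole body holds the lock, a concurrent clear() cannot run between
  // the check and the return, so callers never see a missing value.
  def getOrBuild(): T = synchronized {
    cached.getOrElse {
      val value = build()
      cached = Some(value)
      value
    }
  }

  // Drops the cached value; the next getOrBuild() rebuilds it.
  def clear(): Unit = synchronized {
    cached = None
  }
}
```

The point of the sketch is only that the read path returns a value it has already captured under the lock, rather than re-reading a field that another thread may have reset.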
### What changes were proposed in this pull request?

This PR proposes to fix the download links for preview releases in the news post generated when releasing.

### Why are the changes needed?

To have the correct download links for previews when they are released.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52208 from HyukjinKwon/fix-download-links.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 6476dbc)
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to remove todos (that are tested).

### Why are the changes needed?

To note what's tested or not for development.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52207 from HyukjinKwon/remove-what-is-tested.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit fb22d37)
Signed-off-by: Hyukjin Kwon <[email protected]>
…asing

### What changes were proposed in this pull request?

This PR proposes to remove the `preview` postfix in `documentation.md` when releasing.

### Why are the changes needed?

To be consistent. The `preview` postfix is not needed, see https://github.com/apache/spark-website/blob/asf-site/documentation.md

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52206 from HyukjinKwon/remove-preview-postfix.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 41c4346)
Signed-off-by: Hyukjin Kwon <[email protected]>
…tml files

### Why are the changes needed?

`page.redirect.to` defaults to pages under the absolute site root. This PR changes it to be relative to the docs.

### Does this PR introduce _any_ user-facing change?

Doc fix. Check https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-docs/_site/ for https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-docs/_site/building-with-maven.html

### How was this patch tested?

Built the docs locally.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52217 from yaooqinn/SPARK-53472.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 7635204)
Signed-off-by: Kent Yao <[email protected]>
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.