@vatj vatj commented Aug 1, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

gibchikafa and others added 30 commits June 11, 2025 09:48
…lid_return_type compatible with Python only client
### What changes were proposed in this pull request?

This is a back-port of apache#51043.

This PR changes `InMemoryFileIndex#equals` to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, `InMemoryFileIndex#equals` considers the following two collections of root paths to be equal, even though they represent a different number of rows:
```
["/tmp/test", "/tmp/test"]
["/tmp/test", "/tmp/test", "/tmp/test"]
```

### Why are the changes needed?

The bug can cause correctness issues, e.g.
```
// create test data
val data = Seq((1, 2), (2, 3)).toDF("a", "b")
data.write.mode("overwrite").csv("/tmp/test")

val fileList1 = List.fill(2)("/tmp/test")
val fileList2 = List.fill(3)("/tmp/test")

val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // correctly returns 6

// the following is the same as above, except df1 is persisted
val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // incorrectly returns 4!!
```
In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is the same (e.g., `Set("/tmp/test") == Set("/tmp/test")`), the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan.
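
The difference between the two comparisons can be shown without Spark. The following is a minimal sketch, not the actual `InMemoryFileIndex` code: `RootPathsDemo` and its method names are hypothetical, and comparing sorted sequences is just one assumed way to compare non-distinct collections while ignoring order.

```scala
// Hypothetical illustration; not the actual InMemoryFileIndex implementation.
object RootPathsDemo {
  // Distinct comparison: duplicates are collapsed before comparing.
  def equalAsSets(a: Seq[String], b: Seq[String]): Boolean =
    a.toSet == b.toSet

  // Non-distinct comparison (assumed approach): duplicates are kept, order is ignored.
  def equalAsMultisets(a: Seq[String], b: Seq[String]): Boolean =
    a.sorted == b.sorted

  def main(args: Array[String]): Unit = {
    val fileList1 = List.fill(2)("/tmp/test")
    val fileList2 = List.fill(3)("/tmp/test")
    println(equalAsSets(fileList1, fileList2))      // the two lists look identical
    println(equalAsMultisets(fileList1, fileList2)) // 2 paths vs. 3 paths are told apart
  }
}
```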

The same bug also causes inappropriate exchange reuse.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51256 from bersprockets/multi_path_issue_br35.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…nd push

### What changes were proposed in this pull request?

This PR proposes to add the automatic release note process. Here is what it does:

1. Add download link to docs
  Inserts the new release version link into `documentation.md`, keeping versions sorted by recency.

2. Add download link to spark-website
  Updates `js/downloads.js` with the new version's metadata for the downloads page. Replaces existing entry if it's a patch; inserts new entry otherwise. Uses different package lists for Spark 3 vs. Spark 4.

3. Generate news & release notes
  Creates a news post and release notes file as below. Note that I skipped the short link generation step here.
    - For minor/major releases (x.y.0), describes new features
    - For patch/maintenance releases (x.y.z, z > 0), mentions stability fixes and encourages upgrades.

4. Build the Website
  Runs Jekyll to generate updated HTML files for the website.

5. Update latest symlink (only for major/minor)
  Updates the `site/docs/latest` symlink to point to the new version only if it's a major or minor release (x.y.0), so maintenance releases don’t affect the default documentation version.

If the release manager needs to have better release notes, they can create a separate PR to update this.
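
The x.y.0 versus x.y.z distinction used in steps 3 and 5 can be sketched as follows. This is only an illustration under assumptions: `ReleaseKind` and its methods are hypothetical names, and the actual release script may classify versions differently.

```scala
// Hypothetical helper; not taken from the release script.
object ReleaseKind {
  // Matches plain x.y.z versions such as "3.5.6" or "4.0.0".
  private val Version = """(\d+)\.(\d+)\.(\d+)""".r

  // True for x.y.0 (major/minor release), false for x.y.z with z > 0.
  def isMajorOrMinor(version: String): Boolean = version match {
    case Version(_, _, patch) => patch.toInt == 0
    case _ => throw new IllegalArgumentException(s"Unexpected version: $version")
  }

  // Only major/minor releases should move the `site/docs/latest` symlink.
  def shouldUpdateLatest(version: String): Boolean = isMajorOrMinor(version)
}
```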

### Why are the changes needed?

To make the release process easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually tested them on my Mac for now and checked that they are compatible with Ubuntu. They have to be tested again later in an official release.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51260 from HyukjinKwon/SPARK-52562.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 82ab680)
Signed-off-by: Hyukjin Kwon <[email protected]>
…ly when size matches

### What changes were proposed in this pull request?

A follow-up for apache#51043 that sorts paths in `InMemoryFileIndex#equals` only when the sizes match.

### Why are the changes needed?

Avoid potential perf regression.
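
The idea can be sketched as a size check that short-circuits before any sorting happens. This is a minimal illustration with hypothetical names (`SizeGuardDemo`, `sameRootPaths`), not the actual `InMemoryFileIndex` code.

```scala
// Hypothetical illustration of "sort only when the sizes match".
object SizeGuardDemo {
  // `&&` short-circuits: when the sizes differ, neither sequence is sorted.
  def sameRootPaths(a: Seq[String], b: Seq[String]): Boolean =
    a.size == b.size && a.sorted == b.sorted
}
```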

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Existing test from apache#51043

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51263 from yaooqinn/SPARK-52339.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 1cfe07c)
Signed-off-by: Kent Yao <[email protected]>
… finalize step

### What changes were proposed in this pull request?

This PR proposes to make the release script support preview releases as well.

### Why are the changes needed?

To make the release easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested against spark-website.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51291 from HyukjinKwon/SPARK-52584.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 5432402)
Signed-off-by: Hyukjin Kwon <[email protected]>
…v/test-dependencies.sh`

Cherry-pick apache#51273 to branch 3.5

### What changes were proposed in this pull request?

Fix the `exec-maven-plugin` version used by `dev/test-dependencies.sh` to use the `exec-maven-plugin.version` defined in `pom.xml` instead of the hardcoded old version (which does not work with Maven 4).

### Why are the changes needed?

Keep toolchain version consistency, and prepare for Maven 4 support.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Run `./dev/test-dependencies.sh`
```
...
++ build/mvn help:evaluate -Dexpression=exec-maven-plugin.version -q -DforceStdout
++ grep -E '[0-9]+\.[0-9]+\.[0-9]+'
Using `mvn` from path: /Users/chengpan/Projects/apache-spark-3.5/build/apache-maven-3.9.6/bin/mvn
+ MVN_EXEC_PLUGIN_VERSION=3.1.0
+ set +e
++ build/mvn -q -Dexec.executable=echo '-Dexec.args=${project.version}' --non-recursive org.codehaus.mojo:exec-maven-plugin:3.1.0:exec
...
```
And pass GHA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51288 from pan3793/SPARK-52568-3.5.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…oveRedundantAliases…

… configuration

### What changes were proposed in this pull request?

This PR fixes the added version of `spark.sql.optimizer.excludeSubqueryRefsFromRemoveRedundantAliases.enabled` to 3.5.1 (also in [SPARK-52611]).

### Why are the changes needed?

To show the correct version added.

### Does this PR introduce _any_ user-facing change?

Yes but only in the unreleased branches. It will change the version shown in SQL documentation.

### How was this patch tested?

Not tested. Jenkins will test it out.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51318 from atongpu/SPARK-52611.

Authored-by: Dongpu Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 9fcacba)
Signed-off-by: Hyukjin Kwon <[email protected]>
…kListenerEvent

### What changes were proposed in this pull request?

JsonProtocol tidy up. Only parse JSON relating to Spark events.
https://issues.apache.org/jira/browse/SPARK-52381

### Why are the changes needed?

Tidier code and https://lists.apache.org/thread/9zwkdo85wcdfppgqvbhjly8wdgf595yp

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51323 from pjfanning/SPARK-52381-br3.5.

Authored-by: PJ Fanning <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…uct] from udaf

### What changes were proposed in this pull request?

This fix ensures that defining a udaf returning an `Option[Product]` produces correct results. Currently, it throws an exception, segfaults, or produces incorrect results.

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Fixes a correctness issue.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#50827 from eejbyfeldt/SPARK-52023.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
(cherry picked from commit 5e6e8f1)
Signed-off-by: Herman van Hovell <[email protected]>
…[Product] from udaf

### What changes were proposed in this pull request?

This fix ensures that defining a udaf returning an `Option[Product]` produces correct results. Currently, it throws an exception, segfaults, or produces incorrect results.

### Why are the changes needed?

Fix correctness issue.

### Does this PR introduce _any_ user-facing change?

Fixes a correctness issue.

### How was this patch tested?

Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51347 from eejbyfeldt/3.5-SPARK-52023.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 1.9.7.

### Why are the changes needed?

To bring the latest bug fixes.
- apache/orc#2226

Here is the full release note.
- https://github.com/apache/orc/releases/tag/v1.9.7
- https://orc.apache.org/news/2025/07/04/ORC-1.9.7/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51336 from dongjoon-hyun/orc197.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…g execution errors

### What changes were proposed in this pull request?

This PR makes CACHE TABLE commands atomic when execution errors are encountered.

### Why are the changes needed?

For now, when an AnalysisException occurs, no cache or view is created, but when an execution error occurs, a view or an erroneous 'cache' is created.

### Does this PR introduce _any_ user-facing change?

Yes, but it's a bug fix. It only affects the rare corner case in which a user relies on this bug to create an erroneous 'cache' or view for some particular purpose.

### How was this patch tested?
new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#51386 from yaooqinn/SPARK-52684-35.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…versions

### What changes were proposed in this pull request?

This PR proposes to remove preview postfix when looking up the JIRA versions

### Why are the changes needed?

Otherwise, preview builds fail.
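
A minimal sketch of such a normalization is shown below. It assumes preview versions look like `4.0.0-preview1`; that format, and the names `JiraVersion` and `stripPreview`, are assumptions for illustration and are not taken from the release script.

```scala
// Hypothetical helper; the version format is an assumption.
object JiraVersion {
  // Drops a trailing "-preview" marker (with an optional number) so the
  // remaining x.y.z string can be used for the JIRA version lookup.
  def stripPreview(version: String): String =
    version.replaceAll("-preview\\d*$", "")
}
```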

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51399 from HyukjinKwon/SPARK-52707.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 51bbae0)
Signed-off-by: Hyukjin Kwon <[email protected]>
vatj and others added 17 commits September 1, 2025 13:48
### What changes were proposed in this pull request?
backport apache@871fe3d

There is a race condition between `CachedRDDBuilder.cachedColumnBuffers` and `CachedRDDBuilder.clearCache`: when they interleave with each other, `cachedColumnBuffers` might return `null`.

This looks like a day-1 bug introduced from apache@20ca208#diff-4068fce361a50e3d32af2ba2d4231905f500e7b2da9f46d5ddd99b758c30fd43
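
The kind of interleaving described above can be sketched with a simplified lazily built cache. This is a hypothetical illustration, not the actual `CachedRDDBuilder` code or the actual fix: `RacyBuilder`, `SafeBuilder`, and the `String` payload are stand-ins chosen for brevity.

```scala
// Hypothetical simplification; not the actual CachedRDDBuilder code.
class RacyBuilder(build: () => String) {
  @volatile private var cached: String = null

  // Check-then-act: another thread may run clear() after the null check
  // (or after the build) but before the final read, so this can return null.
  def get(): String = {
    if (cached == null) {
      synchronized { if (cached == null) cached = build() }
    }
    cached
  }

  def clear(): Unit = synchronized { cached = null }
}

class SafeBuilder(build: () => String) {
  private var cached: String = null

  // The value is built and read while holding the same lock that clear()
  // takes, so a concurrent clear() cannot make this call return null.
  def get(): String = synchronized {
    if (cached == null) cached = build()
    cached
  }

  def clear(): Unit = synchronized { cached = null }
}
```

In a single thread both classes behave the same; the difference only shows up when `get()` and `clear()` run on different threads, which is why such a race is hard to reproduce deterministically.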

### Why are the changes needed?
The race condition might lead to NPE from [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L303) which is basically a null `RDD` returned from `CachedRDDBuilder.cachedColumnBuffers`

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Theoretically, this race condition can be triggered whenever cache materialization and unpersisting happen on different threads, but there is no reliable way to construct a unit test for it.

### Was this patch authored or co-authored using generative AI tooling?
NO

Closes apache#52199 from liuzqt/SPARK-53435-3.5.

Authored-by: ziqi liu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to fix the download links for preview releases in the news post when releasing.

### Why are the changes needed?

To have the correct download links for previews when they are released.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52208 from HyukjinKwon/fix-download-links.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 6476dbc)
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to remove todos (that are tested).

### Why are the changes needed?

To record what has and has not been tested, for development purposes.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52207 from HyukjinKwon/remove-what-is-tested.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit fb22d37)
Signed-off-by: Hyukjin Kwon <[email protected]>
…asing

### What changes were proposed in this pull request?

This PR proposes to remove `preview` postfix in `documentation.md` when releasing

### Why are the changes needed?

To be consistent. `preview` postfix is not needed, see https://github.com/apache/spark-website/blob/asf-site/documentation.md

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52206 from HyukjinKwon/remove-preview-postfix.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 41c4346)
Signed-off-by: Hyukjin Kwon <[email protected]>
…tml files

### Why are the changes needed?

By default, `page.redirect.to` resolves against the absolute site root. This PR makes it relative to the docs instead.

### Does this PR introduce _any_ user-facing change?
doc fix

Check https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-docs/_site/ for https://dist.apache.org/repos/dist/dev/spark/v4.0.1-rc1-docs/_site/building-with-maven.html

### How was this patch tested?
build docs locally.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#52217 from yaooqinn/SPARK-53472.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
(cherry picked from commit 7635204)
Signed-off-by: Kent Yao <[email protected]>