
Commit 56a0f4c

Merge pull request #4300 from NvTimLiu/release-tmp
Merge source to main branch from branch-21.12 [skip ci]
2 parents 63dabac + a4aec42 commit 56a0f4c

429 files changed: +17260 −9012 lines changed

Large commits have some content hidden by default, so only a subset of the 429 changed files is rendered below.

.github/workflows/auto-merge.yml

+4-4
````diff
@@ -18,7 +18,7 @@ name: auto-merge HEAD to BASE
 on:
   pull_request_target:
     branches:
-      - branch-21.10
+      - branch-21.12
     types: [closed]
 
 jobs:
@@ -29,13 +29,13 @@ jobs:
     steps:
       - uses: actions/checkout@v2
         with:
-          ref: branch-21.10 # force to fetch from latest upstream instead of PR ref
+          ref: branch-21.12 # force to fetch from latest upstream instead of PR ref
 
       - name: auto-merge job
         uses: ./.github/workflows/auto-merge
         env:
          OWNER: NVIDIA
          REPO_NAME: spark-rapids
-          HEAD: branch-21.10
-          BASE: branch-21.12
+          HEAD: branch-21.12
+          BASE: branch-22.02
          AUTOMERGE_TOKEN: ${{ secrets.AUTOMERGE_TOKEN }} # use to merge PR
````

.github/workflows/auto-merge/automerge

+15
````diff
@@ -90,6 +90,21 @@ def auto_merge(number, sha):
 ```
 {r.json()}
 ```
+
+Please use the following steps to fix the merge conflicts manually:
+```
+# Assume upstream is NVIDIA/spark-rapids remote
+git fetch upstream {HEAD} {BASE}
+git checkout -b fix-auto-merge-conflict-{number} upstream/{BASE}
+git merge upstream/{HEAD}
+# Fix any merge conflicts caused by this merge
+git commit -am "Merge {HEAD} into {BASE}"
+git push <personal fork> fix-auto-merge-conflict-{number}
+# Open a PR targets NVIDIA/spark-rapids {BASE}
+```
+**IMPORTANT:** Before merging this PR, be sure to change the merging strategy to `Create a merge commit` (repo admin only).
+
+Once this PR is merged, the auto-merge PR should automatically be closed since it contains the same commit hashes
 """)
 print(f'status code: {r.status_code}')
 print(r.json())
````
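For illustration only (not part of the commit): with the placeholders filled in from the workflow change above (`HEAD=branch-21.12`, `BASE=branch-22.02`) and a hypothetical PR number of 4300, the comment posted by the script would expand to roughly:

```shell script
# A sketch of the rendered conflict-fix steps; the branch names come from the
# workflow change above, and PR number 4300 is used purely as an example value
git fetch upstream branch-21.12 branch-22.02
git checkout -b fix-auto-merge-conflict-4300 upstream/branch-22.02
git merge upstream/branch-21.12
# Fix any merge conflicts caused by this merge
git commit -am "Merge branch-21.12 into branch-22.02"
git push <personal fork> fix-auto-merge-conflict-4300
# Then open a PR that targets NVIDIA/spark-rapids branch-22.02
```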

.github/workflows/blossom-ci.yml

+1
````diff
@@ -63,6 +63,7 @@ jobs:
            zhanga5,\
            nvliyuan,\
            res-life,\
+           HaoYang670,\
            ', format('{0},', github.actor)) && github.event.comment.body == 'build'
     steps:
       - name: Check if comment is issued by authorized person
````

CHANGELOG.md

+278-2
Large diffs are not rendered by default.

CONTRIBUTING.md

+127-28
````diff
@@ -50,39 +50,56 @@ You can find all available build versions in the top level pom.xml file. If you
 for Databricks then you should use the `jenkins/databricks/build.sh` script and modify it for
 the version you want.
 
-To get an uber jar with more than 1 version you have to `mvn install` each version
-and then use one of the defined profiles in the dist module. See the next section
-for more details.
+To get an uber jar with more than 1 version you have to `mvn package` each version
+and then use one of the defined profiles in the dist module, or a comma-separated list of
+build versions. See the next section for more details.
 
 ### Building a Distribution for Multiple Versions of Spark
 
 By default the distribution jar only includes code for a single version of Spark. If you want
-to create a jar with multiple versions we currently have 4 options.
+to create a jar with multiple versions we have the following options.
 
 1. Build for all Apache Spark versions and CDH with no SNAPSHOT versions of Spark, only released. Use `-PnoSnapshots`.
 2. Build for all Apache Spark versions and CDH including SNAPSHOT versions of Spark we have supported for. Use `-Psnapshots`.
 3. Build for all Apache Spark versions, CDH and Databricks with no SNAPSHOT versions of Spark, only released. Use `-PnoSnaphsotsWithDatabricks`.
 4. Build for all Apache Spark versions, CDH and Databricks including SNAPSHOT versions of Spark we have supported for. Use `-PsnapshotsWithDatabricks`
+5. Build for an arbitrary combination of comma-separated build versions using `-Dincluded_buildvers=<CSV list of build versions>`.
+   E.g., `-Dincluded_buildvers=312,330`
 
-You must first build and install each of the versions of Spark and then build one final time using the profile for the option you want.
-
-There is a build script `build/buildall` to build everything with snapshots and this will have more options to build later.
+You must first build each of the versions of Spark and then build one final time using the profile for the option you want.
 
 You can also install some manually and build a combined jar. For instance to build non-snapshot versions:
 
 ```shell script
-mvn -Dbuildver=301 clean install -DskipTests
-mvn -Dbuildver=302 clean install -Drat.skip=true -DskipTests
-mvn -Dbuildver=303 clean install -Drat.skip=true -DskipTests
-mvn -Dbuildver=311 clean install -Drat.skip=true -DskipTests
-mvn -Dbuildver=312 clean install -Drat.skip=true -DskipTests
-mvn -Dbuildver=311cdh clean install -Drat.skip=true -DskipTests
+mvn clean
+mvn -Dbuildver=301 install -DskipTests
+mvn -Dbuildver=302 install -Drat.skip=true -DskipTests
+mvn -Dbuildver=303 install -Drat.skip=true -DskipTests
+mvn -Dbuildver=311 install -Drat.skip=true -DskipTests
+mvn -Dbuildver=312 install -Drat.skip=true -DskipTests
+mvn -Dbuildver=320 install -Drat.skip=true -DskipTests
+mvn -Dbuildver=311cdh install -Drat.skip=true -DskipTests
 mvn -pl dist -PnoSnapshots package -DskipTests
 ```
+#### Building with buildall script
+
+There is a build script `build/buildall` that automates the local build process. Use
+`./build/buildall --help` for up-to-date use information.
+
+By default, it builds everything that is needed to create a distribution jar for all released (noSnapshots) Spark versions except for Databricks. Other profiles that you can pass using `--profile=<distribution profile>` include
+- `snapshots`
+- `minimumFeatureVersionMix` that currently includes 302, 311cdh, 312, 320 is recommended for catching incompatibilities already in the local development cycle
+
+For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version. e.g., `--profile=301` for Spark 3.0.1.
+
+The option `--module=<module>` allows to limit the number of build steps. When iterating, we often don't have the need for the entire build. We may be interested in building everything necessary just to run integration tests (`--module=integration_tests`), or we may want to just rebuild the distribution jar (`--module=dist`)
+
+By default, `buildall` builds up to 4 shims in parallel using `xargs -P <n>`. This can be adjusted by
+specifying the environment variable `BUILD_PARALLEL=<n>`.
 
 ### Building against different CUDA Toolkit versions
 
-You can build against different versions of the CUDA Toolkit by using one of the following profiles:
+You can build against different versions of the CUDA Toolkit by using one of the following profiles:
 * `-Pcuda11` (CUDA 11.0/11.1/11.2, default)
 
 ## Code contributions
````
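For concreteness, here is a sketch of how the two additions above fit together: option 5's `-Dincluded_buildvers` on the dist module, and the `buildall` options (`--profile`, `--module`, `BUILD_PARALLEL`). The exact flag combinations below are assumptions pieced together from the text of this diff, not commands copied from the commit:

```shell script
# Sketch: install two shims, then assemble a dist jar containing only those two
# (mirrors option 5; 312 and 320 are versions that appear in the mvn list above)
mvn -Dbuildver=312 install -Drat.skip=true -DskipTests
mvn -Dbuildver=320 install -Drat.skip=true -DskipTests
mvn -pl dist package -DskipTests -Dincluded_buildvers=312,320

# Sketch: use buildall to rebuild only the dist jar, two shims at a time
# (--profile value and flag ordering are assumptions)
BUILD_PARALLEL=2 ./build/buildall --profile=noSnapshots --module=dist
```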
````diff
@@ -98,6 +115,13 @@ dedicated shim modules.
 Thus, the conventional source code root directories `src/main/<language>` contain the files that
 are source-compatible with all supported Spark releases, both upstream and vendor-specific.
 
+The following acronyms may appear in directory names:
+
+|Acronym|Definition  |Example|Example Explanation                           |
+|-------|------------|-------|----------------------------------------------|
+|cdh    |Cloudera CDH|311cdh |Cloudera CDH Spark based on Apache Spark 3.1.1|
+|db     |Databricks  |312db  |Databricks Spark based on Spark 3.1.2         |
+
 The version-specific directory names have one of the following forms / use cases:
 - `src/main/312/scala` contains Scala source code for a single Spark version, 3.1.2 in this case
 - `src/main/312+-apache/scala`contains Scala source code for *upstream* **Apache** Spark builds,
````
````diff
@@ -107,14 +131,18 @@ The version-specific directory names have one of the following forms / use cases
   3.1.2 *exclusive*
 - `src/main/302to312-cdh` contains code that applies to Cloudera CDH shims between 3.0.2 *inclusive*,
   3.1.2 *inclusive*
+- `src/main/pre320-treenode` contains shims for the Catalyst `TreeNode` class before the
+  [children trait specialization in Apache Spark 3.2.0](https://issues.apache.org/jira/browse/SPARK-34906).
+- `src/main/post320-treenode` contains shims for the Catalyst `TreeNode` class after the
+  [children trait specialization in Apache Spark 3.2.0](https://issues.apache.org/jira/browse/SPARK-34906).
 
 
 ### Setting up an Integrated Development Environment
 
-Our project currently uses `build-helper-maven-plugin` for shimming against conflicting definitions of superclasses
-in upstream versions that cannot be resolved without significant code duplication otherwise. To this end different
-source directories with differently implemented same-named classes are
-[added](https://www.mojohaus.org/build-helper-maven-plugin/add-source-mojo.html)
+Our project currently uses `build-helper-maven-plugin` for shimming against conflicting definitions of superclasses
+in upstream versions that cannot be resolved without significant code duplication otherwise. To this end different
+source directories with differently implemented same-named classes are
+[added](https://www.mojohaus.org/build-helper-maven-plugin/add-source-mojo.html)
 for compilation depending on the targeted Spark version.
 
 This may require some modifications to IDEs' standard Maven import functionality.
````
````diff
@@ -123,27 +151,98 @@ This may require some modifications to IDEs' standard Maven import functionality.
 
 _Last tested with 2021.2.1 Community Edition_
 
-To start working with the project in IDEA is as easy as
-[opening](https://blog.jetbrains.com/idea/2008/03/opening-maven-projects-is-easy-as-pie/) the top level (parent)
-[pom.xml](pom.xml).
+To start working with the project in IDEA is as easy as
+[opening](https://blog.jetbrains.com/idea/2008/03/opening-maven-projects-is-easy-as-pie/) the top level (parent)
+[pom.xml](pom.xml).
 
 In order to make sure that IDEA handles profile-specific source code roots within a single Maven module correctly,
 [unselect](https://www.jetbrains.com/help/idea/2021.2/maven-importing.html) "Keep source and test folders on reimport".
 
 If you develop a feature that has to interact with the Shim layer or simply need to test the Plugin with a different
 Spark version, open [Maven tool window](https://www.jetbrains.com/help/idea/2021.2/maven-projects-tool-window.html) and
-select one of the `release3xx` profiles (e.g, `release320`) for Apache Spark 3.2.0, and click "Reload"
+select one of the `release3xx` profiles (e.g, `release320`) for Apache Spark 3.2.0, and click "Reload"
 if not triggered automatically.
 
-There is a known issue with the shims/spark3xx submodules. After being enabled once, a module such as shims/spark312
+There is a known issue with the shims/spark3xx submodules. After being enabled once, a module such as shims/spark312
 may remain active in IDEA even though you explicitly disable the Maven profile `release312` in the Maven tool window.
-With an extra IDEA shim module loaded the IDEA internal build "Build->Build Project" is likely to fail
+With an extra IDEA shim module loaded the IDEA internal build "Build->Build Project" is likely to fail
 (whereas it has no adverse effect on Maven build). As a workaround, locate the pom.xml under the extraneous IDEA module,
 right-click on it and select "Maven->Ignore Projects".
 
 If you see Scala symbols unresolved (highlighted red) in IDEA please try the following steps to resolve it:
-- Make sure there are no relevant poms in "File->Settings->Build Tools->Maven->Ignored Files"
-- Restart IDEA and click "Reload All Maven Projects" again
+- Make sure there are no relevant poms in "File->Settings->Build Tools->Maven->Ignored Files"
+- Restart IDEA and click "Reload All Maven Projects" again
+
+#### Bloop Build Server
+
+[Bloop](https://scalacenter.github.io/bloop/) is a build server and a set of tools around Build
+Server Protocol (BSP) for Scala providing an integration path with IDEs that support it. In fact,
+you can generate a Bloop project from Maven just for the Maven modules and profiles you are
+interested in. For example, to generate the Bloop projects for the Spark 3.2.0 dependency
+just for the production code run:
+
+```shell script
+mvn install ch.epfl.scala:maven-bloop_2.13:1.4.9:bloopInstall -pl aggregator -am \
+    -DdownloadSources=true \
+    -Dbuildver=320 \
+    -DskipTests \
+    -Dskip \
+    -Dmaven.javadoc.skip \
+    -Dmaven.scalastyle.skip=true \
+    -Dmaven.updateconfig.skip=true
+```
+
+With `--generate-bloop` we integrated Bloop project generation into `buildall`. It makes it easier
+to generate projects for multiple Spark dependencies using the same profiles as our regular build.
+It makes sure that the project files belonging to different Spark dependencies are
+not clobbered by repeated `bloopInstall` Maven plugin invocations, and it uses
+[jq](https://stedolan.github.io/jq/) to post-process JSON-formatted project files such that they
+compile project classes into non-overlapping set of output directories.
+
+You can now open the spark-rapids as a
+[BSP project in IDEA](https://www.jetbrains.com/help/idea/bsp-support.html)
+
+# Bloop, Scala Metals, and Visual Studio Code
+
+_Last tested with 1.63.0-insider (Universal) Commit: bedf867b5b02c1c800fbaf4d6ce09cefba_
+
+Another, and arguably more popular, use of Bloop arises in connection with
+[Scala Metals](https://scalameta.org/metals/) and [VS Code](https://code.visualstudio.com/).
+Scala Metals implements the
+[Language Server Protocol (LSP)](https://microsoft.github.io/language-server-protocol/) for Scala,
+and enables features such as context-aware autocomplete, and code browsing between Scala symbol
+definitions, references and vice versa. LSP is supported by many editors including Vim and Emacs.
+
+Here we document the integration with VS Code. It makes development on a remote node almost
+as easy as local development, which comes very handy when working in Cloud environments.
+
+Run `./build/buildall --generate-bloop --profile=<profile>` to generate Bloop projects
+for required Spark dependencies, e.g. `--profile=320` for Spark 3.2.0. When developing
+remotely this is done on the remote node.
+
+Install [Scala Metals extension](https://scalameta.org/metals/docs/editors/vscode) in VS Code,
+either locally or into a Remote-SSH extension destination depending on your target environment.
+When your project folder is open in VS Code, it may prompt you to import Maven project.
+IMPORTANT: always decline with "Don't ask again", otherwise it will overwrite the Bloop projects
+generated with the default `301` profile. If you need to use a different profile, always rerun the
+command above manually. When regenerating projects it's recommended to proceed to Metals
+"Build commands" View, and click:
+1. "Restart build server"
+1. "Clean compile workspace"
+to avoid stale class files.
+
+Now you should be able to see Scala class members in the Explorer's Outline view and in the
+Breadcrumbs view at the top of the Editor with a Scala file open.
+
+Check Metals logs, "Run Doctor", etc if something is not working as expected. You can also verify
+that the Bloop build server and the Metals language server are running by executing `jps` in the
+Terminal window:
+```shell script
+jps -l
+72960 sun.tools.jps.Jps
+72356 bloop.Server
+72349 scala.meta.metals.Main
+```
 
 #### Other IDEs
 We welcome pull requests with tips how to setup your favorite IDE!
````
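Putting the Bloop and Metals pieces above together, the remote-development flow sketches out as follows; `--profile=320` is simply the example profile used in the text, and the `jps` output will of course differ per machine:

```shell script
# On the remote node: generate Bloop projects for the Spark 3.2.0 dependency
./build/buildall --generate-bloop --profile=320

# After installing the Scala Metals extension (locally or via Remote-SSH) and
# opening the project folder in VS Code, confirm the servers are running:
jps -l   # expect bloop.Server and scala.meta.metals.Main in the listing
```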
````diff
@@ -254,7 +353,7 @@ Please visit the [testing doc](tests/README.md) for details about how to run tests.
 
 ### Pre-commit hooks
 We provide a basic config `.pre-commit-config.yaml` for [pre-commit](https://pre-commit.com/) to
-automate some aspects of the development process. As a convenience you can enable automatic
+automate some aspects of the development process. As a convenience you can enable automatic
 copyright year updates by following the installation instructions on the
 [pre-commit homepage](https://pre-commit.com/).
 
````
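As a hedged example, "following the installation instructions" on the pre-commit homepage typically amounts to the standard two-step setup below; these commands are the stock pre-commit workflow, not text taken from this commit:

```shell script
# Install the pre-commit tool and register the hooks from .pre-commit-config.yaml
pip install pre-commit
pre-commit install
# Optionally run all configured hooks against the whole tree once
pre-commit run --all-files
```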

````diff
@@ -299,7 +398,7 @@ manually trigger it by commenting `build`. It includes following steps,
 1. Mergeable check
 2. Blackduck vulnerability scan
 3. Fetch merged code (merge the pull request HEAD into BASE branch, e.g. fea-001 into branch-x)
-4. Run `mvn verify` and unit tests for multiple Spark versions in parallel.
+4. Run `mvn verify` and unit tests for multiple Spark versions in parallel.
 Ref: [spark-premerge-build.sh](jenkins/spark-premerge-build.sh)
 
 If it fails, you can click the `Details` link of this check, and go to `Upload log -> Jenkins log for pull request xxx (click here)` to
````
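To approximate step 4 locally before triggering the premerge build, one could run something like the following; this is only a sketch assuming a single `-Dbuildver` reactor build is enough for a quick check, whereas the premerge script runs multiple Spark versions in parallel:

```shell script
# Hypothetical local spot-check against one Spark version (e.g. 3.2.0)
mvn -Dbuildver=320 verify
```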

LICENSE

+2-2
````diff
@@ -178,15 +178,15 @@
    APPENDIX: How to apply the Apache License to your work.
 
       To apply the Apache License to your work, attach the following
-      boilerplate notice, with the fields enclosed by brackets "{}"
+      boilerplate notice, with the fields enclosed by brackets "[]"
       replaced with your own identifying information. (Don't include
       the brackets!) The text should be enclosed in the appropriate
       comment syntax for the file format. We also recommend that a
       file or class name and description of purpose be included on the
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright 2019 nvspark
+   Copyright [yyyy] [name of copyright owner]
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
````
