From 47b1306b7b5fb8f043a29310dd3b23713d8eef4f Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun, 19 Oct 2025 15:31:08 +0000 Subject: [PATCH 1/6] Bump black in /dev in the pip group across 1 directory Bumps the pip group with 1 update in the /dev directory: [black](https://github.com/psf/black). Updates `black` from 23.12.1 to 24.3.0 - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/compare/23.12.1...24.3.0) --- updated-dependencies: - dependency-name: black dependency-version: 24.3.0 dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] --- dev/requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/requirements.txt b/dev/requirements.txt index 40e7fa46cf14b..8dec699878933 100644 --- a/dev/requirements.txt +++ b/dev/requirements.txt @@ -57,7 +57,7 @@ jira>=3.5.2 PyGithub # pandas API on Spark Code formatter. -black==23.12.1 +black==24.3.0 py # Spark Connect (required) From 3db5efb1bd13cba354551cf1f6b46e5f9d52c53d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:08:22 +0000 Subject: [PATCH 2/6] Initial plan From 87a1a2206b13d27f4dc4efdc16d4957ebed44c17 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:22:55 +0000 Subject: [PATCH 3/6] Add comprehensive documentation files for Spark architecture and modules Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- ARCHITECTURE.md | 280 ++++++++++++++++++++++ DEVELOPMENT.md | 462 +++++++++++++++++++++++++++++++++++++ bin/README.md | 453 ++++++++++++++++++++++++++++++++++++ common/README.md | 472 +++++++++++++++++++++++++++++++++++++ core/README.md | 360 +++++++++++++++++++++++++++++ examples/README.md | 432 ++++++++++++++++++++++++++++++++++ graphx/README.md | 549 ++++++++++++++++++++++++++++++++++++++++++++ mllib/README.md | 514 +++++++++++++++++++++++++++++++++++++++++ streaming/README.md | 430 ++++++++++++++++++++++++++++++++++ 9 files changed, 3952 insertions(+) create mode 100644 ARCHITECTURE.md create mode 100644 DEVELOPMENT.md create mode 100644 bin/README.md create mode 100644 common/README.md create mode 100644 core/README.md create mode 100644 examples/README.md create mode 100644 graphx/README.md create mode 100644 mllib/README.md create mode 100644 streaming/README.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 0000000000000..ca8e58920596d --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,280 @@ +# Apache Spark Architecture + +This document provides an overview of the Apache Spark architecture and its key components. + +## Table of Contents + +- [Overview](#overview) +- [Core Components](#core-components) +- [Execution Model](#execution-model) +- [Key Subsystems](#key-subsystems) +- [Data Flow](#data-flow) +- [Module Structure](#module-structure) + +## Overview + +Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. + +### Design Principles + +1. **Unified Engine**: Single system for batch processing, streaming, machine learning, and graph processing +2. 
**In-Memory Computing**: Leverages RAM for fast iterative algorithms and interactive queries +3. **Lazy Evaluation**: Operations are not executed until an action is called +4. **Fault Tolerance**: Resilient Distributed Datasets (RDDs) provide automatic fault recovery +5. **Scalability**: Scales from a single machine to thousands of nodes + +## Core Components + +### 1. Spark Core + +The foundation of the Spark platform, providing: + +- **Task scheduling and dispatch** +- **Memory management** +- **Fault recovery** +- **Interaction with storage systems** +- **RDD API** - The fundamental data abstraction + +Location: `core/` directory + +Key classes: +- `SparkContext`: Main entry point for Spark functionality +- `RDD`: Resilient Distributed Dataset, the fundamental data structure +- `DAGScheduler`: Schedules stages based on DAG of operations +- `TaskScheduler`: Launches tasks on executors + +### 2. Spark SQL + +Module for structured data processing with: + +- **DataFrame and Dataset APIs** +- **SQL query engine** +- **Data source connectors** (Parquet, JSON, JDBC, etc.) +- **Catalyst optimizer** for query optimization + +Location: `sql/` directory + +Key components: +- Query parsing and analysis +- Logical and physical query planning +- Code generation for efficient execution +- Catalog management + +### 3. Spark Streaming + +Framework for scalable, high-throughput, fault-tolerant stream processing: + +- **DStreams** (Discretized Streams) - Legacy API +- **Structured Streaming** - Modern streaming API built on Spark SQL + +Location: `streaming/` directory + +Key features: +- Micro-batch processing model +- Exactly-once semantics +- Integration with Kafka, Flume, Kinesis, and more + +### 4. MLlib (Machine Learning Library) + +Scalable machine learning library providing: + +- **Classification and regression** +- **Clustering** +- **Collaborative filtering** +- **Dimensionality reduction** +- **Feature extraction and transformation** +- **ML Pipelines** for building workflows + +Location: `mllib/` and `mllib-local/` directories + +### 5. GraphX + +Graph processing framework with: + +- **Graph abstraction** built on top of RDDs +- **Graph algorithms** (PageRank, connected components, triangle counting, etc.) +- **Pregel-like API** for iterative graph computations + +Location: `graphx/` directory + +## Execution Model + +### Spark Application Lifecycle + +1. **Initialization**: User creates a `SparkContext` or `SparkSession` +2. **Job Submission**: Actions trigger job submission to the DAG scheduler +3. **Stage Creation**: DAG scheduler breaks jobs into stages based on shuffle boundaries +4. **Task Scheduling**: Task scheduler assigns tasks to executors +5. **Execution**: Executors run tasks and return results +6. 
**Result Collection**: Results are collected back to the driver or written to storage + +### Driver and Executors + +- **Driver Program**: Runs the main() function and creates SparkContext + - Converts user program into tasks + - Schedules tasks on executors + - Maintains metadata about the application + +- **Executors**: Processes that run on worker nodes + - Run tasks assigned by the driver + - Store data in memory or disk + - Return results to the driver + +### Cluster Managers + +Spark supports multiple cluster managers: + +- **Standalone**: Built-in cluster manager +- **Apache YARN**: Hadoop's resource manager +- **Apache Mesos**: General-purpose cluster manager +- **Kubernetes**: Container orchestration platform + +Location: `resource-managers/` directory + +## Key Subsystems + +### Memory Management + +Spark manages memory in several regions: + +1. **Execution Memory**: For shuffles, joins, sorts, and aggregations +2. **Storage Memory**: For caching and broadcasting data +3. **User Memory**: For user data structures and metadata +4. **Reserved Memory**: System reserved memory + +Configuration: Unified memory management allows dynamic allocation between execution and storage. + +### Shuffle Subsystem + +Handles data redistribution across partitions: + +- **Shuffle Write**: Map tasks write data to local disk +- **Shuffle Read**: Reduce tasks fetch data from map outputs +- **Shuffle Service**: External shuffle service for improved reliability + +Location: `core/src/main/scala/org/apache/spark/shuffle/` + +### Storage Subsystem + +Manages cached data and intermediate results: + +- **Block Manager**: Manages storage of data blocks +- **Memory Store**: In-memory cache +- **Disk Store**: Disk-based storage +- **Off-Heap Storage**: Direct memory storage + +Location: `core/src/main/scala/org/apache/spark/storage/` + +### Serialization + +Efficient serialization is critical for performance: + +- **Java Serialization**: Default, but slower +- **Kryo Serialization**: Faster and more compact (recommended) +- **Custom Serializers**: For specific data types + +Location: `core/src/main/scala/org/apache/spark/serializer/` + +## Data Flow + +### Transformation and Action Pipeline + +1. **Transformations**: Lazy operations that define a new RDD/DataFrame + - Examples: `map`, `filter`, `join`, `groupBy` + - Build up a DAG of operations + +2. **Actions**: Operations that trigger computation + - Examples: `count`, `collect`, `save`, `reduce` + - Cause DAG execution + +3. **Stages**: Groups of tasks that can be executed together + - Separated by shuffle operations + - Pipeline operations within a stage + +4. 
**Tasks**: Unit of work sent to executors + - One task per partition + - Execute transformations and return results + +## Module Structure + +### Project Organization + +``` +spark/ +├── assembly/ # Builds the final Spark assembly JAR +├── bin/ # User-facing command-line scripts +├── build/ # Build-related scripts +├── common/ # Common utilities shared across modules +├── conf/ # Configuration file templates +├── connector/ # External data source connectors +├── core/ # Spark Core engine +├── data/ # Sample data for examples +├── dev/ # Development scripts and tools +├── docs/ # Documentation source files +├── examples/ # Example programs +├── graphx/ # Graph processing library +├── hadoop-cloud/ # Cloud storage integration +├── launcher/ # Application launcher +├── mllib/ # Machine learning library (RDD-based) +├── mllib-local/ # Local ML algorithms +├── python/ # PySpark - Python API +├── R/ # SparkR - R API +├── repl/ # Interactive Scala shell +├── resource-managers/ # Cluster manager integrations +├── sbin/ # Admin scripts for cluster management +├── sql/ # Spark SQL and DataFrames +├── streaming/ # Streaming processing +└── tools/ # Various utility tools +``` + +### Module Dependencies + +- **Core**: Foundation for all other modules +- **SQL**: Depends on Core, used by Streaming, MLlib +- **Streaming**: Depends on Core and SQL +- **MLlib**: Depends on Core and SQL +- **GraphX**: Depends on Core +- **Python/R**: Language bindings to Core APIs + +## Building and Testing + +For detailed build instructions, see [building-spark.md](docs/building-spark.md). + +Quick start: +```bash +# Build Spark +./build/mvn -DskipTests clean package + +# Run tests +./dev/run-tests + +# Run specific module tests +./build/mvn test -pl core +``` + +## Performance Tuning + +Key areas for optimization: + +1. **Memory Configuration**: Adjust executor memory and memory fractions +2. **Parallelism**: Set appropriate partition counts +3. **Serialization**: Use Kryo for better performance +4. **Caching**: Cache frequently accessed data +5. **Broadcast Variables**: Efficiently distribute large read-only data +6. **Data Locality**: Ensure tasks run close to their data + +See [tuning.md](docs/tuning.md) for detailed tuning guidelines. + +## Contributing + +See [CONTRIBUTING.md](CONTRIBUTING.md) and the [contributing guide](https://spark.apache.org/contributing.html) for information on how to contribute to Apache Spark. + +## Further Reading + +- [Programming Guide](docs/programming-guide.md) +- [SQL Programming Guide](docs/sql-programming-guide.md) +- [Structured Streaming Guide](docs/structured-streaming-programming-guide.md) +- [MLlib Guide](docs/ml-guide.md) +- [GraphX Guide](docs/graphx-programming-guide.md) +- [Cluster Overview](docs/cluster-overview.md) +- [Configuration](docs/configuration.md) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md new file mode 100644 index 0000000000000..2e5baeb6e0d36 --- /dev/null +++ b/DEVELOPMENT.md @@ -0,0 +1,462 @@ +# Spark Development Guide + +This guide provides information for developers working on Apache Spark. 
+ +## Table of Contents + +- [Getting Started](#getting-started) +- [Development Environment](#development-environment) +- [Building Spark](#building-spark) +- [Testing](#testing) +- [Code Style](#code-style) +- [IDE Setup](#ide-setup) +- [Debugging](#debugging) +- [Working with Git](#working-with-git) +- [Common Development Tasks](#common-development-tasks) + +## Getting Started + +### Prerequisites + +- Java 17 or Java 21 (for Spark 4.x) +- Maven 3.9.9 or later +- Python 3.9+ (for PySpark development) +- R 4.0+ (for SparkR development) +- Git + +### Initial Setup + +1. **Clone the repository:** + ```bash + git clone https://github.com/apache/spark.git + cd spark + ``` + +2. **Build Spark:** + ```bash + ./build/mvn -DskipTests clean package + ``` + +3. **Verify the build:** + ```bash + ./bin/spark-shell + ``` + +## Development Environment + +### Directory Structure + +``` +spark/ +├── assembly/ # Final assembly JAR creation +├── bin/ # User command scripts (spark-submit, spark-shell, etc.) +├── build/ # Build scripts and Maven wrapper +├── common/ # Common utilities and modules +├── conf/ # Configuration templates +├── core/ # Spark Core +├── dev/ # Development tools (run-tests, lint, etc.) +├── docs/ # Documentation (Jekyll-based) +├── examples/ # Example programs +├── python/ # PySpark implementation +├── R/ # SparkR implementation +├── sbin/ # Admin scripts (start-all.sh, stop-all.sh, etc.) +├── sql/ # Spark SQL +└── [other modules] +``` + +### Key Development Directories + +- `dev/`: Contains scripts for testing, linting, and releasing +- `dev/run-tests`: Main test runner +- `dev/lint-*`: Various linting tools +- `build/mvn`: Maven wrapper script + +## Building Spark + +### Full Build + +```bash +# Build all modules, skip tests +./build/mvn -DskipTests clean package + +# Build with specific Hadoop version +./build/mvn -Phadoop-3.4 -DskipTests clean package + +# Build with Hive support +./build/mvn -Phive -Phive-thriftserver -DskipTests package +``` + +### Module-Specific Builds + +```bash +# Build only core module +./build/mvn -pl core -DskipTests package + +# Build core and its dependencies +./build/mvn -pl core -am -DskipTests package + +# Build SQL module +./build/mvn -pl sql/core -am -DskipTests package +``` + +### Build Profiles + +Common Maven profiles: + +- `-Phadoop-3.4`: Build with Hadoop 3.4 +- `-Pyarn`: Include YARN support +- `-Pkubernetes`: Include Kubernetes support +- `-Phive`: Include Hive support +- `-Phive-thriftserver`: Include Hive Thrift Server +- `-Pscala-2.13`: Build with Scala 2.13 + +### Fast Development Builds + +For faster iteration during development: + +```bash +# Skip Scala and Java style checks +./build/mvn -DskipTests -Dcheckstyle.skip package + +# Build specific module quickly +./build/mvn -pl sql/core -am -DskipTests -Dcheckstyle.skip package +``` + +## Testing + +### Running All Tests + +```bash +# Run all tests (takes several hours) +./dev/run-tests + +# Run tests for specific modules +./dev/run-tests --modules sql +``` + +### Running Specific Test Suites + +#### Scala/Java Tests + +```bash +# Run all tests in a module +./build/mvn test -pl core + +# Run a specific test suite +./build/mvn test -pl core -Dtest=SparkContextSuite + +# Run specific test methods +./build/mvn test -pl core -Dtest=SparkContextSuite#testJobInterruption +``` + +#### Python Tests + +```bash +# Run all PySpark tests +cd python && python run-tests.py + +# Run specific test file +cd python && python -m pytest pyspark/tests/test_context.py + +# Run specific test method +cd python 
&& python -m pytest pyspark/tests/test_context.py::SparkContextTests::test_stop +``` + +#### R Tests + +```bash +cd R +R CMD check --no-manual --no-build-vignettes spark +``` + +### Test Coverage + +```bash +# Generate coverage report +./build/mvn clean install -DskipTests +./dev/run-tests --coverage +``` + +## Code Style + +### Scala Code Style + +Spark uses Scalastyle for Scala code checking: + +```bash +# Check Scala style +./dev/lint-scala + +# Auto-format (if scalafmt is configured) +./build/mvn scala:format +``` + +Key style guidelines: +- 2-space indentation +- Max line length: 100 characters +- Follow [Scala style guide](https://docs.scala-lang.org/style/) + +### Java Code Style + +Java code follows Google Java Style: + +```bash +# Check Java style +./dev/lint-java +``` + +Key guidelines: +- 2-space indentation +- Max line length: 100 characters +- Use Java 17+ features appropriately + +### Python Code Style + +PySpark follows PEP 8: + +```bash +# Check Python style +./dev/lint-python + +# Auto-format with black (if available) +cd python && black pyspark/ +``` + +Key guidelines: +- 4-space indentation +- Max line length: 100 characters +- Type hints encouraged for new code + +## IDE Setup + +### IntelliJ IDEA + +1. **Import Project:** + - File → Open → Select `pom.xml` + - Choose "Open as Project" + - Import Maven projects automatically + +2. **Configure JDK:** + - File → Project Structure → Project SDK → Select Java 17 or 21 + +3. **Recommended Plugins:** + - Scala plugin + - Python plugin + - Maven plugin + +4. **Code Style:** + - Import Spark code style from `dev/scalastyle-config.xml` + +### Visual Studio Code + +1. **Recommended Extensions:** + - Scala (Metals) + - Python + - Maven for Java + +2. **Workspace Settings:** + ```json + { + "java.configuration.maven.userSettings": ".mvn/settings.xml", + "python.linting.enabled": true, + "python.linting.pylintEnabled": true + } + ``` + +### Eclipse + +1. **Import Project:** + - File → Import → Maven → Existing Maven Projects + +2. **Install Plugins:** + - Scala IDE + - Maven Integration + +## Debugging + +### Debugging Scala/Java Code + +#### Using IDE Debugger + +1. Run tests with debugging enabled in your IDE +2. Set breakpoints in source code +3. Run test in debug mode + +#### Command Line Debugging + +```bash +# Enable remote debugging +export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" +./bin/spark-shell +``` + +Then attach your IDE debugger to port 5005. + +### Debugging PySpark + +```bash +# Enable Python debugging +export PYSPARK_PYTHON=python +export PYSPARK_DRIVER_PYTHON=python + +# Run with pdb +python -m pdb your_spark_script.py +``` + +### Logging + +Adjust log levels in `conf/log4j2.properties`: + +```properties +# Set root logger level +rootLogger.level = info + +# Set specific logger +logger.spark.name = org.apache.spark +logger.spark.level = debug +``` + +## Working with Git + +### Branch Naming + +- Feature branches: `feature/description` +- Bug fixes: `fix/issue-number-description` +- Documentation: `docs/description` + +### Commit Messages + +Follow conventional commit format: + +``` +[SPARK-XXXXX] Brief description (max 72 chars) + +Detailed description of the change, motivation, and impact. + +- Bullet points for specific changes +- Reference related issues + +Closes #XXXXX +``` + +### Creating Pull Requests + +1. **Fork the repository** on GitHub +2. **Create a feature branch** from master +3. **Make your changes** with clear commits +4. **Push to your fork** +5. 
**Open a Pull Request** with: + - Clear title and description + - Link to JIRA issue if applicable + - Unit tests for new functionality + - Documentation updates if needed + +### Code Review + +- Address review comments promptly +- Keep discussions professional and constructive +- Be open to suggestions and improvements + +## Common Development Tasks + +### Adding a New Configuration + +1. Define config in appropriate config file (e.g., `sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala`) +2. Document the configuration +3. Add tests +4. Update documentation in `docs/configuration.md` + +### Adding a New API + +1. Implement the API with proper documentation +2. Add comprehensive unit tests +3. Update relevant documentation +4. Consider backward compatibility +5. Add deprecation notices if replacing old APIs + +### Adding a New Data Source + +1. Implement `DataSourceV2` interface +2. Add read/write support +3. Include integration tests +4. Document usage in `docs/sql-data-sources-*.md` + +### Performance Optimization + +1. Identify bottleneck with profiling +2. Create benchmark to measure improvement +3. Implement optimization +4. Verify performance gain +5. Ensure no functionality regression + +### Updating Dependencies + +1. Check for security vulnerabilities +2. Test compatibility +3. Update version in `pom.xml` +4. Update `LICENSE` and `NOTICE` files if needed +5. Run full test suite + +## Useful Commands + +```bash +# Clean build artifacts +./build/mvn clean + +# Skip Scalastyle checks +./build/mvn -Dscalastyle.skip package + +# Generate API documentation +./build/mvn scala:doc + +# Check for dependency updates +./build/mvn versions:display-dependency-updates + +# Profile a build +./build/mvn clean package -Dprofile + +# Run Spark locally with different memory +./bin/spark-shell --driver-memory 4g --executor-memory 4g +``` + +## Troubleshooting + +### Build Issues + +- **Out of Memory**: Increase Maven memory with `export MAVEN_OPTS="-Xmx4g"` +- **Compilation errors**: Clean build with `./build/mvn clean` +- **Version conflicts**: Update local Maven repo: `./build/mvn -U package` + +### Test Failures + +- Run single test to isolate issue +- Check for environment-specific problems +- Review logs in `target/` directories +- Enable debug logging for more detail + +### IDE Issues + +- Reimport Maven project +- Invalidate caches and restart +- Check SDK and language level settings + +## Resources + +- [Apache Spark Website](https://spark.apache.org/) +- [Spark Developer Tools](https://spark.apache.org/developer-tools.html) +- [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK) +- [Spark Mailing Lists](https://spark.apache.org/community.html#mailing-lists) +- [Spark JIRA](https://issues.apache.org/jira/projects/SPARK) + +## Getting Help + +- Ask questions on [user@spark.apache.org](mailto:user@spark.apache.org) +- Report bugs on [JIRA](https://issues.apache.org/jira/projects/SPARK) +- Discuss on [dev@spark.apache.org](mailto:dev@spark.apache.org) +- Chat on the [Spark Slack](https://spark.apache.org/community.html) + +## Contributing Back + +See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed contribution guidelines. + +Remember: Quality over quantity. Well-tested, documented changes are more valuable than large, poorly understood patches. 
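As a companion to the "Adding a New Configuration" task described above, the sketch below shows the builder pattern used for entries in `SQLConf.scala`. The config key, default value, version, and description here are hypothetical placeholders; check the current `SQLConf` source for the exact builder API before copying it.

```scala
// Hypothetical config entry following the SQLConf builder pattern
// (the key, doc string, version, and default below are placeholders).
val MY_FEATURE_ENABLED = buildConf("spark.sql.myFeature.enabled")
  .doc("Enables the hypothetical my-feature code path.")
  .version("4.1.0")
  .booleanConf
  .createWithDefault(false)
```

Once an entry like this is defined, it still needs a unit test and a matching entry in `docs/configuration.md`, as outlined in the checklist above.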
diff --git a/bin/README.md b/bin/README.md new file mode 100644 index 0000000000000..e83fbf583746c --- /dev/null +++ b/bin/README.md @@ -0,0 +1,453 @@ +# Spark Binary Scripts + +This directory contains user-facing command-line scripts for running Spark applications and interactive shells. + +## Overview + +These scripts provide convenient entry points for: +- Running Spark applications +- Starting interactive shells (Scala, Python, R, SQL) +- Managing Spark clusters +- Utility operations + +## Main Scripts + +### spark-submit + +Submit Spark applications to a cluster. + +**Usage:** +```bash +./bin/spark-submit \ + --class \ + --master \ + --deploy-mode \ + --conf = \ + ... # other options + \ + [application-arguments] +``` + +**Examples:** +```bash +# Run on local mode with 4 cores +./bin/spark-submit --class org.example.App --master local[4] app.jar + +# Run on YARN cluster +./bin/spark-submit --class org.example.App --master yarn --deploy-mode cluster app.jar + +# Run Python application +./bin/spark-submit --master local[2] script.py + +# Run with specific memory and executor settings +./bin/spark-submit \ + --master spark://master:7077 \ + --executor-memory 4G \ + --total-executor-cores 8 \ + --class org.example.App \ + app.jar +``` + +**Key Options:** +- `--master`: Master URL (local, spark://, yarn, k8s://, mesos://) +- `--deploy-mode`: client or cluster +- `--class`: Application main class (for Java/Scala) +- `--name`: Application name +- `--jars`: Additional JARs to include +- `--packages`: Maven coordinates of packages +- `--conf`: Spark configuration property +- `--driver-memory`: Driver memory (e.g., 1g, 2g) +- `--executor-memory`: Executor memory +- `--executor-cores`: Cores per executor +- `--num-executors`: Number of executors (YARN only) + +See [submitting-applications.md](../docs/submitting-applications.md) for complete documentation. + +### spark-shell + +Interactive Scala shell with Spark support. + +**Usage:** +```bash +./bin/spark-shell [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/spark-shell + +# Connect to remote cluster +./bin/spark-shell --master spark://master:7077 + +# With specific memory +./bin/spark-shell --driver-memory 4g + +# With additional packages +./bin/spark-shell --packages org.apache.spark:spark-avro_2.13:3.5.0 +``` + +**In the shell:** +```scala +scala> val data = spark.range(1000) +scala> data.count() +res0: Long = 1000 + +scala> spark.read.json("data.json").show() +``` + +### pyspark + +Interactive Python shell with PySpark support. + +**Usage:** +```bash +./bin/pyspark [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/pyspark + +# Connect to remote cluster +./bin/pyspark --master spark://master:7077 + +# With specific Python version +PYSPARK_PYTHON=python3.11 ./bin/pyspark +``` + +**In the shell:** +```python +>>> df = spark.range(1000) +>>> df.count() +1000 + +>>> spark.read.json("data.json").show() +``` + +### sparkR + +Interactive R shell with SparkR support. + +**Usage:** +```bash +./bin/sparkR [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/sparkR + +# Connect to remote cluster +./bin/sparkR --master spark://master:7077 +``` + +**In the shell:** +```r +> df <- createDataFrame(iris) +> head(df) +> count(df) +``` + +### spark-sql + +Interactive SQL shell for running SQL queries. 
+ +**Usage:** +```bash +./bin/spark-sql [options] +``` + +**Examples:** +```bash +# Start SQL shell +./bin/spark-sql + +# Connect to Hive metastore +./bin/spark-sql --conf spark.sql.warehouse.dir=/path/to/warehouse + +# Run SQL file +./bin/spark-sql -f query.sql + +# Execute inline query +./bin/spark-sql -e "SELECT * FROM table" +``` + +**In the shell:** +```sql +spark-sql> CREATE TABLE test (id INT, name STRING); +spark-sql> INSERT INTO test VALUES (1, 'Alice'), (2, 'Bob'); +spark-sql> SELECT * FROM test; +``` + +### run-example + +Run Spark example programs. + +**Usage:** +```bash +./bin/run-example [params] +``` + +**Examples:** +```bash +# Run SparkPi example +./bin/run-example SparkPi 100 + +# Run with specific master +MASTER=spark://master:7077 ./bin/run-example SparkPi + +# Run SQL example +./bin/run-example sql.SparkSQLExample +``` + +## Utility Scripts + +### spark-class + +Internal script to run Spark classes. Usually not called directly by users. + +**Usage:** +```bash +./bin/spark-class [options] +``` + +### load-spark-env.sh + +Loads Spark environment variables from conf/spark-env.sh. Sourced by other scripts. + +## Configuration + +Scripts read configuration from: + +1. **Environment variables**: Set in shell or `conf/spark-env.sh` +2. **Command-line options**: Passed via `--conf` or specific flags +3. **Configuration files**: `conf/spark-defaults.conf` + +### Common Environment Variables + +```bash +# Java +export JAVA_HOME=/path/to/java + +# Spark +export SPARK_HOME=/path/to/spark +export SPARK_MASTER_HOST=master-hostname +export SPARK_MASTER_PORT=7077 + +# Python +export PYSPARK_PYTHON=python3 +export PYSPARK_DRIVER_PYTHON=python3 + +# Memory +export SPARK_DRIVER_MEMORY=2g +export SPARK_EXECUTOR_MEMORY=4g + +# Logging +export SPARK_LOG_DIR=/var/log/spark +``` + +Set these in `conf/spark-env.sh` for persistence. + +## Master URLs + +Scripts accept various master URL formats: + +- **local**: Run locally with one worker thread +- **local[K]**: Run locally with K worker threads +- **local[*]**: Run locally with as many worker threads as cores +- **spark://HOST:PORT**: Connect to Spark standalone cluster +- **yarn**: Connect to YARN cluster +- **k8s://HOST:PORT**: Connect to Kubernetes cluster +- **mesos://HOST:PORT**: Connect to Mesos cluster + +## Advanced Usage + +### Configuring Logging + +Create `conf/log4j2.properties`: +```properties +rootLogger.level = info +logger.spark.name = org.apache.spark +logger.spark.level = warn +``` + +### Using with Jupyter Notebook + +```bash +# Set environment variables +export PYSPARK_DRIVER_PYTHON=jupyter +export PYSPARK_DRIVER_PYTHON_OPTS='notebook' + +# Start PySpark (opens Jupyter) +./bin/pyspark +``` + +### Connecting to Remote Clusters + +```bash +# Standalone cluster +./bin/spark-submit --master spark://master:7077 app.jar + +# YARN +./bin/spark-submit --master yarn --deploy-mode cluster app.jar + +# Kubernetes +./bin/spark-submit --master k8s://https://k8s-api:6443 \ + --deploy-mode cluster \ + --conf spark.kubernetes.container.image=spark:3.5.0 \ + app.jar +``` + +### Dynamic Resource Allocation + +```bash +./bin/spark-submit \ + --conf spark.dynamicAllocation.enabled=true \ + --conf spark.dynamicAllocation.minExecutors=1 \ + --conf spark.dynamicAllocation.maxExecutors=10 \ + app.jar +``` + +## Debugging + +### Enable Verbose Output + +```bash +./bin/spark-submit --verbose ... +``` + +### Check Spark Configuration + +```bash +./bin/spark-submit --class org.example.App app.jar 2>&1 | grep -i "spark\." 
+``` + +### Remote Debugging + +```bash +# Driver debugging +./bin/spark-submit \ + --conf spark.driver.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \ + app.jar + +# Executor debugging +./bin/spark-submit \ + --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006" \ + app.jar +``` + +## Security + +### Kerberos Authentication + +```bash +./bin/spark-submit \ + --principal user@REALM \ + --keytab /path/to/user.keytab \ + --master yarn \ + app.jar +``` + +### SSL Configuration + +```bash +./bin/spark-submit \ + --conf spark.ssl.enabled=true \ + --conf spark.ssl.keyStore=/path/to/keystore \ + --conf spark.ssl.keyStorePassword=password \ + app.jar +``` + +## Performance Tuning + +### Memory Configuration + +```bash +./bin/spark-submit \ + --driver-memory 4g \ + --executor-memory 8g \ + --conf spark.memory.fraction=0.8 \ + app.jar +``` + +### Parallelism + +```bash +./bin/spark-submit \ + --conf spark.default.parallelism=100 \ + --conf spark.sql.shuffle.partitions=200 \ + app.jar +``` + +### Serialization + +```bash +./bin/spark-submit \ + --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ + app.jar +``` + +## Troubleshooting + +### Common Issues + +**Java not found:** +```bash +export JAVA_HOME=/path/to/java +``` + +**Class not found:** +```bash +# Add dependencies +./bin/spark-submit --jars dependency.jar app.jar +``` + +**Out of memory:** +```bash +# Increase memory +./bin/spark-submit --driver-memory 8g --executor-memory 16g app.jar +``` + +**Connection refused:** +```bash +# Check master URL and firewall settings +# Verify master is running with: jps | grep Master +``` + +## Script Internals + +### Script Hierarchy + +``` +spark-submit +├── spark-class +│ └── load-spark-env.sh +└── Actual Java/Python execution +``` + +### How spark-submit Works + +1. Parse command-line arguments +2. Load configuration from `spark-defaults.conf` +3. Set up classpath and Java options +4. Call `spark-class` with appropriate arguments +5. Launch JVM with Spark application + +## Related Scripts + +For cluster management scripts, see [../sbin/README.md](../sbin/README.md). + +## Further Reading + +- [Submitting Applications](../docs/submitting-applications.md) +- [Spark Configuration](../docs/configuration.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Running on YARN](../docs/running-on-yarn.md) +- [Running on Kubernetes](../docs/running-on-kubernetes.md) + +## Examples + +More examples in [../examples/](../examples/). diff --git a/common/README.md b/common/README.md new file mode 100644 index 0000000000000..1d2890b14e6c2 --- /dev/null +++ b/common/README.md @@ -0,0 +1,472 @@ +# Spark Common Modules + +This directory contains common utilities and libraries shared across all Spark modules. + +## Overview + +The common modules provide foundational functionality used throughout Spark: + +- Network communication +- Memory management utilities +- Serialization helpers +- Configuration management +- Logging infrastructure +- Testing utilities + +These modules have no dependencies on Spark Core, allowing them to be used by any Spark component. + +## Modules + +### common/kvstore + +Key-value store abstraction for metadata storage. 
+ +**Purpose:** +- Store application metadata +- Track job and stage information +- Persist UI data + +**Location**: `kvstore/` + +**Key classes:** +- `KVStore`: Interface for key-value storage +- `LevelDB`: LevelDB-based implementation +- `InMemoryStore`: In-memory implementation for testing + +**Usage:** +```scala +val store = new LevelDB(path) +store.write(new StoreKey(id), value) +val data = store.read(classOf[ValueType], id) +``` + +### common/network-common + +Core networking abstractions and utilities. + +**Purpose:** +- RPC framework +- Block transfer protocol +- Network servers and clients + +**Location**: `network-common/` + +**Key components:** +- `TransportContext`: Network communication setup +- `TransportClient`: Network client +- `TransportServer`: Network server +- `MessageHandler`: Message processing +- `StreamManager`: Stream data management + +**Features:** +- Netty-based implementation +- Zero-copy transfers +- SSL/TLS support +- Flow control + +### common/network-shuffle + +Network shuffle service for serving shuffle data. + +**Purpose:** +- External shuffle service +- Serves shuffle blocks to executors +- Improves executor reliability + +**Location**: `network-shuffle/` + +**Key classes:** +- `ExternalShuffleService`: Standalone shuffle service +- `ExternalShuffleClient`: Client for fetching shuffle data +- `ShuffleBlockResolver`: Resolves shuffle block locations + +**Benefits:** +- Executors can be killed without losing shuffle data +- Better resource utilization +- Improved fault tolerance + +**Configuration:** +```properties +spark.shuffle.service.enabled=true +spark.shuffle.service.port=7337 +``` + +### common/network-yarn + +YARN-specific network integration. + +**Purpose:** +- Integration with YARN shuffle service +- YARN auxiliary service implementation + +**Location**: `network-yarn/` + +**Usage:** Automatically used when running on YARN with shuffle service enabled. + +### common/sketch + +Data sketching and approximate algorithms. + +**Purpose:** +- Memory-efficient approximate computations +- Probabilistic data structures + +**Location**: `sketch/` + +**Algorithms:** +- Count-Min Sketch: Frequency estimation +- Bloom Filter: Set membership testing +- HyperLogLog: Cardinality estimation + +**Usage:** +```scala +import org.apache.spark.util.sketch._ + +// Create bloom filter +val bf = BloomFilter.create(expectedItems, falsePositiveRate) +bf.put("item1") +bf.mightContain("item1") // true + +// Create count-min sketch +val cms = CountMinSketch.create(depth, width, seed) +cms.add("item", count) +val estimate = cms.estimateCount("item") +``` + +### common/tags + +Test tags for categorizing tests. + +**Purpose:** +- Tag tests for selective execution +- Categorize slow/flaky tests +- Enable/disable test groups + +**Location**: `tags/` + +**Example tags:** +- `@SlowTest`: Long-running tests +- `@ExtendedTest`: Extended test suite +- `@DockerTest`: Tests requiring Docker + +### common/unsafe + +Unsafe operations for performance-critical code. + +**Purpose:** +- Direct memory access +- Serialization without reflection +- Performance optimizations + +**Location**: `unsafe/` + +**Key classes:** +- `Platform`: Platform-specific operations +- `UnsafeAlignedOffset`: Aligned memory access +- Memory utilities for sorting and hashing + +**Warning:** These APIs are internal and subject to change. + +## Architecture + +### Layering + +``` +Spark Core / SQL / Streaming / MLlib + ↓ + Common Modules (network, kvstore, etc.) 
+ ↓ + JVM / Netty / OS +``` + +### Design Principles + +1. **No Spark Core dependencies**: Can be used independently +2. **Minimal external dependencies**: Reduce classpath conflicts +3. **High performance**: Optimized for throughput and latency +4. **Reusability**: Shared across all Spark components + +## Networking Architecture + +### Transport Layer + +The network-common module provides the foundation for all network communication in Spark. + +**Components:** + +1. **TransportContext**: Sets up network infrastructure +2. **TransportClient**: Sends requests and receives responses +3. **TransportServer**: Accepts connections and handles requests +4. **MessageHandler**: Processes incoming messages + +**Flow:** +``` +Client Server + | | + |------ Request Message ------->| + | | (Process in MessageHandler) + |<----- Response Message -------| + | | +``` + +### RPC Framework + +Built on top of the transport layer: + +```scala +// Server side +val rpcEnv = RpcEnv.create("name", host, port, conf) +val endpoint = new MyEndpoint(rpcEnv) +rpcEnv.setupEndpoint("my-endpoint", endpoint) + +// Client side +val ref = rpcEnv.setupEndpointRef("spark://host:port/my-endpoint") +val response = ref.askSync[Response](request) +``` + +### Block Transfer + +Optimized for transferring large data blocks: + +```scala +val blockTransferService = new NettyBlockTransferService(conf) +blockTransferService.fetchBlocks( + host, port, execId, blockIds, + blockFetchingListener, tempFileManager +) +``` + +## Building and Testing + +### Build Common Modules + +```bash +# Build all common modules +./build/mvn -pl 'common/*' -am package + +# Build specific module +./build/mvn -pl common/network-common -am package +``` + +### Run Tests + +```bash +# Run all common tests +./build/mvn test -pl 'common/*' + +# Run specific module tests +./build/mvn test -pl common/network-common + +# Run specific test +./build/mvn test -pl common/network-common -Dtest=TransportClientSuite +``` + +## Module Dependencies + +``` +common/unsafe (no dependencies) + ↓ +common/network-common + ↓ +common/network-shuffle + ↓ +common/network-yarn + +common/sketch (independent) +common/tags (independent) +common/kvstore (independent) +``` + +## Source Code Organization + +``` +common/ +├── kvstore/ # Key-value store +│ └── src/main/java/org/apache/spark/util/kvstore/ +├── network-common/ # Core networking +│ └── src/main/java/org/apache/spark/network/ +│ ├── client/ # Client implementation +│ ├── server/ # Server implementation +│ ├── buffer/ # Buffer management +│ ├── crypto/ # Encryption +│ ├── protocol/ # Protocol messages +│ └── util/ # Utilities +├── network-shuffle/ # Shuffle service +│ └── src/main/java/org/apache/spark/network/shuffle/ +├── network-yarn/ # YARN integration +│ └── src/main/java/org/apache/spark/network/yarn/ +├── sketch/ # Sketching algorithms +│ └── src/main/java/org/apache/spark/util/sketch/ +├── tags/ # Test tags +│ └── src/main/java/org/apache/spark/tags/ +└── unsafe/ # Unsafe operations + └── src/main/java/org/apache/spark/unsafe/ +``` + +## Performance Considerations + +### Zero-Copy Transfer + +Network modules use zero-copy techniques: +- FileRegion for file-based transfers +- Direct buffers to avoid copying +- Netty's native transport when available + +### Memory Management + +```java +// Use pooled buffers +ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT; +ByteBuf buffer = allocator.directBuffer(size); +try { + // Use buffer +} finally { + buffer.release(); +} +``` + +### Connection Pooling + +Clients reuse 
connections: +```java +TransportClientFactory factory = context.createClientFactory(); +TransportClient client = factory.createClient(host, port); +// Client is cached and reused +``` + +## Security + +### SSL/TLS Support + +Enable encryption in network communication: + +```properties +spark.ssl.enabled=true +spark.ssl.protocol=TLSv1.2 +spark.ssl.keyStore=/path/to/keystore +spark.ssl.keyStorePassword=password +spark.ssl.trustStore=/path/to/truststore +spark.ssl.trustStorePassword=password +``` + +### SASL Authentication + +Support for SASL-based authentication: + +```properties +spark.authenticate=true +spark.authenticate.secret=shared-secret +``` + +## Monitoring + +### Network Metrics + +Key metrics tracked: +- Active connections +- Bytes sent/received +- Request latency +- Connection failures + +**Access via Spark UI**: `http://:4040/metrics/` + +### Logging + +Enable detailed network logging: + +```properties +log4j.logger.org.apache.spark.network=DEBUG +log4j.logger.io.netty=DEBUG +``` + +## Configuration + +### Network Settings + +```properties +# Connection timeout +spark.network.timeout=120s + +# I/O threads +spark.network.io.numConnectionsPerPeer=1 + +# Buffer sizes +spark.network.io.preferDirectBufs=true + +# Maximum retries +spark.network.io.maxRetries=3 + +# Connection pooling +spark.rpc.numRetries=3 +spark.rpc.retry.wait=3s +``` + +### Shuffle Service + +```properties +spark.shuffle.service.enabled=true +spark.shuffle.service.port=7337 +spark.shuffle.service.index.cache.size=100m +``` + +## Best Practices + +1. **Reuse connections**: Don't create new clients unnecessarily +2. **Release buffers**: Always release ByteBuf instances +3. **Handle backpressure**: Implement flow control in handlers +4. **Enable encryption**: Use SSL for sensitive data +5. **Monitor metrics**: Track network performance +6. **Configure timeouts**: Set appropriate timeout values +7. **Use external shuffle service**: For production deployments + +## Troubleshooting + +### Connection Issues + +**Problem**: Connection refused or timeout + +**Solutions:** +- Check firewall settings +- Verify host and port +- Increase timeout values +- Check network connectivity + +### Memory Leaks + +**Problem**: Growing memory usage in network layer + +**Solutions:** +- Ensure ByteBuf.release() is called +- Check for unclosed connections +- Monitor Netty buffer pool metrics + +### Slow Performance + +**Problem**: High network latency + +**Solutions:** +- Enable native transport +- Increase I/O threads +- Adjust buffer sizes +- Check network bandwidth + +## Internal APIs + +**Note**: All classes in common modules are internal APIs and may change between versions. They are not part of the public Spark API. + +## Further Reading + +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) + +## Contributing + +For contributing to common modules, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +When adding functionality: +- Keep dependencies minimal +- Write comprehensive tests +- Document public methods +- Consider performance implications +- Maintain backward compatibility where possible diff --git a/core/README.md b/core/README.md new file mode 100644 index 0000000000000..4a5be68b0342e --- /dev/null +++ b/core/README.md @@ -0,0 +1,360 @@ +# Spark Core + +Spark Core is the foundation of the Apache Spark platform. It provides the basic functionality for distributed task dispatching, scheduling, and I/O operations. 
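As a quick orientation before the component overview, here is a minimal sketch of the Core entry point; the application name and the `local[*]` master are placeholders for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: create a SparkContext, run a parallel computation, and shut down.
object CoreQuickStart {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoreQuickStart").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Distribute a local range, transform it, and aggregate the result on the driver.
    val doubledSum = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"Sum of doubled values: $doubledSum")
    sc.stop()
  }
}
```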
+ +## Overview + +Spark Core contains the fundamental abstractions and components that all other Spark modules build upon: + +- **Resilient Distributed Datasets (RDDs)**: The fundamental data abstraction in Spark +- **SparkContext**: The main entry point for Spark functionality +- **Task Scheduling**: DAG scheduler and task scheduler for distributed execution +- **Memory Management**: Unified memory management for execution and storage +- **Shuffle System**: Data redistribution across partitions +- **Storage System**: In-memory and disk-based storage for cached data +- **Network Communication**: RPC and data transfer between driver and executors + +## Key Components + +### RDD (Resilient Distributed Dataset) + +The core abstraction in Spark - an immutable, distributed collection of objects that can be processed in parallel. + +**Key characteristics:** +- **Resilient**: Fault-tolerant through lineage information +- **Distributed**: Data is partitioned across cluster nodes +- **Immutable**: Cannot be changed once created + +**Location**: `src/main/scala/org/apache/spark/rdd/` + +**Main classes:** +- `RDD.scala`: Base RDD class with transformations and actions +- `HadoopRDD.scala`: RDD for reading from Hadoop +- `ParallelCollectionRDD.scala`: RDD created from a local collection +- `MapPartitionsRDD.scala`: Result of map-like transformations + +### SparkContext + +The main entry point for Spark functionality. Creates RDDs, accumulators, and broadcast variables. + +**Location**: `src/main/scala/org/apache/spark/SparkContext.scala` + +**Key responsibilities:** +- Connects to cluster manager +- Acquires executors +- Sends application code to executors +- Creates and manages RDDs +- Schedules and executes jobs + +### Scheduling + +#### DAGScheduler + +Computes a DAG of stages for each job and submits them to the TaskScheduler. + +**Location**: `src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala` + +**Responsibilities:** +- Determines preferred locations for tasks based on cache status +- Handles task failures and stage retries +- Identifies shuffle boundaries to split stages +- Manages job completion and failure + +#### TaskScheduler + +Submits task sets to the cluster, manages task execution, and retries failed tasks. + +**Location**: `src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala` + +**Implementations:** +- `TaskSchedulerImpl`: Default implementation +- `YarnScheduler`: YARN-specific implementation +- Cluster manager-specific schedulers + +### Memory Management + +Unified memory management system that dynamically allocates memory between execution and storage. + +**Location**: `src/main/scala/org/apache/spark/memory/` + +**Components:** +- `MemoryManager`: Base memory management interface +- `UnifiedMemoryManager`: Dynamic allocation between execution and storage +- `StorageMemoryPool`: Memory pool for caching +- `ExecutionMemoryPool`: Memory pool for shuffles and joins + +**Memory regions:** +1. **Execution Memory**: Shuffles, joins, sorts, aggregations +2. **Storage Memory**: Caching and broadcasting +3. **User Memory**: User data structures +4. **Reserved Memory**: System overhead + +### Shuffle System + +Handles data redistribution between stages. + +**Location**: `src/main/scala/org/apache/spark/shuffle/` + +**Key classes:** +- `ShuffleManager`: Interface for shuffle implementations +- `SortShuffleManager`: Default shuffle implementation +- `ShuffleWriter`: Writes shuffle data +- `ShuffleReader`: Reads shuffle data + +**Shuffle process:** +1. 
**Shuffle Write**: Map tasks write partitioned data to disk +2. **Shuffle Fetch**: Reduce tasks fetch data from map outputs +3. **Shuffle Service**: External service for serving shuffle data + +### Storage System + +Block-based storage abstraction for cached data and shuffle outputs. + +**Location**: `src/main/scala/org/apache/spark/storage/` + +**Components:** +- `BlockManager`: Manages data blocks in memory and disk +- `MemoryStore`: In-memory block storage +- `DiskStore`: Disk-based block storage +- `BlockManagerMaster`: Master for coordinating block managers + +**Storage levels:** +- `MEMORY_ONLY`: Store in memory only +- `MEMORY_AND_DISK`: Spill to disk if memory is full +- `DISK_ONLY`: Store on disk only +- `OFF_HEAP`: Store in off-heap memory + +### Network Layer + +Communication infrastructure for driver-executor and executor-executor communication. + +**Location**: `src/main/scala/org/apache/spark/network/` and `common/network-*/` + +**Components:** +- `NettyRpcEnv`: Netty-based RPC implementation +- `TransportContext`: Network communication setup +- `BlockTransferService`: Block data transfer + +### Serialization + +Efficient serialization for data and closures. + +**Location**: `src/main/scala/org/apache/spark/serializer/` + +**Serializers:** +- `JavaSerializer`: Default Java serialization (slower) +- `KryoSerializer`: Faster, more compact serialization (recommended) + +**Configuration:** +```scala +conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") +``` + +## API Overview + +### Creating RDDs + +```scala +// From a local collection +val data = Array(1, 2, 3, 4, 5) +val rdd = sc.parallelize(data) + +// From external storage +val textFile = sc.textFile("hdfs://path/to/file") + +// From another RDD +val mapped = rdd.map(_ * 2) +``` + +### Transformations + +Lazy operations that define a new RDD: + +```scala +val mapped = rdd.map(x => x * 2) +val filtered = rdd.filter(x => x > 10) +val flatMapped = rdd.flatMap(x => x.toString.split(" ")) +``` + +### Actions + +Operations that trigger computation: + +```scala +val count = rdd.count() +val collected = rdd.collect() +val reduced = rdd.reduce(_ + _) +rdd.saveAsTextFile("hdfs://path/to/output") +``` + +### Caching + +```scala +// Cache in memory +rdd.cache() + +// Cache with specific storage level +rdd.persist(StorageLevel.MEMORY_AND_DISK) + +// Remove from cache +rdd.unpersist() +``` + +## Configuration + +Key configuration parameters (set via `SparkConf`): + +### Memory +- `spark.executor.memory`: Executor memory (default: 1g) +- `spark.memory.fraction`: Fraction for execution and storage (default: 0.6) +- `spark.memory.storageFraction`: Fraction of spark.memory.fraction for storage (default: 0.5) + +### Parallelism +- `spark.default.parallelism`: Default number of partitions (default: number of cores) +- `spark.sql.shuffle.partitions`: Partitions for shuffle operations (default: 200) + +### Scheduling +- `spark.scheduler.mode`: FIFO or FAIR (default: FIFO) +- `spark.locality.wait`: Wait time for data-local tasks (default: 3s) + +### Shuffle +- `spark.shuffle.compress`: Compress shuffle output (default: true) +- `spark.shuffle.spill.compress`: Compress shuffle spills (default: true) + +See [configuration.md](../docs/configuration.md) for complete list. + +## Architecture + +### Job Execution Flow + +1. **Action called** → Triggers job submission +2. **DAG construction** → DAGScheduler creates stages +3. **Task creation** → Each stage becomes a task set +4. 
**Task scheduling** → TaskScheduler assigns tasks to executors +5. **Task execution** → Executors run tasks +6. **Result collection** → Results returned to driver + +### Fault Tolerance + +Spark achieves fault tolerance through: + +1. **RDD Lineage**: Each RDD knows how to recompute from its parent RDDs +2. **Task Retry**: Failed tasks are automatically retried +3. **Stage Retry**: Failed stages are re-executed +4. **Checkpoint**: Optionally save RDD to stable storage + +## Building and Testing + +### Build Core Module + +```bash +# Build core only +./build/mvn -pl core -DskipTests package + +# Build core with dependencies +./build/mvn -pl core -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all core tests +./build/mvn test -pl core + +# Run specific test suite +./build/mvn test -pl core -Dtest=SparkContextSuite + +# Run specific test +./build/mvn test -pl core -Dtest=SparkContextSuite#testJobCancellation +``` + +## Source Code Organization + +``` +core/src/main/ +├── java/ # Java sources +│ └── org/apache/spark/ +│ ├── api/ # Java API +│ ├── shuffle/ # Shuffle implementation +│ └── unsafe/ # Unsafe operations +├── scala/ # Scala sources +│ └── org/apache/spark/ +│ ├── rdd/ # RDD implementations +│ ├── scheduler/ # Scheduling components +│ ├── storage/ # Storage system +│ ├── memory/ # Memory management +│ ├── shuffle/ # Shuffle system +│ ├── broadcast/ # Broadcast variables +│ ├── deploy/ # Deployment components +│ ├── executor/ # Executor implementation +│ ├── io/ # I/O utilities +│ ├── network/ # Network layer +│ ├── serializer/ # Serialization +│ └── util/ # Utilities +└── resources/ # Resource files +``` + +## Performance Tuning + +### Memory Optimization + +1. Adjust memory fractions based on workload +2. Use off-heap memory for large datasets +3. Choose appropriate storage levels +4. Avoid excessive caching + +### Shuffle Optimization + +1. Minimize shuffle operations +2. Use `reduceByKey` instead of `groupByKey` +3. Increase shuffle parallelism +4. Enable compression + +### Serialization Optimization + +1. Use Kryo serialization +2. Register custom classes with Kryo +3. Avoid closures with large objects + +### Data Locality + +1. Ensure data and compute are co-located +2. Increase `spark.locality.wait` if needed +3. Use appropriate storage levels + +## Common Issues and Solutions + +### OutOfMemoryError + +- Increase executor memory +- Reduce parallelism +- Use disk-based storage levels +- Enable off-heap memory + +### Shuffle Failures + +- Increase shuffle memory +- Increase shuffle parallelism +- Enable external shuffle service + +### Slow Performance + +- Check data skew +- Optimize shuffle operations +- Increase parallelism +- Enable speculation + +## Further Reading + +- [RDD Programming Guide](../docs/rdd-programming-guide.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Tuning Guide](../docs/tuning.md) +- [Job Scheduling](../docs/job-scheduling.md) +- [Hardware Provisioning](../docs/hardware-provisioning.md) + +## Related Modules + +- [common/](../common/) - Common utilities shared across modules +- [launcher/](../launcher/) - Application launcher +- [sql/](../sql/) - Spark SQL and DataFrames +- [streaming/](../streaming/) - Spark Streaming diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000000..964dfaf3393c3 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,432 @@ +# Spark Examples + +This directory contains example programs for Apache Spark in Scala, Java, Python, and R. 
+ +## Overview + +The examples demonstrate various Spark features and APIs: + +- **Core Examples**: Basic RDD operations and transformations +- **SQL Examples**: DataFrame and SQL operations +- **Streaming Examples**: Stream processing with DStreams and Structured Streaming +- **MLlib Examples**: Machine learning algorithms and pipelines +- **GraphX Examples**: Graph processing algorithms + +## Running Examples + +### Using spark-submit + +The recommended way to run examples: + +```bash +# Run a Scala/Java example +./bin/run-example [params] + +# Example: Run SparkPi +./bin/run-example SparkPi 100 + +# Example: Run with specific master +MASTER=spark://host:7077 ./bin/run-example SparkPi 100 +``` + +### Direct spark-submit + +```bash +# Scala/Java examples +./bin/spark-submit \ + --class org.apache.spark.examples.SparkPi \ + --master local[4] \ + examples/target/scala-2.13/jars/spark-examples*.jar \ + 100 + +# Python examples +./bin/spark-submit examples/src/main/python/pi.py 100 + +# R examples +./bin/spark-submit examples/src/main/r/dataframe.R +``` + +### Interactive Shells + +```bash +# Scala shell with examples on classpath +./bin/spark-shell --jars examples/target/scala-2.13/jars/spark-examples*.jar + +# Python shell +./bin/pyspark +# Then run: exec(open('examples/src/main/python/pi.py').read()) + +# R shell +./bin/sparkR +# Then: source('examples/src/main/r/dataframe.R') +``` + +## Example Categories + +### Core Examples + +**Basic RDD Operations** + +- `SparkPi`: Estimates π using Monte Carlo method +- `SparkLR`: Logistic regression using gradient descent +- `SparkKMeans`: K-means clustering +- `SparkPageRank`: PageRank algorithm implementation +- `GroupByTest`: Tests groupBy performance + +**Locations:** +- Scala: `src/main/scala/org/apache/spark/examples/` +- Java: `src/main/java/org/apache/spark/examples/` +- Python: `src/main/python/` +- R: `src/main/r/` + +### SQL Examples + +**DataFrame and SQL Operations** + +- `SparkSQLExample`: Basic DataFrame operations +- `SQLDataSourceExample`: Working with various data sources +- `RDDRelation`: Converting between RDDs and DataFrames +- `UserDefinedFunction`: Creating and using UDFs +- `CsvDataSource`: Reading and writing CSV files + +**Running:** +```bash +# Scala +./bin/run-example sql.SparkSQLExample + +# Python +./bin/spark-submit examples/src/main/python/sql/basic.py + +# R +./bin/spark-submit examples/src/main/r/RSparkSQLExample.R +``` + +### Streaming Examples + +**DStream Examples (Legacy)** + +- `NetworkWordCount`: Count words from network stream +- `StatefulNetworkWordCount`: Stateful word count +- `RecoverableNetworkWordCount`: Checkpoint and recovery +- `KafkaWordCount`: Read from Apache Kafka +- `QueueStream`: Create DStream from queue + +**Structured Streaming Examples** + +- `StructuredNetworkWordCount`: Word count using Structured Streaming +- `StructuredKafkaWordCount`: Kafka integration +- `StructuredSessionization`: Session window operations + +**Running:** +```bash +# DStream example +./bin/run-example streaming.NetworkWordCount localhost 9999 + +# Structured Streaming +./bin/run-example sql.streaming.StructuredNetworkWordCount localhost 9999 + +# Python Structured Streaming +./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999 +``` + +### MLlib Examples + +**Classification** +- `LogisticRegressionExample`: Binary and multiclass classification +- `DecisionTreeClassificationExample`: Decision tree classifier +- `RandomForestClassificationExample`: Random forest 
classifier +- `GradientBoostedTreeClassifierExample`: GBT classifier +- `NaiveBayesExample`: Naive Bayes classifier + +**Regression** +- `LinearRegressionExample`: Linear regression +- `DecisionTreeRegressionExample`: Decision tree regressor +- `RandomForestRegressionExample`: Random forest regressor +- `AFTSurvivalRegressionExample`: Survival regression + +**Clustering** +- `KMeansExample`: K-means clustering +- `BisectingKMeansExample`: Bisecting K-means +- `GaussianMixtureExample`: Gaussian mixture model +- `LDAExample`: Latent Dirichlet Allocation + +**Pipelines** +- `PipelineExample`: ML Pipeline with multiple stages +- `CrossValidatorExample`: Model selection with cross-validation +- `TrainValidationSplitExample`: Model selection with train/validation split + +**Running:** +```bash +# Scala +./bin/run-example ml.LogisticRegressionExample + +# Java +./bin/run-example ml.JavaLogisticRegressionExample + +# Python +./bin/spark-submit examples/src/main/python/ml/logistic_regression.py +``` + +### GraphX Examples + +**Graph Algorithms** + +- `PageRankExample`: PageRank algorithm +- `ConnectedComponentsExample`: Finding connected components +- `TriangleCountExample`: Counting triangles +- `SocialNetworkExample`: Social network analysis + +**Running:** +```bash +./bin/run-example graphx.PageRankExample +``` + +## Example Datasets + +Many examples use sample data from the `data/` directory: + +- `data/mllib/`: MLlib sample datasets + - `sample_libsvm_data.txt`: LibSVM format data + - `sample_binary_classification_data.txt`: Binary classification + - `sample_multiclass_classification_data.txt`: Multiclass classification + +- `data/graphx/`: GraphX sample data + - `followers.txt`: Social network follower data + - `users.txt`: User information + +## Building Examples + +### Build All Examples + +```bash +# Build examples module +./build/mvn -pl examples -am package + +# Skip tests +./build/mvn -pl examples -am -DskipTests package +``` + +### Build Specific Language Examples + +The examples are compiled together, but you can run them separately by language. 
+ +## Creating Your Own Examples + +### Scala Example Template + +```scala +package org.apache.spark.examples + +import org.apache.spark.sql.SparkSession + +object MyExample { + def main(args: Array[String]): Unit = { + val spark = SparkSession + .builder() + .appName("My Example") + .getOrCreate() + + try { + // Your Spark code here + import spark.implicits._ + val df = spark.range(100).toDF("number") + df.show() + } finally { + spark.stop() + } + } +} +``` + +### Python Example Template + +```python +from pyspark.sql import SparkSession + +def main(): + spark = SparkSession \ + .builder \ + .appName("My Example") \ + .getOrCreate() + + try: + # Your Spark code here + df = spark.range(100) + df.show() + finally: + spark.stop() + +if __name__ == "__main__": + main() +``` + +### Java Example Template + +```java +package org.apache.spark.examples; + +import org.apache.spark.sql.SparkSession; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; + +public class MyExample { + public static void main(String[] args) { + SparkSession spark = SparkSession + .builder() + .appName("My Example") + .getOrCreate(); + + try { + // Your Spark code here + Dataset df = spark.range(100); + df.show(); + } finally { + spark.stop(); + } + } +} +``` + +### R Example Template + +```r +library(SparkR) + +sparkR.session(appName = "My Example") + +# Your Spark code here +df <- createDataFrame(data.frame(number = 1:100)) +head(df) + +sparkR.session.stop() +``` + +## Example Directory Structure + +``` +examples/src/main/ +├── java/org/apache/spark/examples/ # Java examples +│ ├── JavaSparkPi.java +│ ├── JavaWordCount.java +│ ├── ml/ # ML examples +│ ├── sql/ # SQL examples +│ └── streaming/ # Streaming examples +├── python/ # Python examples +│ ├── pi.py +│ ├── wordcount.py +│ ├── ml/ # ML examples +│ ├── sql/ # SQL examples +│ └── streaming/ # Streaming examples +├── r/ # R examples +│ ├── RSparkSQLExample.R +│ ├── ml.R +│ └── dataframe.R +└── scala/org/apache/spark/examples/ # Scala examples + ├── SparkPi.scala + ├── SparkLR.scala + ├── ml/ # ML examples + ├── sql/ # SQL examples + ├── streaming/ # Streaming examples + └── graphx/ # GraphX examples +``` + +## Common Patterns + +### Reading Data + +```scala +// Text file +val textData = spark.read.textFile("path/to/file.txt") + +// CSV +val csvData = spark.read.option("header", "true").csv("path/to/file.csv") + +// JSON +val jsonData = spark.read.json("path/to/file.json") + +// Parquet +val parquetData = spark.read.parquet("path/to/file.parquet") +``` + +### Writing Data + +```scala +// Save as text +df.write.text("output/path") + +// Save as CSV +df.write.option("header", "true").csv("output/path") + +// Save as Parquet +df.write.parquet("output/path") + +// Save as JSON +df.write.json("output/path") +``` + +### Working with Partitions + +```scala +// Repartition for more parallelism +val repartitioned = df.repartition(10) + +// Coalesce to reduce partitions +val coalesced = df.coalesce(2) + +// Partition by column when writing +df.write.partitionBy("year", "month").parquet("output/path") +``` + +## Performance Tips for Examples + +1. **Use Local Mode for Testing**: Start with `local[*]` for development +2. **Adjust Partitions**: Use appropriate partition counts for your data size +3. **Cache When Reusing**: Cache DataFrames/RDDs that are accessed multiple times +4. 
**Monitor Jobs**: Use Spark UI at http://localhost:4040 to monitor execution + +## Troubleshooting + +### Common Issues + +**OutOfMemoryError** +```bash +# Increase driver memory +./bin/spark-submit --driver-memory 4g examples/... + +# Increase executor memory +./bin/spark-submit --executor-memory 4g examples/... +``` + +**Class Not Found** +```bash +# Make sure examples JAR is built +./build/mvn -pl examples -am package +``` + +**File Not Found** +```bash +# Use absolute paths or ensure working directory is spark root +./bin/run-example SparkPi # Run from spark root directory +``` + +## Additional Resources + +- [Quick Start Guide](../docs/quick-start.md) +- [Programming Guide](../docs/programming-guide.md) +- [SQL Programming Guide](../docs/sql-programming-guide.md) +- [MLlib Guide](../docs/ml-guide.md) +- [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) +- [GraphX Guide](../docs/graphx-programming-guide.md) + +## Contributing Examples + +When adding new examples: + +1. Follow existing code style and structure +2. Include clear comments explaining the example +3. Add appropriate documentation +4. Test the example with various inputs +5. Add to the appropriate category +6. Update this README + +For more information, see [CONTRIBUTING.md](../CONTRIBUTING.md). diff --git a/graphx/README.md b/graphx/README.md new file mode 100644 index 0000000000000..08c841b6c04d5 --- /dev/null +++ b/graphx/README.md @@ -0,0 +1,549 @@ +# GraphX + +GraphX is Apache Spark's API for graphs and graph-parallel computation. + +## Overview + +GraphX unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system. It provides: + +- **Graph Abstraction**: Efficient representation of property graphs +- **Graph Algorithms**: PageRank, Connected Components, Triangle Counting, and more +- **Pregel API**: For iterative graph computations +- **Graph Builders**: Tools to construct graphs from RDDs or files +- **Graph Operators**: Transformations and structural operations + +## Key Concepts + +### Property Graph + +A directed multigraph with properties attached to each vertex and edge. + +**Components:** +- **Vertices**: Nodes with unique IDs and properties +- **Edges**: Directed connections between vertices with properties +- **Triplets**: A view joining vertices and edges + +```scala +import org.apache.spark.graphx._ + +// Create vertices RDD +val vertices: RDD[(VertexId, String)] = sc.parallelize(Array( + (1L, "Alice"), + (2L, "Bob"), + (3L, "Charlie") +)) + +// Create edges RDD +val edges: RDD[Edge[String]] = sc.parallelize(Array( + Edge(1L, 2L, "friend"), + Edge(2L, 3L, "follow") +)) + +// Build the graph +val graph: Graph[String, String] = Graph(vertices, edges) +``` + +### Graph Structure + +``` +Graph[VD, ED] + - vertices: VertexRDD[VD] // Vertices with properties of type VD + - edges: EdgeRDD[ED] // Edges with properties of type ED + - triplets: RDD[EdgeTriplet[VD, ED]] // Combined view +``` + +## Core Components + +### Graph Class + +The main graph abstraction. 
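+
+Before the class details below, here is a minimal sketch of the triplet view this class exposes, reusing the small `graph: Graph[String, String]` built in the Key Concepts section above (so `graph` and its string properties are carried over from that snippet, not new API):
+
+```scala
+// Reuses `graph: Graph[String, String]` from the Key Concepts snippet above.
+// Each EdgeTriplet combines an edge with its source and destination vertex properties.
+val relations: Array[String] = graph.triplets
+  .map(t => s"${t.srcAttr} --[${t.attr}]--> ${t.dstAttr}")
+  .collect()
+
+relations.foreach(println)
+// e.g. Alice --[friend]--> Bob
+//      Bob --[follow]--> Charlie
+```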
+ +**Location**: `src/main/scala/org/apache/spark/graphx/Graph.scala` + +**Key methods:** +- `vertices: VertexRDD[VD]`: Access vertices +- `edges: EdgeRDD[ED]`: Access edges +- `triplets: RDD[EdgeTriplet[VD, ED]]`: Access triplets +- `mapVertices[VD2](map: (VertexId, VD) => VD2)`: Transform vertex properties +- `mapEdges[ED2](map: Edge[ED] => ED2)`: Transform edge properties +- `subgraph(epred, vpred)`: Create subgraph based on predicates + +### VertexRDD + +Optimized RDD for vertex data. + +**Location**: `src/main/scala/org/apache/spark/graphx/VertexRDD.scala` + +**Features:** +- Fast lookups by vertex ID +- Efficient joins with edge data +- Reuse of vertex indices + +### EdgeRDD + +Optimized RDD for edge data. + +**Location**: `src/main/scala/org/apache/spark/graphx/EdgeRDD.scala` + +**Features:** +- Compact edge storage +- Fast filtering and mapping +- Efficient partitioning + +### EdgeTriplet + +Represents a edge with its source and destination vertex properties. + +**Structure:** +```scala +class EdgeTriplet[VD, ED] extends Edge[ED] { + var srcAttr: VD // Source vertex property + var dstAttr: VD // Destination vertex property + var attr: ED // Edge property +} +``` + +## Graph Operators + +### Property Operators + +```scala +// Map vertex properties +val newGraph = graph.mapVertices((id, attr) => attr.toUpperCase) + +// Map edge properties +val newGraph = graph.mapEdges(e => e.attr + "relationship") + +// Map triplets (access to src and dst properties) +val newGraph = graph.mapTriplets(triplet => + (triplet.srcAttr, triplet.attr, triplet.dstAttr) +) +``` + +### Structural Operators + +```scala +// Reverse edge directions +val reversedGraph = graph.reverse + +// Create subgraph +val subgraph = graph.subgraph( + epred = e => e.srcId != e.dstId, // No self-loops + vpred = (id, attr) => attr.length > 0 // Non-empty names +) + +// Mask graph (keep only edges/vertices in another graph) +val maskedGraph = graph.mask(subgraph) + +// Group edges +val groupedGraph = graph.groupEdges((e1, e2) => e1 + e2) +``` + +### Join Operators + +```scala +// Join vertices with external data +val newData: RDD[(VertexId, NewType)] = ... +val newGraph = graph.joinVertices(newData) { + (id, oldAttr, newAttr) => (oldAttr, newAttr) +} + +// Outer join vertices +val newGraph = graph.outerJoinVertices(newData) { + (id, oldAttr, newAttr) => newAttr.getOrElse(oldAttr) +} +``` + +## Graph Algorithms + +GraphX includes several common graph algorithms. + +**Location**: `src/main/scala/org/apache/spark/graphx/lib/` + +### PageRank + +Measures the importance of each vertex based on link structure. + +```scala +import org.apache.spark.graphx.lib.PageRank + +// Static PageRank (fixed iterations) +val ranks = graph.staticPageRank(numIter = 10) + +// Dynamic PageRank (convergence-based) +val ranks = graph.pageRank(tol = 0.001) + +// Get top ranked vertices +val topRanked = ranks.vertices.top(10)(Ordering.by(_._2)) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/PageRank.scala` + +### Connected Components + +Finds connected components in the graph. + +```scala +import org.apache.spark.graphx.lib.ConnectedComponents + +// Find connected components +val cc = graph.connectedComponents() + +// Count vertices in each component +val componentCounts = cc.vertices + .map { case (id, component) => (component, 1) } + .reduceByKey(_ + _) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala` + +### Triangle Counting + +Counts triangles (3-cliques) in the graph. 
+ +```scala +import org.apache.spark.graphx.lib.TriangleCount + +// Count triangles +val triCounts = graph.triangleCount() + +// Get vertices with most triangles +val topTriangles = triCounts.vertices.top(10)(Ordering.by(_._2)) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala` + +### Label Propagation + +Community detection algorithm. + +```scala +import org.apache.spark.graphx.lib.LabelPropagation + +// Run label propagation +val communities = graph.labelPropagation(maxSteps = 5) + +// Group vertices by community +val communityGroups = communities.vertices + .map { case (id, label) => (label, Set(id)) } + .reduceByKey(_ ++ _) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala` + +### Strongly Connected Components + +Finds strongly connected components in a directed graph. + +```scala +import org.apache.spark.graphx.lib.StronglyConnectedComponents + +// Find strongly connected components +val scc = graph.stronglyConnectedComponents(numIter = 10) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala` + +### Shortest Paths + +Computes shortest paths from source vertices to all reachable vertices. + +```scala +import org.apache.spark.graphx.lib.ShortestPaths + +// Compute shortest paths from vertices 1 and 2 +val landmarks = Seq(1L, 2L) +val results = graph.shortestPaths(landmarks) + +// Results contain distance to each landmark +results.vertices.foreach { case (id, distances) => + println(s"Vertex $id: $distances") +} +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala` + +## Pregel API + +Bulk-synchronous parallel messaging abstraction for iterative graph algorithms. + +```scala +def pregel[A: ClassTag]( + initialMsg: A, + maxIterations: Int = Int.MaxValue, + activeDirection: EdgeDirection = EdgeDirection.Either +)( + vprog: (VertexId, VD, A) => VD, + sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], + mergeMsg: (A, A) => A +): Graph[VD, ED] +``` + +**Example: Single-Source Shortest Path** + +```scala +val sourceId: VertexId = 1L + +// Initialize distances +val initialGraph = graph.mapVertices((id, _) => + if (id == sourceId) 0.0 else Double.PositiveInfinity +) + +// Run Pregel +val sssp = initialGraph.pregel(Double.PositiveInfinity)( + // Vertex program: update vertex value with minimum distance + (id, dist, newDist) => math.min(dist, newDist), + + // Send message: send distance + edge weight to neighbors + triplet => { + if (triplet.srcAttr + triplet.attr < triplet.dstAttr) { + Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) + } else { + Iterator.empty + } + }, + + // Merge messages: take minimum distance + (a, b) => math.min(a, b) +) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/Pregel.scala` + +## Graph Builders + +### From Edge List + +```scala +// Load edge list from file +val graph = GraphLoader.edgeListFile(sc, "path/to/edges.txt") + +// Edge file format: source destination +// Example: +// 1 2 +// 2 3 +// 3 1 +``` + +### From RDDs + +```scala +val vertices: RDD[(VertexId, VD)] = ... +val edges: RDD[Edge[ED]] = ... + +val graph = Graph(vertices, edges) + +// With default vertex property +val graph = Graph.fromEdges(edges, defaultValue = "Unknown") + +// From edge tuples +val edgeTuples: RDD[(VertexId, VertexId)] = ... +val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1) +``` + +## Partitioning Strategies + +Efficient graph partitioning is crucial for performance. 
+ +**Available strategies:** +- `EdgePartition1D`: Partition edges by source vertex +- `EdgePartition2D`: 2D matrix partitioning +- `RandomVertexCut`: Random edge partitioning (default) +- `CanonicalRandomVertexCut`: Similar to RandomVertexCut but canonical + +```scala +import org.apache.spark.graphx.PartitionStrategy + +val graph = Graph(vertices, edges) + .partitionBy(PartitionStrategy.EdgePartition2D) +``` + +**Location**: `src/main/scala/org/apache/spark/graphx/PartitionStrategy.scala` + +## Performance Optimization + +### Caching + +```scala +// Cache graph in memory +graph.cache() + +// Or persist with storage level +graph.persist(StorageLevel.MEMORY_AND_DISK) + +// Unpersist when done +graph.unpersist() +``` + +### Partitioning + +```scala +// Repartition for better balance +val partitionedGraph = graph + .partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 100) + .cache() +``` + +### Checkpointing + +For iterative algorithms, checkpoint periodically: + +```scala +sc.setCheckpointDir("hdfs://checkpoint") + +var graph = initialGraph +for (i <- 1 to maxIterations) { + // Perform iteration + graph = performIteration(graph) + + // Checkpoint every 10 iterations + if (i % 10 == 0) { + graph.checkpoint() + } +} +``` + +## Building and Testing + +### Build GraphX Module + +```bash +# Build graphx module +./build/mvn -pl graphx -am package + +# Skip tests +./build/mvn -pl graphx -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all graphx tests +./build/mvn test -pl graphx + +# Run specific test suite +./build/mvn test -pl graphx -Dtest=GraphSuite +``` + +## Source Code Organization + +``` +graphx/src/main/ +├── scala/org/apache/spark/graphx/ +│ ├── Graph.scala # Main graph class +│ ├── GraphOps.scala # Graph operations +│ ├── VertexRDD.scala # Vertex RDD +│ ├── EdgeRDD.scala # Edge RDD +│ ├── Edge.scala # Edge class +│ ├── EdgeTriplet.scala # Edge triplet +│ ├── Pregel.scala # Pregel API +│ ├── GraphLoader.scala # Graph loading utilities +│ ├── PartitionStrategy.scala # Partitioning strategies +│ ├── impl/ # Implementation details +│ │ ├── GraphImpl.scala # Graph implementation +│ │ ├── VertexRDDImpl.scala # VertexRDD implementation +│ │ ├── EdgeRDDImpl.scala # EdgeRDD implementation +│ │ └── ReplicatedVertexView.scala # Vertex replication +│ ├── lib/ # Graph algorithms +│ │ ├── PageRank.scala +│ │ ├── ConnectedComponents.scala +│ │ ├── TriangleCount.scala +│ │ ├── LabelPropagation.scala +│ │ ├── StronglyConnectedComponents.scala +│ │ └── ShortestPaths.scala +│ └── util/ # Utilities +│ ├── BytecodeUtils.scala +│ └── GraphGenerators.scala # Test graph generation +└── resources/ +``` + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/graphx/](../examples/src/main/scala/org/apache/spark/examples/graphx/) for complete examples. 
+ +**Key examples:** +- `PageRankExample.scala`: PageRank on social network +- `ConnectedComponentsExample.scala`: Finding connected components +- `SocialNetworkExample.scala`: Complete social network analysis + +## Common Use Cases + +### Social Network Analysis + +```scala +// Load social network +val users: RDD[(VertexId, String)] = sc.textFile("users.txt") + .map(line => (line.split(",")(0).toLong, line.split(",")(1))) + +val relationships: RDD[Edge[String]] = sc.textFile("relationships.txt") + .map { line => + val fields = line.split(",") + Edge(fields(0).toLong, fields(1).toLong, fields(2)) + } + +val graph = Graph(users, relationships) + +// Find influential users (PageRank) +val ranks = graph.pageRank(0.001).vertices + +// Find communities +val communities = graph.labelPropagation(5) + +// Count mutual friends (triangles) +val triangles = graph.triangleCount() +``` + +### Web Graph Analysis + +```scala +// Load web graph +val graph = GraphLoader.edgeListFile(sc, "web-graph.txt") + +// Compute PageRank +val ranks = graph.pageRank(0.001) + +// Find authoritative pages +val topPages = ranks.vertices.top(100)(Ordering.by(_._2)) +``` + +### Road Network Analysis + +```scala +// Vertices are intersections, edges are roads +val roadNetwork: Graph[String, Double] = ... + +// Find shortest paths from landmarks +val landmarks = Seq(1L, 2L, 3L) +val distances = roadNetwork.shortestPaths(landmarks) + +// Find highly connected intersections +val degrees = roadNetwork.degrees +val busyIntersections = degrees.top(10)(Ordering.by(_._2)) +``` + +## Best Practices + +1. **Partition carefully**: Use appropriate partitioning strategy for your workload +2. **Cache graphs**: Cache graphs that are accessed multiple times +3. **Avoid unnecessary materialization**: GraphX uses lazy evaluation +4. **Use GraphLoader**: For simple edge lists, use GraphLoader +5. **Monitor memory**: Graph algorithms can be memory-intensive +6. **Checkpoint long lineages**: Checkpoint periodically in iterative algorithms +7. **Consider edge direction**: Many operations respect edge direction + +## Limitations and Considerations + +- **No mutable graphs**: Graphs are immutable; modifications create new graphs +- **Memory overhead**: Vertex replication can increase memory usage +- **Edge direction**: Operations may behave differently on directed vs undirected graphs +- **Single-machine graphs**: For small graphs (< 1M edges), NetworkX or igraph may be faster + +## Further Reading + +- [GraphX Programming Guide](../docs/graphx-programming-guide.md) +- [GraphX Paper](http://www.vldb.org/pvldb/vol7/p1673-xin.pdf) +- [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf) + +## Contributing + +For contributing to GraphX, see [CONTRIBUTING.md](../CONTRIBUTING.md). diff --git a/mllib/README.md b/mllib/README.md new file mode 100644 index 0000000000000..dd62159f84fef --- /dev/null +++ b/mllib/README.md @@ -0,0 +1,514 @@ +# MLlib - Machine Learning Library + +MLlib is Apache Spark's scalable machine learning library. + +## Overview + +MLlib provides: + +- **ML Algorithms**: Classification, regression, clustering, collaborative filtering +- **Featurization**: Feature extraction, transformation, dimensionality reduction, selection +- **Pipelines**: Tools for constructing, evaluating, and tuning ML workflows +- **Utilities**: Linear algebra, statistics, data handling + +## Important Note + +MLlib includes two packages: + +1. 
**`spark.ml`** (DataFrame-based API) - **Primary API** (Recommended) +2. **`spark.mllib`** (RDD-based API) - **Maintenance mode only** + +The RDD-based API (`spark.mllib`) is in maintenance mode. The DataFrame-based API (`spark.ml`) is the primary API and is recommended for all new applications. + +## Package Structure + +### spark.ml (Primary API) + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/` + +DataFrame-based API with: +- **ML Pipeline API**: For building ML workflows +- **Transformers**: Feature transformers +- **Estimators**: Learning algorithms +- **Models**: Fitted models + +```scala +import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.feature.VectorAssembler + +// Create pipeline +val assembler = new VectorAssembler() + .setInputCols(Array("feature1", "feature2")) + .setOutputCol("features") + +val lr = new LogisticRegression() + .setMaxIter(10) + +val pipeline = new Pipeline().setStages(Array(assembler, lr)) + +// Fit model +val model = pipeline.fit(trainingData) + +// Make predictions +val predictions = model.transform(testData) +``` + +### spark.mllib (RDD-based API - Maintenance Mode) + +**Location**: `src/main/scala/org/apache/spark/mllib/` + +RDD-based API with: +- Classic algorithms using RDDs +- Maintained for backward compatibility +- No new features added + +```scala +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.regression.LabeledPoint + +// Train model (old API) +val data: RDD[LabeledPoint] = ... +val model = LogisticRegressionWithLBFGS.train(data) + +// Make predictions +val predictions = data.map { point => model.predict(point.features) } +``` + +## Key Concepts + +### Pipeline API (spark.ml) + +Machine learning pipelines provide: + +1. **DataFrame**: Unified data representation +2. **Transformer**: Algorithms that transform DataFrames +3. **Estimator**: Algorithms that fit on DataFrames to produce Transformers +4. **Pipeline**: Chains multiple Transformers and Estimators +5. **Parameter**: Common API for specifying parameters + +**Example Pipeline:** +```scala +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.feature.{HashingTF, Tokenizer} + +// Configure pipeline stages +val tokenizer = new Tokenizer() + .setInputCol("text") + .setOutputCol("words") + +val hashingTF = new HashingTF() + .setInputCol("words") + .setOutputCol("features") + +val lr = new LogisticRegression() + .setMaxIter(10) + +val pipeline = new Pipeline() + .setStages(Array(tokenizer, hashingTF, lr)) + +// Fit the pipeline +val model = pipeline.fit(trainingData) + +// Make predictions +model.transform(testData) +``` + +### Transformers + +Algorithms that transform one DataFrame into another. + +**Examples:** +- `Tokenizer`: Splits text into words +- `HashingTF`: Maps word sequences to feature vectors +- `StandardScaler`: Normalizes features +- `VectorAssembler`: Combines multiple columns into a vector +- `PCA`: Dimensionality reduction + +### Estimators + +Algorithms that fit on a DataFrame to produce a Transformer. 
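+
+For instance, the Estimator/Transformer split can be seen with `StringIndexer` from the list below: `fit()` learns a label-to-index mapping and returns a `StringIndexerModel`, which is itself a Transformer. A minimal sketch (the `spark` session and the toy DataFrame are assumptions, not code from this module):
+
+```scala
+import org.apache.spark.ml.feature.StringIndexer
+
+// Toy input; `spark` is an existing SparkSession (assumption).
+val df = spark.createDataFrame(Seq(
+  (0, "a"), (1, "b"), (2, "a"), (3, "c")
+)).toDF("id", "category")
+
+// Estimator: fit() learns the mapping from the data...
+val indexer = new StringIndexer()
+  .setInputCol("category")
+  .setOutputCol("categoryIndex")
+
+// ...and produces a Transformer (StringIndexerModel) that can be reused.
+val indexerModel = indexer.fit(df)
+indexerModel.transform(df).show()
+```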
+ +**Examples:** +- `LogisticRegression`: Produces LogisticRegressionModel +- `DecisionTreeClassifier`: Produces DecisionTreeClassificationModel +- `KMeans`: Produces KMeansModel +- `StringIndexer`: Produces StringIndexerModel + +## ML Algorithms + +### Classification + +**Binary and Multiclass:** +- Logistic Regression +- Decision Tree Classifier +- Random Forest Classifier +- Gradient-Boosted Tree Classifier +- Naive Bayes +- Linear Support Vector Machine + +**Multilabel:** +- OneVsRest + +**Example:** +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val lr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + +val model = lr.fit(trainingData) +val predictions = model.transform(testData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/classification/` + +### Regression + +- Linear Regression +- Generalized Linear Regression +- Decision Tree Regression +- Random Forest Regression +- Gradient-Boosted Tree Regression +- Survival Regression (AFT) +- Isotonic Regression + +**Example:** +```scala +import org.apache.spark.ml.regression.LinearRegression + +val lr = new LinearRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + +val model = lr.fit(trainingData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/regression/` + +### Clustering + +- K-means +- Latent Dirichlet Allocation (LDA) +- Bisecting K-means +- Gaussian Mixture Model (GMM) + +**Example:** +```scala +import org.apache.spark.ml.clustering.KMeans + +val kmeans = new KMeans() + .setK(3) + .setSeed(1L) + +val model = kmeans.fit(dataset) +val predictions = model.transform(dataset) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/clustering/` + +### Collaborative Filtering + +Alternating Least Squares (ALS) for recommendation systems. 
+ +**Example:** +```scala +import org.apache.spark.ml.recommendation.ALS + +val als = new ALS() + .setMaxIter(10) + .setRegParam(0.01) + .setUserCol("userId") + .setItemCol("movieId") + .setRatingCol("rating") + +val model = als.fit(ratings) +val predictions = model.transform(testData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/recommendation/` + +## Feature Engineering + +### Feature Extractors + +- `TF-IDF`: Text feature extraction +- `Word2Vec`: Word embeddings +- `CountVectorizer`: Converts text to vectors of token counts + +### Feature Transformers + +- `Tokenizer`: Text tokenization +- `StopWordsRemover`: Removes stop words +- `StringIndexer`: Encodes string labels to indices +- `IndexToString`: Converts indices back to strings +- `OneHotEncoder`: One-hot encoding +- `VectorAssembler`: Combines columns into feature vector +- `StandardScaler`: Standardizes features +- `MinMaxScaler`: Scales features to a range +- `Normalizer`: Normalizes vectors to unit norm +- `Binarizer`: Binarizes based on threshold + +### Feature Selectors + +- `VectorSlicer`: Extracts subset of features +- `RFormula`: R model formula for feature specification +- `ChiSqSelector`: Chi-square feature selection + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/feature/` + +## Model Selection and Tuning + +### Cross-Validation + +```scala +import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder} +import org.apache.spark.ml.evaluation.RegressionEvaluator + +val paramGrid = new ParamGridBuilder() + .addGrid(lr.regParam, Array(0.1, 0.01)) + .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)) + .build() + +val cv = new CrossValidator() + .setEstimator(lr) + .setEvaluator(new RegressionEvaluator()) + .setEstimatorParamMaps(paramGrid) + .setNumFolds(3) + +val cvModel = cv.fit(trainingData) +``` + +### Train-Validation Split + +```scala +import org.apache.spark.ml.tuning.TrainValidationSplit + +val trainValidationSplit = new TrainValidationSplit() + .setEstimator(lr) + .setEvaluator(new RegressionEvaluator()) + .setEstimatorParamMaps(paramGrid) + .setTrainRatio(0.8) + +val model = trainValidationSplit.fit(trainingData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/tuning/` + +## Evaluation Metrics + +### Classification + +```scala +import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator + +val evaluator = new MulticlassClassificationEvaluator() + .setLabelCol("label") + .setPredictionCol("prediction") + .setMetricName("accuracy") + +val accuracy = evaluator.evaluate(predictions) +``` + +### Regression + +```scala +import org.apache.spark.ml.evaluation.RegressionEvaluator + +val evaluator = new RegressionEvaluator() + .setLabelCol("label") + .setPredictionCol("prediction") + .setMetricName("rmse") + +val rmse = evaluator.evaluate(predictions) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/evaluation/` + +## Linear Algebra + +MLlib provides distributed linear algebra through Breeze. 
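+
+Local types are shown below; as a quick taste of the distributed side (see the distributed matrix types listed further down), a `RowMatrix` wraps an RDD of vectors. A minimal sketch, assuming an existing SparkContext `sc`:
+
+```scala
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+// `sc` is an existing SparkContext (assumption).
+val rows = sc.parallelize(Seq(
+  Vectors.dense(1.0, 2.0, 3.0),
+  Vectors.dense(4.0, 5.0, 6.0),
+  Vectors.dense(7.0, 8.0, 9.0)
+))
+
+// Rows are distributed across the cluster; statistics are computed in parallel.
+val mat = new RowMatrix(rows)
+val colStats = mat.computeColumnSummaryStatistics()
+println(s"size: ${mat.numRows()} x ${mat.numCols()}, column means: ${colStats.mean}")
+```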
+ +**Location**: `src/main/scala/org/apache/spark/mllib/linalg/` + +**Local vectors and matrices:** +```scala +import org.apache.spark.ml.linalg.{Vector, Vectors, Matrix, Matrices} + +// Dense vector +val dv: Vector = Vectors.dense(1.0, 0.0, 3.0) + +// Sparse vector +val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) + +// Dense matrix +val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)) +``` + +**Distributed matrices:** +- `RowMatrix`: Distributed row-oriented matrix +- `IndexedRowMatrix`: Indexed rows +- `CoordinateMatrix`: Coordinate list format +- `BlockMatrix`: Block-partitioned matrix + +## Statistics + +Basic statistics and hypothesis testing. + +**Location**: `src/main/scala/org/apache/spark/mllib/stat/` + +**Examples:** +- Summary statistics +- Correlations +- Stratified sampling +- Hypothesis testing +- Random data generation + +## Building and Testing + +### Build MLlib Module + +```bash +# Build mllib module (RDD-based) +./build/mvn -pl mllib -am package + +# The DataFrame-based ml package is in sql/core +./build/mvn -pl sql/core -am package +``` + +### Run Tests + +```bash +# Run mllib tests +./build/mvn test -pl mllib + +# Run specific test +./build/mvn test -pl mllib -Dtest=LinearRegressionSuite +``` + +## Source Code Organization + +``` +mllib/src/main/ +├── scala/org/apache/spark/mllib/ +│ ├── classification/ # Classification algorithms (RDD-based) +│ ├── clustering/ # Clustering algorithms (RDD-based) +│ ├── evaluation/ # Evaluation metrics (RDD-based) +│ ├── feature/ # Feature engineering (RDD-based) +│ ├── fpm/ # Frequent pattern mining +│ ├── linalg/ # Linear algebra +│ ├── optimization/ # Optimization algorithms +│ ├── recommendation/ # Collaborative filtering (RDD-based) +│ ├── regression/ # Regression algorithms (RDD-based) +│ ├── stat/ # Statistics +│ ├── tree/ # Decision trees (RDD-based) +│ └── util/ # Utilities +└── resources/ +``` + +## Performance Considerations + +### Caching + +Cache datasets that are used multiple times: +```scala +val trainingData = data.cache() +``` + +### Parallelism + +Adjust parallelism for better performance: +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val lr = new LogisticRegression() + .setMaxIter(10) + .setParallelism(4) // Parallel model fitting +``` + +### Data Format + +Use Parquet format for efficient storage and reading: +```scala +df.write.parquet("training_data.parquet") +val data = spark.read.parquet("training_data.parquet") +``` + +### Feature Scaling + +Normalize features for better convergence: +```scala +import org.apache.spark.ml.feature.StandardScaler + +val scaler = new StandardScaler() + .setInputCol("features") + .setOutputCol("scaledFeatures") + .setWithStd(true) + .setWithMean(false) +``` + +## Best Practices + +1. **Use spark.ml**: Prefer DataFrame-based API over RDD-based API +2. **Build pipelines**: Use Pipeline API for reproducible workflows +3. **Cache data**: Cache datasets used in iterative algorithms +4. **Scale features**: Normalize features for better performance +5. **Cross-validate**: Use cross-validation for model selection +6. **Monitor convergence**: Check convergence for iterative algorithms +7. **Save models**: Persist trained models for reuse +8. 
**Use appropriate algorithms**: Choose algorithms based on data characteristics + +## Model Persistence + +Save and load models: + +```scala +// Save model +model.write.overwrite().save("path/to/model") + +// Load model +val loadedModel = PipelineModel.load("path/to/model") +``` + +## Migration Guide + +### From RDD-based API to DataFrame-based API + +**Old (RDD-based):** +```scala +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.regression.LabeledPoint + +val data: RDD[LabeledPoint] = ... +val model = LogisticRegressionWithLBFGS.train(data) +``` + +**New (DataFrame-based):** +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val data: DataFrame = ... +val lr = new LogisticRegression() +val model = lr.fit(data) +``` + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/ml/](../examples/src/main/scala/org/apache/spark/examples/ml/) for complete examples. + +## Further Reading + +- [ML Programming Guide](../docs/ml-guide.md) (DataFrame-based API) +- [MLlib Programming Guide](../docs/mllib-guide.md) (RDD-based API - legacy) +- [ML Pipelines](../docs/ml-pipeline.md) +- [ML Tuning](../docs/ml-tuning.md) +- [Feature Extraction](../docs/ml-features.md) + +## Contributing + +For contributing to MLlib, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +New features should use the DataFrame-based API (`spark.ml`). diff --git a/streaming/README.md b/streaming/README.md new file mode 100644 index 0000000000000..4e16b8f12b11e --- /dev/null +++ b/streaming/README.md @@ -0,0 +1,430 @@ +# Spark Streaming + +Spark Streaming provides scalable, high-throughput, fault-tolerant stream processing of live data streams. + +## Overview + +Spark Streaming supports two APIs: + +1. **DStreams (Discretized Streams)** - Legacy API (Deprecated as of Spark 3.4) +2. **Structured Streaming** - Modern API built on Spark SQL (Recommended) + +**Note**: DStreams are deprecated. For new applications, use **Structured Streaming** which is located in the `sql/core` module. + +## DStreams (Legacy API) + +### What are DStreams? + +DStreams represent a continuous stream of data, internally represented as a sequence of RDDs. + +**Key characteristics:** +- Micro-batch processing model +- Integration with Kafka, Flume, Kinesis, TCP sockets, and more +- Windowing operations for time-based aggregations +- Stateful transformations with updateStateByKey +- Fault tolerance through checkpointing + +### Location + +- Scala/Java: `src/main/scala/org/apache/spark/streaming/` +- Python: `../python/pyspark/streaming/` + +### Basic Example + +```scala +import org.apache.spark.streaming._ +import org.apache.spark.SparkConf + +val conf = new SparkConf().setAppName("NetworkWordCount") +val ssc = new StreamingContext(conf, Seconds(1)) + +// Create DStream from TCP source +val lines = ssc.socketTextStream("localhost", 9999) + +// Process the stream +val words = lines.flatMap(_.split(" ")) +val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) + +// Print results +wordCounts.print() + +// Start the computation +ssc.start() +ssc.awaitTermination() +``` + +### Key Components + +#### StreamingContext + +The main entry point for streaming functionality. 
+ +**File**: `src/main/scala/org/apache/spark/streaming/StreamingContext.scala` + +**Usage:** +```scala +val ssc = new StreamingContext(sparkContext, Seconds(batchInterval)) +// or +val ssc = new StreamingContext(conf, Seconds(batchInterval)) +``` + +#### DStream + +The fundamental abstraction for a continuous data stream. + +**File**: `src/main/scala/org/apache/spark/streaming/dstream/DStream.scala` + +**Operations:** +- **Transformations**: map, flatMap, filter, reduce, join, window +- **Output Operations**: print, saveAsTextFiles, foreachRDD + +#### Input Sources + +**Built-in sources:** +- `socketTextStream`: TCP socket source +- `textFileStream`: File system monitoring +- `queueStream`: Queue-based testing source + +**Advanced sources** (require external libraries): +- Kafka: `KafkaUtils.createStream` +- Flume: `FlumeUtils.createStream` +- Kinesis: `KinesisUtils.createStream` + +**Location**: `src/main/scala/org/apache/spark/streaming/dstream/` + +### Windowing Operations + +Process data over sliding windows: + +```scala +val windowedStream = lines + .window(Seconds(30), Seconds(10)) // 30s window, 10s slide + +val windowedWordCounts = words + .map(x => (x, 1)) + .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) +``` + +### Stateful Operations + +Maintain state across batches: + +```scala +def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = { + val newCount = runningCount.getOrElse(0) + newValues.sum + Some(newCount) +} + +val runningCounts = pairs.updateStateByKey(updateFunction) +``` + +### Checkpointing + +Essential for stateful operations and fault tolerance: + +```scala +ssc.checkpoint("hdfs://checkpoint/directory") +``` + +**What gets checkpointed:** +- Configuration +- DStream operations +- Incomplete batches +- State data (for stateful operations) + +### Performance Tuning + +**Batch Interval** +- Set based on processing time and latency requirements +- Too small: overhead increases +- Too large: latency increases + +**Parallelism** +```scala +// Increase receiver parallelism +val numStreams = 5 +val streams = (1 to numStreams).map(_ => ssc.socketTextStream(...)) +val unifiedStream = ssc.union(streams) + +// Repartition for processing +val repartitioned = dstream.repartition(10) +``` + +**Memory Management** +```scala +conf.set("spark.streaming.receiver.maxRate", "10000") +conf.set("spark.streaming.kafka.maxRatePerPartition", "1000") +``` + +## Structured Streaming (Recommended) + +For new applications, use Structured Streaming instead of DStreams. 
+ +**Location**: `../sql/core/src/main/scala/org/apache/spark/sql/streaming/` + +**Example:** +```scala +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.streaming._ + +val spark = SparkSession.builder() + .appName("StructuredNetworkWordCount") + .getOrCreate() + +import spark.implicits._ + +// Create DataFrame from stream source +val lines = spark + .readStream + .format("socket") + .option("host", "localhost") + .option("port", 9999) + .load() + +// Process the stream +val words = lines.as[String].flatMap(_.split(" ")) +val wordCounts = words.groupBy("value").count() + +// Output the stream +val query = wordCounts + .writeStream + .outputMode("complete") + .format("console") + .start() + +query.awaitTermination() +``` + +**Advantages over DStreams:** +- Unified API with batch processing +- Better performance with Catalyst optimizer +- Exactly-once semantics +- Event time processing +- Watermarking for late data +- Easier to reason about + +See [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) for details. + +## Building and Testing + +### Build Streaming Module + +```bash +# Build streaming module +./build/mvn -pl streaming -am package + +# Skip tests +./build/mvn -pl streaming -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all streaming tests +./build/mvn test -pl streaming + +# Run specific test suite +./build/mvn test -pl streaming -Dtest=BasicOperationsSuite +``` + +## Source Code Organization + +``` +streaming/src/main/ +├── scala/org/apache/spark/streaming/ +│ ├── StreamingContext.scala # Main entry point +│ ├── Time.scala # Time utilities +│ ├── Checkpoint.scala # Checkpointing +│ ├── dstream/ +│ │ ├── DStream.scala # Base DStream +│ │ ├── InputDStream.scala # Input sources +│ │ ├── ReceiverInputDStream.scala # Receiver-based input +│ │ ├── WindowedDStream.scala # Windowing operations +│ │ ├── StateDStream.scala # Stateful operations +│ │ └── PairDStreamFunctions.scala # Key-value operations +│ ├── receiver/ +│ │ ├── Receiver.scala # Base receiver class +│ │ ├── ReceiverSupervisor.scala # Receiver management +│ │ └── BlockGenerator.scala # Block generation +│ ├── scheduler/ +│ │ ├── JobScheduler.scala # Job scheduling +│ │ ├── JobGenerator.scala # Job generation +│ │ └── ReceiverTracker.scala # Receiver tracking +│ └── ui/ +│ └── StreamingTab.scala # Web UI +└── resources/ +``` + +## Integration with External Systems + +### Apache Kafka + +**Deprecated DStreams approach:** +```scala +import org.apache.spark.streaming.kafka010._ + +val kafkaParams = Map[String, Object]( + "bootstrap.servers" -> "localhost:9092", + "key.deserializer" -> classOf[StringDeserializer], + "value.deserializer" -> classOf[StringDeserializer], + "group.id" -> "test-group" +) + +val stream = KafkaUtils.createDirectStream[String, String]( + ssc, + PreferConsistent, + Subscribe[String, String](topics, kafkaParams) +) +``` + +**Recommended Structured Streaming approach:** +```scala +val df = spark + .readStream + .format("kafka") + .option("kafka.bootstrap.servers", "localhost:9092") + .option("subscribe", "topic1") + .load() +``` + +See [Kafka Integration Guide](../docs/streaming-kafka-integration.md). 
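+
+As a small continuation of the Structured Streaming snippet above (a sketch only; `df` is the DataFrame read from Kafka there, and the console sink is just for illustration): Kafka rows carry binary `key` and `value` columns, so a typical first step is casting `value` to a string before applying ordinary DataFrame operations.
+
+```scala
+// `df` is the streaming DataFrame read from Kafka in the snippet above.
+// Kafka delivers key/value as binary, so cast before processing.
+val messages = df.selectExpr("CAST(value AS STRING) AS message")
+
+// A simple streaming aggregation over the decoded messages.
+val counts = messages.groupBy("message").count()
+
+val query = counts.writeStream
+  .outputMode("complete")   // aggregations use complete or update mode
+  .format("console")
+  .start()
+
+query.awaitTermination()
+```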
+ +### Amazon Kinesis + +```scala +import org.apache.spark.streaming.kinesis._ + +val stream = KinesisInputDStream.builder + .streamingContext(ssc) + .endpointUrl("https://kinesis.us-east-1.amazonaws.com") + .regionName("us-east-1") + .streamName("myStream") + .build() +``` + +See [Kinesis Integration Guide](../docs/streaming-kinesis-integration.md). + +## Monitoring and Debugging + +### Streaming UI + +Access at: `http://:4040/streaming/` + +**Metrics:** +- Batch processing times +- Input rates +- Scheduling delays +- Active batches + +### Logs + +Enable detailed logging: +```properties +log4j.logger.org.apache.spark.streaming=DEBUG +``` + +### Metrics + +Key metrics to monitor: +- **Batch Processing Time**: Should be < batch interval +- **Scheduling Delay**: Should be minimal +- **Total Delay**: End-to-end delay +- **Input Rate**: Records per second + +## Common Issues + +### Batch Processing Time > Batch Interval + +**Symptoms**: Scheduling delay increases over time + +**Solutions:** +- Increase parallelism +- Optimize transformations +- Increase resources (executors, memory) +- Reduce batch interval data volume + +### Out of Memory Errors + +**Solutions:** +- Increase executor memory +- Enable compression +- Reduce window/batch size +- Persist less data + +### Receiver Failures + +**Solutions:** +- Enable WAL (Write-Ahead Logs) +- Increase receiver memory +- Add multiple receivers +- Use Structured Streaming with better fault tolerance + +## Migration from DStreams to Structured Streaming + +**Why migrate:** +- DStreams are deprecated +- Better performance and semantics +- Unified API with batch processing +- Active development and support + +**Key differences:** +- DataFrame/Dataset API instead of RDDs +- Declarative operations +- Built-in support for event time +- Exactly-once semantics by default + +**Migration guide**: See [Structured Streaming Migration Guide](../docs/ss-migration-guide.md) + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/streaming/](../examples/src/main/scala/org/apache/spark/examples/streaming/) for more examples. + +**Key examples:** +- `NetworkWordCount.scala`: Basic word count +- `StatefulNetworkWordCount.scala`: Stateful processing +- `WindowedNetworkWordCount.scala`: Window operations +- `KafkaWordCount.scala`: Kafka integration + +## Configuration + +Key configuration parameters: + +```properties +# Batch interval (set in code) +# StreamingContext(conf, Seconds(batchInterval)) + +# Backpressure (rate limiting) +spark.streaming.backpressure.enabled=true + +# Receiver memory +spark.streaming.receiver.maxRate=10000 + +# Checkpoint interval +spark.streaming.checkpoint.interval=10s + +# Graceful shutdown +spark.streaming.stopGracefullyOnShutdown=true +``` + +## Best Practices + +1. **Use Structured Streaming for new applications** +2. **Set appropriate batch intervals** based on latency requirements +3. **Enable checkpointing** for stateful operations +4. **Monitor batch processing times** to ensure they're less than batch interval +5. **Use backpressure** to handle variable input rates +6. **Test failure scenarios** with checkpointing +7. 
**Consider using Kafka** for reliable message delivery + +## Further Reading + +- [Structured Streaming Programming Guide](../docs/structured-streaming-programming-guide.md) (Recommended) +- [DStreams Programming Guide](../docs/streaming-programming-guide.md) (Legacy) +- [Kafka Integration](../docs/streaming-kafka-integration.md) +- [Kinesis Integration](../docs/streaming-kinesis-integration.md) + +## Contributing + +For contributing to Spark Streaming, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +Note: New features should focus on Structured Streaming rather than DStreams. From 6293b49cde89c561316e97515f21e9f380d051cc Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:26:03 +0000 Subject: [PATCH 4/6] Add README files for launcher, resource-managers, and sbin directories Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- launcher/README.md | 475 +++++++++++++++++++++++++++++++++ resource-managers/README.md | 514 ++++++++++++++++++++++++++++++++++++ sbin/README.md | 514 ++++++++++++++++++++++++++++++++++++ 3 files changed, 1503 insertions(+) create mode 100644 launcher/README.md create mode 100644 resource-managers/README.md create mode 100644 sbin/README.md diff --git a/launcher/README.md b/launcher/README.md new file mode 100644 index 0000000000000..7b49bcba4cfae --- /dev/null +++ b/launcher/README.md @@ -0,0 +1,475 @@ +# Spark Launcher + +The Spark Launcher library provides a programmatic interface for launching Spark applications. + +## Overview + +The Launcher module allows you to: +- Launch Spark applications programmatically from Java/Scala code +- Monitor application state and output +- Manage Spark processes +- Build command-line arguments programmatically + +This is an alternative to invoking `spark-submit` via shell commands. + +## Key Components + +### SparkLauncher + +The main class for launching Spark applications. + +**Location**: `src/main/java/org/apache/spark/launcher/SparkLauncher.java` + +**Basic Usage:** +```java +import org.apache.spark.launcher.SparkLauncher; + +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.MyApp") + .setMaster("spark://master:7077") + .setConf(SparkLauncher.DRIVER_MEMORY, "2g") + .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") + .addAppArgs("arg1", "arg2"); + +Process spark = launcher.launch(); +spark.waitFor(); +``` + +### SparkAppHandle + +Interface for monitoring launched applications. 
+ +**Location**: `src/main/java/org/apache/spark/launcher/SparkAppHandle.java` + +**Usage:** +```java +import org.apache.spark.launcher.SparkAppHandle; + +SparkAppHandle handle = launcher.startApplication(); + +// Add listener for state changes +handle.addListener(new SparkAppHandle.Listener() { + @Override + public void stateChanged(SparkAppHandle handle) { + System.out.println("State: " + handle.getState()); + } + + @Override + public void infoChanged(SparkAppHandle handle) { + System.out.println("App ID: " + handle.getAppId()); + } +}); + +// Wait for completion +while (!handle.getState().isFinal()) { + Thread.sleep(1000); +} +``` + +## API Reference + +### Configuration Methods + +```java +SparkLauncher launcher = new SparkLauncher(); + +// Application settings +launcher.setAppResource("/path/to/app.jar"); +launcher.setMainClass("com.example.MainClass"); +launcher.setAppName("MyApplication"); + +// Cluster settings +launcher.setMaster("spark://master:7077"); +launcher.setDeployMode("cluster"); + +// Resource settings +launcher.setConf(SparkLauncher.DRIVER_MEMORY, "2g"); +launcher.setConf(SparkLauncher.EXECUTOR_MEMORY, "4g"); +launcher.setConf(SparkLauncher.EXECUTOR_CORES, "2"); + +// Additional configurations +launcher.setConf("spark.executor.instances", "5"); +launcher.setConf("spark.sql.shuffle.partitions", "200"); + +// Dependencies +launcher.addJar("/path/to/dependency.jar"); +launcher.addFile("/path/to/file.txt"); +launcher.addPyFile("/path/to/module.py"); + +// Application arguments +launcher.addAppArgs("arg1", "arg2", "arg3"); + +// Environment +launcher.setSparkHome("/path/to/spark"); +launcher.setPropertiesFile("/path/to/spark-defaults.conf"); +launcher.setVerbose(true); +``` + +### Launch Methods + +```java +// Launch and return Process handle +Process process = launcher.launch(); + +// Launch and return SparkAppHandle for monitoring +SparkAppHandle handle = launcher.startApplication(); + +// For child process mode (rare) +SparkAppHandle handle = launcher.startApplication( + new SparkAppHandle.Listener() { + // Listener implementation + } +); +``` + +### Constants + +Common configuration keys are available as constants: + +```java +SparkLauncher.SPARK_MASTER // "spark.master" +SparkLauncher.APP_RESOURCE // "spark.app.resource" +SparkLauncher.APP_NAME // "spark.app.name" +SparkLauncher.DRIVER_MEMORY // "spark.driver.memory" +SparkLauncher.DRIVER_EXTRA_CLASSPATH // "spark.driver.extraClassPath" +SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS // "spark.driver.extraJavaOptions" +SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH // "spark.driver.extraLibraryPath" +SparkLauncher.EXECUTOR_MEMORY // "spark.executor.memory" +SparkLauncher.EXECUTOR_CORES // "spark.executor.cores" +SparkLauncher.EXECUTOR_EXTRA_CLASSPATH // "spark.executor.extraClassPath" +SparkLauncher.EXECUTOR_EXTRA_JAVA_OPTIONS // "spark.executor.extraJavaOptions" +SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH // "spark.executor.extraLibraryPath" +``` + +## Application States + +The `SparkAppHandle.State` enum represents application lifecycle states: + +- `UNKNOWN`: Initial state +- `CONNECTED`: Connected to Spark +- `SUBMITTED`: Application submitted +- `RUNNING`: Application running +- `FINISHED`: Completed successfully +- `FAILED`: Failed with error +- `KILLED`: Killed by user +- `LOST`: Connection lost + +**Check if final:** +```java +if (handle.getState().isFinal()) { + // Application has completed +} +``` + +## Examples + +### Launch Scala Application + +```java +import org.apache.spark.launcher.SparkLauncher; + +public class 
LaunchSparkApp { + public static void main(String[] args) throws Exception { + Process spark = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.SparkApp") + .setMaster("local[2]") + .setConf(SparkLauncher.DRIVER_MEMORY, "2g") + .launch(); + + spark.waitFor(); + System.exit(spark.exitValue()); + } +} +``` + +### Launch Python Application + +```java +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.py") + .setMaster("yarn") + .setDeployMode("cluster") + .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") + .addPyFile("/path/to/dependency.py") + .addAppArgs("--input", "/data/input", "--output", "/data/output"); + +SparkAppHandle handle = launcher.startApplication(); +``` + +### Monitor Application with Listener + +```java +import org.apache.spark.launcher.SparkAppHandle; + +class MyListener implements SparkAppHandle.Listener { + @Override + public void stateChanged(SparkAppHandle handle) { + SparkAppHandle.State state = handle.getState(); + System.out.println("Application state changed to: " + state); + + if (state.isFinal()) { + if (state == SparkAppHandle.State.FINISHED) { + System.out.println("Application completed successfully"); + } else { + System.out.println("Application failed: " + state); + } + } + } + + @Override + public void infoChanged(SparkAppHandle handle) { + System.out.println("Application ID: " + handle.getAppId()); + } +} + +// Use the listener +SparkAppHandle handle = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("spark://master:7077") + .startApplication(new MyListener()); +``` + +### Capture Output + +```java +import java.io.*; + +Process spark = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("local") + .redirectOutput(ProcessBuilder.Redirect.PIPE) + .redirectError(ProcessBuilder.Redirect.PIPE) + .launch(); + +// Read output +BufferedReader reader = new BufferedReader( + new InputStreamReader(spark.getInputStream()) +); +String line; +while ((line = reader.readLine()) != null) { + System.out.println(line); +} + +spark.waitFor(); +``` + +### Kill Running Application + +```java +SparkAppHandle handle = launcher.startApplication(); + +// Later, kill the application +handle.kill(); + +// Or stop gracefully +handle.stop(); +``` + +## In-Process Launcher + +For testing or special cases, launch Spark in the same JVM: + +```java +import org.apache.spark.launcher.InProcessLauncher; + +InProcessLauncher launcher = new InProcessLauncher(); +// Configure launcher... +SparkAppHandle handle = launcher.startApplication(); +``` + +**Note**: This is primarily for testing. Production code should use `SparkLauncher`. 
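+
+Since the launcher is a plain Java library, it can also be driven from Scala. A minimal sketch (the application path, main class, and master URL are placeholders, not values from this repository):
+
+```scala
+import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
+
+object LaunchFromScala {
+  def main(args: Array[String]): Unit = {
+    // All paths and names below are placeholders (assumptions).
+    val handle: SparkAppHandle = new SparkLauncher()
+      .setAppResource("/path/to/app.jar")
+      .setMainClass("com.example.MyApp")
+      .setMaster("local[2]")
+      .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
+      .startApplication()
+
+    // Poll until the application reaches a final state (FINISHED, FAILED, KILLED).
+    while (!handle.getState.isFinal) {
+      Thread.sleep(1000)
+    }
+    println(s"Final state: ${handle.getState}")
+  }
+}
+```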
+ +## Building and Testing + +### Build Launcher Module + +```bash +# Build launcher module +./build/mvn -pl launcher -am package + +# Skip tests +./build/mvn -pl launcher -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all launcher tests +./build/mvn test -pl launcher + +# Run specific test +./build/mvn test -pl launcher -Dtest=SparkLauncherSuite +``` + +## Source Code Organization + +``` +launcher/src/main/java/org/apache/spark/launcher/ +├── SparkLauncher.java # Main launcher class +├── SparkAppHandle.java # Application handle interface +├── AbstractLauncher.java # Base launcher implementation +├── InProcessLauncher.java # In-process launcher (testing) +├── Main.java # Entry point for spark-submit +├── SparkSubmitCommandBuilder.java # Builds spark-submit commands +├── CommandBuilderUtils.java # Command building utilities +└── LauncherBackend.java # Backend communication +``` + +## Integration with spark-submit + +The Launcher library is used internally by `spark-submit`: + +``` +spark-submit script + ↓ +Main.main() + ↓ +SparkSubmitCommandBuilder + ↓ +Launch JVM with SparkSubmit +``` + +## Configuration Priority + +Configuration values are resolved in this order (highest priority first): + +1. Values set via `setConf()` or specific setters +2. Properties file specified with `setPropertiesFile()` +3. `conf/spark-defaults.conf` in `SPARK_HOME` +4. Environment variables + +## Environment Variables + +The launcher respects these environment variables: + +- `SPARK_HOME`: Spark installation directory +- `JAVA_HOME`: Java installation directory +- `SPARK_CONF_DIR`: Configuration directory +- `HADOOP_CONF_DIR`: Hadoop configuration directory +- `YARN_CONF_DIR`: YARN configuration directory + +## Security Considerations + +When launching applications programmatically: + +1. **Validate inputs**: Sanitize application arguments +2. **Secure credentials**: Don't hardcode secrets +3. **Limit permissions**: Run with minimal required privileges +4. **Monitor processes**: Track launched applications +5. 
**Clean up resources**: Always close handles and processes + +## Common Use Cases + +### Workflow Orchestration + +Launch Spark jobs as part of data pipelines: + +```java +public class DataPipeline { + public void runStage(String stageName, String mainClass) throws Exception { + SparkAppHandle handle = new SparkLauncher() + .setAppResource("/path/to/pipeline.jar") + .setMainClass(mainClass) + .setMaster("yarn") + .setAppName("Pipeline-" + stageName) + .startApplication(); + + // Wait for completion + while (!handle.getState().isFinal()) { + Thread.sleep(1000); + } + + if (handle.getState() != SparkAppHandle.State.FINISHED) { + throw new RuntimeException("Stage " + stageName + " failed"); + } + } +} +``` + +### Testing + +Launch Spark applications in integration tests: + +```java +@Test +public void testSparkApp() throws Exception { + SparkAppHandle handle = new SparkLauncher() + .setAppResource("target/test-app.jar") + .setMainClass("com.example.TestApp") + .setMaster("local[2]") + .startApplication(); + + // Wait for completion + handle.waitFor(60000); // 60 second timeout + + assertEquals(SparkAppHandle.State.FINISHED, handle.getState()); +} +``` + +### Resource Management + +Launch applications with dynamic resource allocation: + +```java +int executors = calculateRequiredExecutors(dataSize); +String memory = calculateMemory(dataSize); + +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("yarn") + .setConf("spark.executor.instances", String.valueOf(executors)) + .setConf(SparkLauncher.EXECUTOR_MEMORY, memory) + .setConf("spark.dynamicAllocation.enabled", "true"); +``` + +## Best Practices + +1. **Use SparkAppHandle**: Monitor application state +2. **Add listeners**: Track state changes and failures +3. **Set timeouts**: Don't wait indefinitely +4. **Handle errors**: Check exit codes and states +5. **Clean up**: Stop handles and processes +6. **Log everything**: Record launches and outcomes +7. **Use constants**: Use SparkLauncher constants for config keys + +## Troubleshooting + +### Application Not Starting + +**Check:** +- SPARK_HOME is set correctly +- Application JAR path is correct +- Master URL is valid +- Required resources are available + +### Process Hangs + +**Solutions:** +- Add timeout: `handle.waitFor(timeout)` +- Check for deadlocks in application +- Verify cluster has capacity +- Check logs for issues + +### Cannot Monitor Application + +**Solutions:** +- Use `startApplication()` instead of `launch()` +- Add listener before starting +- Check for connection issues +- Verify cluster is accessible + +## Further Reading + +- [Submitting Applications](../docs/submitting-applications.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) + +## API Documentation + +Full JavaDoc available in the built JAR or online at: +https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/package-summary.html diff --git a/resource-managers/README.md b/resource-managers/README.md new file mode 100644 index 0000000000000..f87ed0f06ad98 --- /dev/null +++ b/resource-managers/README.md @@ -0,0 +1,514 @@ +# Spark Resource Managers + +This directory contains integrations with various cluster resource managers. 
+ +## Overview + +Spark can run on different cluster managers: +- **YARN** (Hadoop YARN) +- **Kubernetes** (Container orchestration) +- **Mesos** (General-purpose cluster manager) +- **Standalone** (Spark's built-in cluster manager) + +Each integration provides Spark-specific implementation for: +- Resource allocation +- Task scheduling +- Application lifecycle management +- Security integration + +## Modules + +### kubernetes/ + +Integration with Kubernetes for container-based deployments. + +**Location**: `kubernetes/` + +**Key Features:** +- Native Kubernetes resource management +- Dynamic executor allocation +- Volume mounting support +- Kerberos integration +- Custom resource definitions + +**Running on Kubernetes:** +```bash +./bin/spark-submit \ + --master k8s://https://: \ + --deploy-mode cluster \ + --name spark-pi \ + --class org.apache.spark.examples.SparkPi \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=spark:3.5.0 \ + local:///opt/spark/examples/jars/spark-examples.jar +``` + +**Documentation**: See [running-on-kubernetes.md](../docs/running-on-kubernetes.md) + +### mesos/ + +Integration with Apache Mesos cluster manager. + +**Location**: `mesos/` + +**Key Features:** +- Fine-grained mode (one task per Mesos task) +- Coarse-grained mode (dedicated executors) +- Dynamic allocation +- Mesos frameworks integration + +**Running on Mesos:** +```bash +./bin/spark-submit \ + --master mesos://mesos-master:5050 \ + --deploy-mode cluster \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Documentation**: Check Apache Mesos documentation + +### yarn/ + +Integration with Hadoop YARN (Yet Another Resource Negotiator). + +**Location**: `yarn/` + +**Key Features:** +- Client and cluster deploy modes +- Dynamic resource allocation +- YARN container management +- Security integration (Kerberos) +- External shuffle service +- Application timeline service integration + +**Running on YARN:** +```bash +# Client mode (driver runs locally) +./bin/spark-submit \ + --master yarn \ + --deploy-mode client \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar + +# Cluster mode (driver runs on YARN) +./bin/spark-submit \ + --master yarn \ + --deploy-mode cluster \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Documentation**: See [running-on-yarn.md](../docs/running-on-yarn.md) + +## Comparison + +### YARN + +**Best for:** +- Existing Hadoop deployments +- Enterprise environments with Hadoop ecosystem +- Multi-tenancy with resource queues +- Organizations standardized on YARN + +**Pros:** +- Mature and stable +- Rich security features +- Queue-based resource management +- Good tooling and monitoring + +**Cons:** +- Requires Hadoop installation +- More complex setup +- Higher overhead + +### Kubernetes + +**Best for:** +- Cloud-native deployments +- Containerized applications +- Modern microservices architectures +- Multi-cloud environments + +**Pros:** +- Container isolation +- Modern orchestration features +- Cloud provider integration +- Active development community + +**Cons:** +- Newer integration (less mature) +- Requires Kubernetes cluster +- Learning curve for K8s + +### Mesos + +**Best for:** +- General-purpose cluster management +- Mixed workload environments (not just Spark) +- Large-scale deployments + +**Pros:** +- Fine-grained resource allocation +- Flexible framework support +- Good for mixed workloads + +**Cons:** +- Less common than YARN/K8s +- Setup complexity +- Smaller community 
+ +### Standalone + +**Best for:** +- Quick start and development +- Small clusters +- Dedicated Spark clusters + +**Pros:** +- Simple setup +- No dependencies +- Fast deployment + +**Cons:** +- Limited resource management +- No multi-tenancy +- Basic scheduling + +## Architecture + +### Resource Manager Integration + +``` +Spark Application + ↓ +SparkContext + ↓ +Cluster Manager Client + ↓ +Resource Manager (YARN/K8s/Mesos) + ↓ +Container/Pod/Task Launch + ↓ +Executor Processes +``` + +### Common Components + +Each integration implements: + +1. **SchedulerBackend**: Launches executors and schedules tasks +2. **ApplicationMaster/Driver**: Manages application lifecycle +3. **ExecutorBackend**: Runs tasks on executors +4. **Resource Allocation**: Requests and manages resources +5. **Security Integration**: Authentication and authorization + +## Building + +### Build All Resource Manager Modules + +```bash +# Build all resource manager integrations +./build/mvn -pl 'resource-managers/*' -am package +``` + +### Build Specific Modules + +```bash +# YARN only +./build/mvn -pl resource-managers/yarn -am package + +# Kubernetes only +./build/mvn -pl resource-managers/kubernetes/core -am package + +# Mesos only +./build/mvn -pl resource-managers/mesos -am package +``` + +### Build with Specific Profiles + +```bash +# Build with Kubernetes support +./build/mvn -Pkubernetes package + +# Build with YARN support +./build/mvn -Pyarn package + +# Build with Mesos support (requires Mesos libraries) +./build/mvn -Pmesos package +``` + +## Configuration + +### YARN Configuration + +**Key settings:** +```properties +# Resource allocation +spark.executor.instances=10 +spark.executor.memory=4g +spark.executor.cores=2 + +# YARN specific +spark.yarn.am.memory=1g +spark.yarn.am.cores=1 +spark.yarn.queue=default +spark.yarn.jars=hdfs:///spark-jars/* + +# Dynamic allocation +spark.dynamicAllocation.enabled=true +spark.dynamicAllocation.minExecutors=1 +spark.dynamicAllocation.maxExecutors=100 +spark.shuffle.service.enabled=true +``` + +### Kubernetes Configuration + +**Key settings:** +```properties +# Container image +spark.kubernetes.container.image=my-spark:latest +spark.kubernetes.container.image.pullPolicy=Always + +# Resource allocation +spark.executor.instances=5 +spark.kubernetes.executor.request.cores=1 +spark.kubernetes.executor.limit.cores=2 +spark.kubernetes.executor.request.memory=4g + +# Namespace and service account +spark.kubernetes.namespace=spark +spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa + +# Volumes +spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc +spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data +``` + +### Mesos Configuration + +**Key settings:** +```properties +# Mesos master +spark.mesos.coarse=true +spark.executor.uri=hdfs://path/to/spark.tgz + +# Resource allocation +spark.executor.memory=4g +spark.cores.max=20 + +# Mesos specific +spark.mesos.role=spark +spark.mesos.constraints=rack:us-east +``` + +## Source Code Organization + +``` +resource-managers/ +├── kubernetes/ +│ ├── core/ # Core K8s integration +│ │ └── src/main/scala/org/apache/spark/ +│ │ ├── deploy/k8s/ # Deployment logic +│ │ ├── scheduler/ # K8s scheduler backend +│ │ └── executor/ # K8s executor backend +│ └── integration-tests/ # K8s integration tests +├── mesos/ +│ └── src/main/scala/org/apache/spark/ +│ ├── scheduler/ # Mesos scheduler +│ └── executor/ # Mesos executor +└── yarn/ + └── src/main/scala/org/apache/spark/ + ├── 
deploy/yarn/ # YARN deployment + ├── scheduler/ # YARN scheduler + └── executor/ # YARN executor +``` + +## Development + +### Testing Resource Manager Integrations + +```bash +# Run YARN tests +./build/mvn test -pl resource-managers/yarn + +# Run Kubernetes tests +./build/mvn test -pl resource-managers/kubernetes/core + +# Run Mesos tests +./build/mvn test -pl resource-managers/mesos +``` + +### Integration Tests + +**Kubernetes:** +```bash +cd resource-managers/kubernetes/integration-tests +./dev/dev-run-integration-tests.sh +``` + +See `kubernetes/integration-tests/README.md` for details. + +## Security + +### YARN Security + +**Kerberos authentication:** +```bash +./bin/spark-submit \ + --master yarn \ + --principal user@REALM \ + --keytab /path/to/user.keytab \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Token renewal:** +```properties +spark.yarn.principal=user@REALM +spark.yarn.keytab=/path/to/keytab +spark.yarn.token.renewal.interval=86400 +``` + +### Kubernetes Security + +**Service account:** +```properties +spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa +spark.kubernetes.authenticate.executor.serviceAccountName=spark-sa +``` + +**Secrets:** +```bash +kubectl create secret generic spark-secret --from-literal=password=mypassword +``` + +```properties +spark.kubernetes.driver.secrets.spark-secret=/etc/secrets +``` + +### Mesos Security + +**Authentication:** +```properties +spark.mesos.principal=spark-user +spark.mesos.secret=spark-secret +``` + +## Migration Guide + +### Moving from Standalone to YARN + +1. Set up Hadoop cluster +2. Configure YARN resource manager +3. Enable external shuffle service +4. Update spark-submit commands to use `--master yarn` +5. Test dynamic allocation + +### Moving from YARN to Kubernetes + +1. Build Docker image with Spark +2. Push image to container registry +3. Create Kubernetes namespace and service account +4. Update spark-submit to use `--master k8s://` +5. Configure volume mounts for data access + +## Troubleshooting + +### YARN Issues + +**Application stuck in ACCEPTED state:** +- Check YARN capacity +- Verify queue settings +- Check resource availability + +**Container allocation failures:** +- Increase memory overhead +- Check node resources +- Verify memory/core requests + +### Kubernetes Issues + +**Image pull failures:** +- Verify image name and tag +- Check image pull secrets +- Ensure registry is accessible + +**Pod failures:** +- Check pod logs: `kubectl logs ` +- Verify service account permissions +- Check resource limits + +### Mesos Issues + +**Framework registration failures:** +- Verify Mesos master URL +- Check authentication settings +- Ensure proper role configuration + +## Best Practices + +1. **Choose the right manager**: Based on infrastructure and requirements +2. **Enable dynamic allocation**: For better resource utilization +3. **Use external shuffle service**: For executor failure tolerance +4. **Configure memory overhead**: Account for non-heap memory +5. **Monitor resource usage**: Track executor and driver metrics +6. **Use appropriate deploy mode**: Client for interactive, cluster for production +7. **Implement security**: Enable authentication and encryption +8. 
**Test failure scenarios**: Verify fault tolerance + +## Performance Tuning + +### YARN Performance + +```properties +# Memory overhead +spark.yarn.executor.memoryOverhead=512m + +# Locality wait +spark.locality.wait=3s + +# Container reuse +spark.yarn.executor.launch.parallelism=10 +``` + +### Kubernetes Performance + +```properties +# Resource limits +spark.kubernetes.executor.limit.cores=2 + +# Volume performance +spark.kubernetes.driver.volumes.emptyDir.cache.medium=Memory + +# Network optimization +spark.kubernetes.executor.podNamePrefix=spark-exec +``` + +### Mesos Performance + +```properties +# Fine-grained mode for better sharing +spark.mesos.coarse=false + +# Container timeout +spark.mesos.executor.docker.pullTimeout=600 +``` + +## Further Reading + +- [Running on YARN](../docs/running-on-yarn.md) +- [Running on Kubernetes](../docs/running-on-kubernetes.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) + +## Contributing + +For contributing to resource manager integrations, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +When adding features: +- Ensure cross-compatibility +- Add comprehensive tests +- Update documentation +- Consider security implications diff --git a/sbin/README.md b/sbin/README.md new file mode 100644 index 0000000000000..dbe86cc1a8aa4 --- /dev/null +++ b/sbin/README.md @@ -0,0 +1,514 @@ +# Spark Admin Scripts + +This directory contains administrative scripts for managing Spark standalone clusters. + +## Overview + +The `sbin/` scripts are used by cluster administrators to: +- Start and stop Spark standalone clusters +- Start and stop individual daemons (master, workers, history server) +- Manage cluster lifecycle +- Configure cluster nodes + +**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools. + +## Cluster Management Scripts + +### start-all.sh / stop-all.sh + +Start or stop all Spark daemons on the cluster. + +**Usage:** +```bash +# Start master and all workers +./sbin/start-all.sh + +# Stop all daemons +./sbin/stop-all.sh +``` + +**What they do:** +- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers` +- `stop-all.sh`: Stops all master and worker daemons + +**Prerequisites:** +- SSH key-based authentication configured +- `conf/workers` file with worker hostnames +- Spark installed at same location on all machines + +**Configuration files:** +- `conf/workers`: List of worker hostnames (one per line) +- `conf/spark-env.sh`: Environment variables + +### start-master.sh / stop-master.sh + +Start or stop the Spark master daemon on the current machine. + +**Usage:** +```bash +# Start master +./sbin/start-master.sh + +# Stop master +./sbin/stop-master.sh +``` + +**Master Web UI**: Access at `http://:8080/` + +**Configuration:** +```bash +# In conf/spark-env.sh +export SPARK_MASTER_HOST=master-hostname +export SPARK_MASTER_PORT=7077 +export SPARK_MASTER_WEBUI_PORT=8080 +``` + +### start-worker.sh / stop-worker.sh + +Start or stop a Spark worker daemon on the current machine. 
+ +**Usage:** +```bash +# Start worker connecting to master +./sbin/start-worker.sh spark://master:7077 + +# Stop worker +./sbin/stop-worker.sh +``` + +**Worker Web UI**: Access at `http://:8081/` + +**Configuration:** +```bash +# In conf/spark-env.sh +export SPARK_WORKER_CORES=8 # Number of cores to use +export SPARK_WORKER_MEMORY=16g # Memory to allocate +export SPARK_WORKER_PORT=7078 # Worker port +export SPARK_WORKER_WEBUI_PORT=8081 +export SPARK_WORKER_DIR=/var/spark/work # Work directory +``` + +### start-workers.sh / stop-workers.sh + +Start or stop workers on all machines listed in `conf/workers`. + +**Usage:** +```bash +# Start all workers +./sbin/start-workers.sh spark://master:7077 + +# Stop all workers +./sbin/stop-workers.sh +``` + +**Requirements:** +- `conf/workers` file configured +- SSH access to all worker machines +- Master URL (for starting) + +## History Server Scripts + +### start-history-server.sh / stop-history-server.sh + +Start or stop the Spark History Server for viewing completed application logs. + +**Usage:** +```bash +# Start history server +./sbin/start-history-server.sh + +# Stop history server +./sbin/stop-history-server.sh +``` + +**History Server UI**: Access at `http://:18080/` + +**Configuration:** +```properties +# In conf/spark-defaults.conf +spark.history.fs.logDirectory=hdfs://namenode/spark-logs +spark.history.ui.port=18080 +spark.eventLog.enabled=true +spark.eventLog.dir=hdfs://namenode/spark-logs +``` + +**Requirements:** +- Applications must have event logging enabled +- Log directory must be accessible + +## Shuffle Service Scripts + +### start-shuffle-service.sh / stop-shuffle-service.sh + +Start or stop the external shuffle service (for YARN). + +**Usage:** +```bash +# Start shuffle service +./sbin/start-shuffle-service.sh + +# Stop shuffle service +./sbin/stop-shuffle-service.sh +``` + +**Note**: Typically used only when running on YARN without the YARN auxiliary service. + +## Configuration Files + +### conf/workers + +Lists worker hostnames, one per line. + +**Example:** +``` +worker1.example.com +worker2.example.com +worker3.example.com +``` + +**Usage:** +- Used by `start-all.sh` and `start-workers.sh` +- Each line should contain a hostname or IP address +- Blank lines and lines starting with `#` are ignored + +### conf/spark-env.sh + +Environment variables for Spark daemons. 
+ +**Example:** +```bash +#!/usr/bin/env bash + +# Java +export JAVA_HOME=/usr/lib/jvm/java-17 + +# Master settings +export SPARK_MASTER_HOST=master.example.com +export SPARK_MASTER_PORT=7077 +export SPARK_MASTER_WEBUI_PORT=8080 + +# Worker settings +export SPARK_WORKER_CORES=8 +export SPARK_WORKER_MEMORY=16g +export SPARK_WORKER_PORT=7078 +export SPARK_WORKER_WEBUI_PORT=8081 +export SPARK_WORKER_DIR=/var/spark/work + +# Directories +export SPARK_LOG_DIR=/var/log/spark +export SPARK_PID_DIR=/var/run/spark + +# History Server +export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs" + +# Additional Java options +export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181" +``` + +**Key Variables:** + +**Master:** +- `SPARK_MASTER_HOST`: Master hostname +- `SPARK_MASTER_PORT`: Master port (default: 7077) +- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080) + +**Worker:** +- `SPARK_WORKER_CORES`: Number of cores per worker +- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g) +- `SPARK_WORKER_PORT`: Worker communication port +- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081) +- `SPARK_WORKER_DIR`: Directory for scratch space and logs +- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine + +**General:** +- `SPARK_LOG_DIR`: Directory for daemon logs +- `SPARK_PID_DIR`: Directory for PID files +- `SPARK_IDENT_STRING`: Identifier for daemons (default: username) +- `SPARK_NICENESS`: Nice value for daemons +- `SPARK_DAEMON_MEMORY`: Memory for daemon processes + +## Setting Up a Standalone Cluster + +### Step 1: Install Spark on All Nodes + +```bash +# Download and extract Spark on each machine +tar xzf spark-X.Y.Z-bin-hadoopX.tgz +cd spark-X.Y.Z-bin-hadoopX +``` + +### Step 2: Configure spark-env.sh + +Create `conf/spark-env.sh` from template: +```bash +cp conf/spark-env.sh.template conf/spark-env.sh +# Edit conf/spark-env.sh with appropriate settings +``` + +### Step 3: Configure Workers File + +Create `conf/workers`: +```bash +cp conf/workers.template conf/workers +# Add worker hostnames, one per line +``` + +### Step 4: Configure SSH Access + +Set up password-less SSH from master to all workers: +```bash +ssh-keygen -t rsa +ssh-copy-id user@worker1 +ssh-copy-id user@worker2 +# ... for each worker +``` + +### Step 5: Synchronize Configuration + +Copy configuration to all workers: +```bash +for host in $(cat conf/workers); do + rsync -av conf/ user@$host:spark/conf/ +done +``` + +### Step 6: Start the Cluster + +```bash +./sbin/start-all.sh +``` + +### Step 7: Verify + +- Check master UI: `http://master:8080` +- Check worker UIs: `http://worker1:8081`, etc. +- Look for workers registered with master + +## High Availability + +For production deployments, configure high availability with ZooKeeper. + +### ZooKeeper-based HA Configuration + +**In conf/spark-env.sh:** +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.deploy.recoveryMode=ZOOKEEPER + -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 + -Dspark.deploy.zookeeper.dir=/spark +" +``` + +### Start Multiple Masters + +```bash +# On master1 +./sbin/start-master.sh + +# On master2 +./sbin/start-master.sh + +# On master3 +./sbin/start-master.sh +``` + +### Connect Workers to All Masters + +```bash +./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077 +``` + +**Automatic failover:** If active master fails, standby masters detect the failure and one becomes active. 
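+
+One way to verify which master is currently active is to query each master's web UI JSON endpoint (this assumes the default web UI port 8080; the exact JSON layout can vary between Spark versions):
+
+```bash
+# Report each master's status: ALIVE for the active master, STANDBY for the others
+for m in master1 master2 master3; do
+  printf '%s: ' "$m"
+  curl -s "http://$m:8080/json/" | grep -o '"status" *: *"[A-Z]*"'
+done
+```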
+ +## Monitoring and Logs + +### Log Files + +Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`): + +```bash +# Master log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out + +# Worker log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out + +# History Server log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out +``` + +### View Logs + +```bash +# Tail master log +tail -f logs/spark-*-master-*.out + +# Tail worker log +tail -f logs/spark-*-worker-*.out + +# Search for errors +grep ERROR logs/spark-*-master-*.out +``` + +### Web UIs + +- **Master UI**: `http://:8080` - Cluster status, workers, applications +- **Worker UI**: `http://:8081` - Worker status, running executors +- **Application UI**: `http://:4040` - Running application metrics +- **History Server**: `http://:18080` - Completed applications + +## Advanced Configuration + +### Memory Overhead + +Reserve memory for system processes: +```bash +export SPARK_DAEMON_MEMORY=2g +``` + +### Multiple Workers per Machine + +Run multiple worker instances on a single machine: +```bash +export SPARK_WORKER_INSTANCES=2 +export SPARK_WORKER_CORES=4 # Cores per instance +export SPARK_WORKER_MEMORY=8g # Memory per instance +``` + +### Work Directory + +Change worker scratch space: +```bash +export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work +``` + +### Port Configuration + +Use non-default ports: +```bash +export SPARK_MASTER_PORT=9077 +export SPARK_MASTER_WEBUI_PORT=9080 +export SPARK_WORKER_PORT=9078 +export SPARK_WORKER_WEBUI_PORT=9081 +``` + +## Security + +### Enable Authentication + +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.authenticate=true + -Dspark.authenticate.secret=your-secret-key +" +``` + +### Enable SSL + +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.ssl.enabled=true + -Dspark.ssl.keyStore=/path/to/keystore + -Dspark.ssl.keyStorePassword=password + -Dspark.ssl.trustStore=/path/to/truststore + -Dspark.ssl.trustStorePassword=password +" +``` + +## Troubleshooting + +### Master Won't Start + +**Check:** +1. Port 7077 is available: `netstat -an | grep 7077` +2. Hostname is resolvable: `ping $SPARK_MASTER_HOST` +3. Logs for errors: `cat logs/spark-*-master-*.out` + +### Workers Not Connecting + +**Check:** +1. Master URL is correct +2. Network connectivity: `telnet master 7077` +3. Firewall allows connections +4. Worker logs: `cat logs/spark-*-worker-*.out` + +### SSH Connection Issues + +**Solutions:** +1. Verify SSH key: `ssh worker1 echo test` +2. Check SSH config: `~/.ssh/config` +3. Use SSH agent: `eval $(ssh-agent); ssh-add` + +### Insufficient Resources + +**Check:** +- Worker has enough memory: `free -h` +- Enough cores available: `nproc` +- Disk space: `df -h` + +## Cluster Shutdown + +### Graceful Shutdown + +```bash +# Stop all workers first +./sbin/stop-workers.sh + +# Stop master +./sbin/stop-master.sh + +# Or stop everything +./sbin/stop-all.sh +``` + +### Check All Stopped + +```bash +# Check for running Java processes +jps | grep -E "(Master|Worker)" +``` + +### Force Kill if Needed + +```bash +# Kill any remaining Spark processes +pkill -f org.apache.spark.deploy +``` + +## Best Practices + +1. **Use HA in production**: Configure ZooKeeper-based HA +2. **Monitor resources**: Watch CPU, memory, disk usage +3. **Separate log directories**: Use dedicated disk for logs +4. **Regular maintenance**: Clean old logs and application data +5. **Automate startup**: Use systemd or init scripts +6. 
**Configure limits**: Set file descriptor and process limits +7. **Use external shuffle service**: For better fault tolerance +8. **Back up metadata**: Regularly back up ZooKeeper data + +## Scripts Reference + +| Script | Purpose | +|--------|---------| +| `start-all.sh` | Start master and all workers | +| `stop-all.sh` | Stop master and all workers | +| `start-master.sh` | Start master on current machine | +| `stop-master.sh` | Stop master | +| `start-worker.sh` | Start worker on current machine | +| `stop-worker.sh` | Stop worker | +| `start-workers.sh` | Start workers on all machines in `conf/workers` | +| `stop-workers.sh` | Stop all workers | +| `start-history-server.sh` | Start history server | +| `stop-history-server.sh` | Stop history server | + +## Further Reading + +- [Spark Standalone Mode](../docs/spark-standalone.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) +- [Monitoring Guide](../docs/monitoring.md) + +## User-Facing Scripts + +For user-facing scripts (spark-submit, spark-shell, etc.), see [../bin/README.md](../bin/README.md). From 8c06e81657319a50f64a0ef9a7ce18e38971b68d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:27:58 +0000 Subject: [PATCH 5/6] Update main README with documentation links and add code documentation guide Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- CODE_DOCUMENTATION_GUIDE.md | 612 ++++++++++++++++++++++++++++++++++++ README.md | 26 +- 2 files changed, 636 insertions(+), 2 deletions(-) create mode 100644 CODE_DOCUMENTATION_GUIDE.md diff --git a/CODE_DOCUMENTATION_GUIDE.md b/CODE_DOCUMENTATION_GUIDE.md new file mode 100644 index 0000000000000..1229bc447140b --- /dev/null +++ b/CODE_DOCUMENTATION_GUIDE.md @@ -0,0 +1,612 @@ +# Code Documentation Guide + +This guide describes documentation standards for Apache Spark source code. + +## Overview + +Good documentation helps developers understand and maintain code. Spark follows industry-standard documentation practices for each language it supports. + +## Scala Documentation (Scaladoc) + +Scala code uses Scaladoc for API documentation. + +### Basic Format + +```scala +/** + * Brief one-line description. + * + * Detailed description that can span multiple lines. + * Explain what this class/method does, important behavior, + * and any constraints or assumptions. + * + * @param paramName description of parameter + * @param anotherParam description of another parameter + * @return description of return value + * @throws ExceptionType when this exception is thrown + * @since 3.5.0 + * @note Important note about usage or behavior + */ +def methodName(paramName: String, anotherParam: Int): ReturnType = { + // Implementation +} +``` + +### Class Documentation + +```scala +/** + * Brief description of the class purpose. + * + * Detailed explanation of the class functionality, usage patterns, + * and important considerations. 
+ * + * Example usage: + * {{{ + * val example = new MyClass(param1, param2) + * example.doSomething() + * }}} + * + * @constructor Creates a new instance with the given parameters + * @param config Configuration object + * @param isLocal Whether running in local mode + * @since 3.0.0 + */ +class MyClass(config: SparkConf, isLocal: Boolean) extends Logging { + // Class implementation +} +``` + +### Code Examples + +Use triple braces for code examples: + +```scala +/** + * Transforms the RDD by applying a function to each element. + * + * Example: + * {{{ + * val rdd = sc.parallelize(1 to 10) + * val doubled = rdd.map(_ * 2) + * doubled.collect() // Array(2, 4, 6, ..., 20) + * }}} + * + * @param f function to apply to each element + * @return transformed RDD + */ +def map[U: ClassTag](f: T => U): RDD[U] +``` + +### Annotations + +Use Spark annotations for API stability: + +```scala +/** + * :: Experimental :: + * This feature is experimental and may change in future releases. + */ +@Experimental +class ExperimentalFeature + +/** + * :: DeveloperApi :: + * This is a developer API and may change between minor versions. + */ +@DeveloperApi +class DeveloperFeature + +/** + * :: Unstable :: + * This API is unstable and may change in patch releases. + */ +@Unstable +class UnstableFeature +``` + +### Internal APIs + +Mark internal classes and methods: + +```scala +/** + * Internal utility class for XYZ. + * + * @note This is an internal API and may change without notice. + */ +private[spark] class InternalUtil + +/** + * Internal method used by scheduler. + */ +private[scheduler] def internalMethod(): Unit +``` + +## Java Documentation (Javadoc) + +Java code uses Javadoc for API documentation. + +### Basic Format + +```java +/** + * Brief one-line description. + *

+ * Detailed description that can span multiple paragraphs.
+ * Explain what this class/method does and important behavior.
+ *
+ * @param paramName description of parameter
+ * @param anotherParam description of another parameter
+ * @return description of return value
+ * @throws ExceptionType when this exception is thrown
+ * @since 3.5.0
+ */
+public ReturnType methodName(String paramName, int anotherParam)
+    throws ExceptionType {
+  // Implementation
+}
+```
+
+### Class Documentation
+
+```java
+/**
+ * Brief description of the class purpose.
+ * <p>
+ * Detailed explanation of functionality, usage patterns,
+ * and important considerations.
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * MyClass example = new MyClass(param1, param2);
+ * example.doSomething();
+ * }</pre>
+ *
+ * @param <T> type parameter description
+ * @since 3.0.0
+ */
+public class MyClass<T> implements Serializable {
+  // Class implementation
+}
+```
+
+### Interface Documentation
+
+```java
+/**
+ * Interface for shuffle block resolution.
+ * <p>
+ * Implementations of this interface are responsible for
+ * resolving shuffle block locations and reading shuffle data.
+ * + * @since 2.3.0 + */ +public interface ShuffleBlockResolver { + /** + * Gets the data for a shuffle block. + * + * @param blockId the block identifier + * @return managed buffer containing the block data + */ + ManagedBuffer getBlockData(BlockId blockId); +} +``` + +## Python Documentation (Docstrings) + +Python code uses docstrings following PEP 257 and Google style. + +### Function Documentation + +```python +def function_name(param1: str, param2: int) -> bool: + """ + Brief one-line description. + + Detailed description that can span multiple lines. + Explain what this function does, important behavior, + and any constraints. + + Parameters + ---------- + param1 : str + Description of param1 + param2 : int + Description of param2 + + Returns + ------- + bool + Description of return value + + Raises + ------ + ValueError + When input is invalid + + Examples + -------- + >>> result = function_name("test", 42) + >>> print(result) + True + + Notes + ----- + Important notes about usage or behavior. + + .. versionadded:: 3.5.0 + """ + # Implementation + pass +``` + +### Class Documentation + +```python +class MyClass: + """ + Brief description of the class. + + Detailed explanation of the class functionality, + usage patterns, and important considerations. + + Parameters + ---------- + config : dict + Configuration dictionary + is_local : bool, optional + Whether running in local mode (default is False) + + Attributes + ---------- + config : dict + Stored configuration + state : str + Current state of the object + + Examples + -------- + >>> obj = MyClass({'key': 'value'}, is_local=True) + >>> obj.do_something() + + Notes + ----- + This class is thread-safe. + + .. versionadded:: 3.0.0 + """ + + def __init__(self, config: dict, is_local: bool = False): + self.config = config + self.is_local = is_local + self.state = "initialized" +``` + +### Type Hints + +Use type hints consistently: + +```python +from typing import List, Optional, Dict, Any, Union +from pyspark.sql import DataFrame + +def process_data( + df: DataFrame, + columns: List[str], + options: Optional[Dict[str, Any]] = None +) -> Union[DataFrame, None]: + """ + Process DataFrame with specified columns. + + Parameters + ---------- + df : DataFrame + Input DataFrame to process + columns : list of str + Column names to include + options : dict, optional + Processing options + + Returns + ------- + DataFrame or None + Processed DataFrame, or None if processing fails + """ + pass +``` + +## R Documentation (Roxygen2) + +R code uses Roxygen2-style documentation. + +### Function Documentation + +```r +#' Brief one-line description +#' +#' Detailed description that can span multiple lines. +#' Explain what this function does and important behavior. +#' +#' @param param1 description of param1 +#' @param param2 description of param2 +#' @return description of return value +#' @examples +#' \dontrun{ +#' result <- myFunction(param1 = "test", param2 = 42) +#' print(result) +#' } +#' @note Important note about usage +#' @rdname function-name +#' @since 3.0.0 +#' @export +myFunction <- function(param1, param2) { + # Implementation +} +``` + +### Class Documentation + +```r +#' MyClass: A class for doing XYZ +#' +#' Detailed description of the class functionality +#' and usage patterns. +#' +#' @slot field1 description of field1 +#' @slot field2 description of field2 +#' @export +#' @since 3.0.0 +setClass("MyClass", + slots = c( + field1 = "character", + field2 = "numeric" + ) +) +``` + +## Documentation Best Practices + +### 1. 
Write Clear, Concise Descriptions + +**Good:** +```scala +/** + * Computes the mean of values in the RDD. + * + * @return the arithmetic mean, or NaN if the RDD is empty + */ +def mean(): Double +``` + +**Bad:** +```scala +/** + * This method calculates and returns the mean. + */ +def mean(): Double +``` + +### 2. Document Edge Cases + +```scala +/** + * Divides two numbers. + * + * @param a numerator + * @param b denominator + * @return result of a / b + * @throws ArithmeticException if b is zero + * @note Returns Double.PositiveInfinity if a > 0 and b = 0+ + */ +def divide(a: Double, b: Double): Double +``` + +### 3. Provide Examples + +Always include examples for public APIs: + +```scala +/** + * Filters elements using the given predicate. + * + * Example: + * {{{ + * val rdd = sc.parallelize(1 to 10) + * val evens = rdd.filter(_ % 2 == 0) + * evens.collect() // Array(2, 4, 6, 8, 10) + * }}} + */ +def filter(f: T => Boolean): RDD[T] +``` + +### 4. Document Thread Safety + +```scala +/** + * Thread-safe cache implementation. + * + * @note This class uses internal synchronization and is safe + * for concurrent access from multiple threads. + */ +class ConcurrentCache[K, V] extends Cache[K, V] +``` + +### 5. Document Performance Characteristics + +```scala +/** + * Sorts the RDD by key. + * + * @note This operation triggers a shuffle and is expensive. + * The time complexity is O(n log n) where n is the + * number of elements. + */ +def sortByKey(): RDD[(K, V)] +``` + +### 6. Link to Related APIs + +```scala +/** + * Maps elements to key-value pairs. + * + * @see [[groupByKey]] for grouping by keys + * @see [[reduceByKey]] for aggregating by keys + */ +def keyBy[K](f: T => K): RDD[(K, T)] +``` + +### 7. Version Information + +```scala +/** + * New feature introduced in 3.5.0. + * + * @since 3.5.0 + */ +def newMethod(): Unit + +/** + * Deprecated method, use [[newMethod]] instead. + * + * @deprecated Use newMethod() instead, since 3.5.0 + */ +@deprecated("Use newMethod() instead", "3.5.0") +def oldMethod(): Unit +``` + +## Internal Documentation + +### Code Comments + +Use comments for complex logic: + +```scala +// Sort by key and value to ensure deterministic output +// This is critical for testing and reproducing results +val sorted = data.sortBy(x => (x._1, x._2)) + +// TODO: Optimize this for large datasets +// Current implementation loads all data into memory +val result = computeExpensiveOperation() + +// FIXME: This breaks when input size exceeds Int.MaxValue +val size = data.size.toInt +``` + +### Architecture Comments + +Document architectural decisions: + +```scala +/** + * Internal scheduler implementation. + * + * Architecture: + * 1. Jobs are submitted to DAGScheduler + * 2. DAGScheduler creates stages based on shuffle boundaries + * 3. Each stage is submitted as a TaskSet to TaskScheduler + * 4. TaskScheduler assigns tasks to executors + * 5. 
Task results are returned to the driver + * + * Thread Safety: + * - DAGScheduler runs in a single thread (event loop) + * - TaskScheduler methods are thread-safe + * - Results are collected with appropriate synchronization + */ +private[spark] class SchedulerImpl +``` + +## Generating Documentation + +### Scaladoc + +```bash +# Generate Scaladoc +./build/mvn scala:doc + +# Output in target/site/scaladocs/ +``` + +### Javadoc + +```bash +# Generate Javadoc +./build/mvn javadoc:javadoc + +# Output in target/site/apidocs/ +``` + +### Python Documentation + +```bash +# Generate Sphinx documentation +cd python/docs +make html + +# Output in _build/html/ +``` + +### R Documentation + +```bash +# Generate R documentation +cd R/pkg +R CMD Rd2pdf . +``` + +## Documentation Review Checklist + +When reviewing documentation: + +- [ ] Is the description clear and accurate? +- [ ] Are all parameters documented? +- [ ] Is the return value documented? +- [ ] Are exceptions/errors documented? +- [ ] Are examples provided for public APIs? +- [ ] Is thread safety documented if relevant? +- [ ] Are performance characteristics noted? +- [ ] Is version information included? +- [ ] Are deprecated APIs marked? +- [ ] Are there links to related APIs? +- [ ] Is internal vs. public API clearly marked? + +## Tools + +### IDE Support + +- **IntelliJ IDEA**: Auto-generates documentation templates +- **VS Code**: Extensions for Scaladoc/Javadoc +- **Eclipse**: Built-in Javadoc support + +### Linters + +- **Scalastyle**: Checks for missing Scaladoc +- **Checkstyle**: Validates Javadoc +- **Pylint**: Checks Python docstrings +- **roxygen2**: Validates R documentation + +## Resources + +- [Scaladoc Style Guide](https://docs.scala-lang.org/style/scaladoc.html) +- [Oracle Javadoc Guide](https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html) +- [PEP 257 - Docstring Conventions](https://www.python.org/dev/peps/pep-0257/) +- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) +- [Roxygen2 Documentation](https://roxygen2.r-lib.org/) + +## Contributing + +When contributing code to Spark: + +1. Follow the documentation style for your language +2. Document all public APIs +3. Include examples for new features +4. Update existing documentation when changing behavior +5. Run documentation generators to verify formatting + +For more information, see [CONTRIBUTING.md](CONTRIBUTING.md). diff --git a/README.md b/README.md index 65dfd67ac520e..0dd1f7f173bea 100644 --- a/README.md +++ b/README.md @@ -15,11 +15,33 @@ and Structured Streaming for stream processing. [![PyPI Downloads](https://static.pepy.tech/personalized-badge/pyspark?period=month&units=international_system&left_color=black&right_color=orange&left_text=PyPI%20downloads)](https://pypi.org/project/pyspark/) -## Online Documentation +## Documentation You can find the latest Spark documentation, including a programming guide, on the [project web page](https://spark.apache.org/documentation.html). -This README file only contains basic setup instructions. 
+ +### Repository Documentation + +- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Spark architecture overview and component descriptions +- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide with build instructions, testing, and IDE setup +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to Apache Spark + +### Module Documentation + +- **[core/](core/README.md)** - Spark Core: RDDs, scheduling, memory management, storage +- **[sql/](sql/README.md)** - Spark SQL: DataFrames, Datasets, SQL engine, data sources +- **[streaming/](streaming/README.md)** - Spark Streaming: DStreams and Structured Streaming +- **[mllib/](mllib/README.md)** - MLlib: Machine learning library with algorithms and pipelines +- **[graphx/](graphx/README.md)** - GraphX: Graph processing framework and algorithms +- **[examples/](examples/README.md)** - Example programs in Scala, Java, Python, and R + +### Tools and Utilities + +- **[bin/](bin/README.md)** - User-facing scripts (spark-submit, spark-shell, pyspark, etc.) +- **[sbin/](sbin/README.md)** - Admin scripts for managing Spark standalone clusters +- **[launcher/](launcher/README.md)** - Programmatic API for launching Spark applications +- **[resource-managers/](resource-managers/README.md)** - Integrations with YARN, Kubernetes, and Mesos +- **[common/](common/README.md)** - Common utilities and libraries shared across modules ## Build Pipeline Status From fa25e0c24eac4bb8108330cacf4b705bf6a57e3f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:29:36 +0000 Subject: [PATCH 6/6] Add comprehensive documentation index for easy navigation Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- DOCUMENTATION_INDEX.md | 345 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 345 insertions(+) create mode 100644 DOCUMENTATION_INDEX.md diff --git a/DOCUMENTATION_INDEX.md b/DOCUMENTATION_INDEX.md new file mode 100644 index 0000000000000..cd4227c35df67 --- /dev/null +++ b/DOCUMENTATION_INDEX.md @@ -0,0 +1,345 @@ +# Apache Spark Documentation Index + +This document provides a complete index of all documentation available in the Apache Spark repository. 
+ +## Quick Start + +- **[README.md](README.md)** - Main project README with quick start guide +- **[docs/quick-start.md](docs/quick-start.md)** - Interactive tutorial for getting started +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to the project + +## Architecture and Development + +### Core Documentation +- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Complete Spark architecture overview + - Core components and their responsibilities + - Execution model and data flow + - Module structure and dependencies + - Key subsystems (memory, shuffle, storage, networking) + +- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide + - Setting up development environment + - Building and testing instructions + - IDE configuration + - Code style guidelines + - Debugging techniques + - Common development tasks + +- **[CODE_DOCUMENTATION_GUIDE.md](CODE_DOCUMENTATION_GUIDE.md)** - Code documentation standards + - Scaladoc guidelines + - Javadoc guidelines + - Python docstring conventions + - R documentation standards + - Best practices and examples + +## Module Documentation + +### Core Modules + +#### Spark Core +- **[core/README.md](core/README.md)** - Spark Core documentation + - RDD API and operations + - SparkContext and configuration + - Task scheduling (DAGScheduler, TaskScheduler) + - Memory management + - Shuffle system + - Storage system + - Serialization + +#### Spark SQL +- **[sql/README.md](sql/README.md)** - Spark SQL documentation (if exists) +- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming guide +- **[docs/sql-data-sources.md](docs/sql-data-sources.md)** - Data source integration +- **[docs/sql-performance-tuning.md](docs/sql-performance-tuning.md)** - Performance tuning + +#### Streaming +- **[streaming/README.md](streaming/README.md)** - Spark Streaming documentation + - DStreams API (legacy) + - Structured Streaming (recommended) + - Input sources and output sinks + - Windowing and stateful operations + - Performance tuning + +#### MLlib +- **[mllib/README.md](mllib/README.md)** - MLlib documentation + - ML Pipeline API (spark.ml) + - RDD-based API (spark.mllib - maintenance mode) + - Classification and regression algorithms + - Clustering algorithms + - Feature engineering + - Model selection and tuning + +#### GraphX +- **[graphx/README.md](graphx/README.md)** - GraphX documentation + - Property graphs + - Graph operators + - Graph algorithms (PageRank, Connected Components, etc.) 
+ - Pregel API + - Performance optimization + +### Common Modules +- **[common/README.md](common/README.md)** - Common utilities documentation + - Network communication (network-common, network-shuffle) + - Key-value store + - Sketching algorithms + - Unsafe operations + +### Tools and Utilities + +#### User-Facing Tools +- **[bin/README.md](bin/README.md)** - User scripts documentation + - spark-submit: Application submission + - spark-shell: Interactive Scala shell + - pyspark: Interactive Python shell + - sparkR: Interactive R shell + - spark-sql: SQL query shell + - run-example: Example runner + +#### Administration Tools +- **[sbin/README.md](sbin/README.md)** - Admin scripts documentation + - Cluster management scripts + - start-all.sh / stop-all.sh + - Master and worker daemon management + - History server setup + - Standalone cluster configuration + +#### Programmatic API +- **[launcher/README.md](launcher/README.md)** - Launcher API documentation + - SparkLauncher for programmatic application launching + - SparkAppHandle for monitoring + - Integration patterns + +#### Resource Managers +- **[resource-managers/README.md](resource-managers/README.md)** - Resource manager integrations + - YARN integration + - Kubernetes integration + - Mesos integration + - Comparison and configuration + +### Examples +- **[examples/README.md](examples/README.md)** - Example programs + - Core examples (RDD operations) + - SQL examples (DataFrames) + - Streaming examples + - MLlib examples + - GraphX examples + - Running examples + +## Official Documentation + +### Programming Guides +- **[docs/programming-guide.md](docs/programming-guide.md)** - General programming guide +- **[docs/rdd-programming-guide.md](docs/rdd-programming-guide.md)** - RDD programming +- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming +- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming +- **[docs/streaming-programming-guide.md](docs/streaming-programming-guide.md)** - DStreams (legacy) +- **[docs/ml-guide.md](docs/ml-guide.md)** - Machine learning guide +- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - Graph processing + +### Deployment +- **[docs/cluster-overview.md](docs/cluster-overview.md)** - Cluster mode overview +- **[docs/submitting-applications.md](docs/submitting-applications.md)** - Application submission +- **[docs/spark-standalone.md](docs/spark-standalone.md)** - Standalone cluster mode +- **[docs/running-on-yarn.md](docs/running-on-yarn.md)** - Running on YARN +- **[docs/running-on-kubernetes.md](docs/running-on-kubernetes.md)** - Running on Kubernetes + +### Configuration and Tuning +- **[docs/configuration.md](docs/configuration.md)** - Configuration reference +- **[docs/tuning.md](docs/tuning.md)** - Performance tuning guide +- **[docs/hardware-provisioning.md](docs/hardware-provisioning.md)** - Hardware recommendations +- **[docs/job-scheduling.md](docs/job-scheduling.md)** - Job scheduling +- **[docs/monitoring.md](docs/monitoring.md)** - Monitoring and instrumentation + +### Advanced Topics +- **[docs/security.md](docs/security.md)** - Security guide +- **[docs/cloud-integration.md](docs/cloud-integration.md)** - Cloud storage integration +- **[docs/building-spark.md](docs/building-spark.md)** - Building from source + +### Migration Guides +- **[docs/core-migration-guide.md](docs/core-migration-guide.md)** - Core API migration +- 
**[docs/sql-migration-guide.md](docs/sql-migration-guide.md)** - SQL migration +- **[docs/ml-migration-guide.md](docs/ml-migration-guide.md)** - MLlib migration +- **[docs/pyspark-migration-guide.md](docs/pyspark-migration-guide.md)** - PySpark migration +- **[docs/ss-migration-guide.md](docs/ss-migration-guide.md)** - Structured Streaming migration + +### API References +- **[docs/sql-ref.md](docs/sql-ref.md)** - SQL reference +- **[docs/sql-ref-functions.md](docs/sql-ref-functions.md)** - SQL functions +- **[docs/sql-ref-datatypes.md](docs/sql-ref-datatypes.md)** - SQL data types +- **[docs/sql-ref-syntax.md](docs/sql-ref-syntax.md)** - SQL syntax + +## Language-Specific Documentation + +### Python (PySpark) +- **[python/README.md](python/README.md)** - PySpark overview +- **[python/docs/](python/docs/)** - PySpark documentation source +- **[docs/api/python/](docs/api/python/)** - Python API docs (generated) + +### R (SparkR) +- **[R/README.md](R/README.md)** - SparkR overview +- **[docs/sparkr.md](docs/sparkr.md)** - SparkR guide +- **[R/pkg/README.md](R/pkg/README.md)** - R package documentation + +### Scala +- **[docs/api/scala/](docs/api/scala/)** - Scala API docs (generated) + +### Java +- **[docs/api/java/](docs/api/java/)** - Java API docs (generated) + +## Data Sources + +### Built-in Sources +- **[docs/sql-data-sources-load-save-functions.md](docs/sql-data-sources-load-save-functions.md)** +- **[docs/sql-data-sources-parquet.md](docs/sql-data-sources-parquet.md)** +- **[docs/sql-data-sources-json.md](docs/sql-data-sources-json.md)** +- **[docs/sql-data-sources-csv.md](docs/sql-data-sources-csv.md)** +- **[docs/sql-data-sources-jdbc.md](docs/sql-data-sources-jdbc.md)** +- **[docs/sql-data-sources-avro.md](docs/sql-data-sources-avro.md)** +- **[docs/sql-data-sources-orc.md](docs/sql-data-sources-orc.md)** + +### External Integrations +- **[docs/streaming-kafka-integration.md](docs/streaming-kafka-integration.md)** - Kafka integration +- **[docs/streaming-kinesis-integration.md](docs/streaming-kinesis-integration.md)** - Kinesis integration +- **[docs/structured-streaming-kafka-integration.md](docs/structured-streaming-kafka-integration.md)** - Structured Streaming with Kafka + +## Special Topics + +### Machine Learning +- **[docs/ml-pipeline.md](docs/ml-pipeline.md)** - ML Pipelines +- **[docs/ml-features.md](docs/ml-features.md)** - Feature transformers +- **[docs/ml-classification-regression.md](docs/ml-classification-regression.md)** - Classification/Regression +- **[docs/ml-clustering.md](docs/ml-clustering.md)** - Clustering +- **[docs/ml-collaborative-filtering.md](docs/ml-collaborative-filtering.md)** - Recommendation +- **[docs/ml-tuning.md](docs/ml-tuning.md)** - Hyperparameter tuning + +### Streaming +- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming guide + +### Graph Processing +- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - GraphX guide + +## Additional Resources + +### Community +- **[Apache Spark Website](https://spark.apache.org/)** - Official website +- **[Spark Documentation](https://spark.apache.org/documentation.html)** - Online docs +- **[Developer Tools](https://spark.apache.org/developer-tools.html)** - Developer resources +- **[Community](https://spark.apache.org/community.html)** - Mailing lists and chat + +### External Links +- **[Spark JIRA](https://issues.apache.org/jira/projects/SPARK)** - Issue tracker +- **[GitHub 
Repository](https://github.com/apache/spark)** - Source code +- **[Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark)** - Q&A + +## Document Organization + +### By Audience + +**For Users:** +- Quick Start Guide +- Programming Guides (SQL, Streaming, MLlib, GraphX) +- Configuration Guide +- Deployment Guides (YARN, Kubernetes) +- Examples + +**For Developers:** +- ARCHITECTURE.md +- DEVELOPMENT.md +- CODE_DOCUMENTATION_GUIDE.md +- Module READMEs +- Building Guide + +**For Administrators:** +- Cluster Overview +- Standalone Mode Guide +- Monitoring Guide +- Security Guide +- Admin Scripts (sbin/) + +### By Topic + +**Getting Started:** +1. README.md +2. docs/quick-start.md +3. docs/programming-guide.md + +**Core Concepts:** +1. ARCHITECTURE.md +2. core/README.md +3. docs/rdd-programming-guide.md + +**Data Processing:** +1. docs/sql-programming-guide.md +2. docs/structured-streaming-programming-guide.md +3. docs/ml-guide.md + +**Deployment:** +1. docs/cluster-overview.md +2. docs/submitting-applications.md +3. docs/running-on-yarn.md or docs/running-on-kubernetes.md + +**Optimization:** +1. docs/tuning.md +2. docs/sql-performance-tuning.md +3. docs/hardware-provisioning.md + +## Documentation Standards + +All documentation follows these principles: + +1. **Clarity**: Clear, concise explanations +2. **Completeness**: Comprehensive coverage of topics +3. **Examples**: Code examples for all concepts +4. **Structure**: Consistent formatting and organization +5. **Accuracy**: Up-to-date and technically correct +6. **Accessibility**: Easy to find and navigate + +## Contributing to Documentation + +To contribute to Spark documentation: + +1. Follow the style guides in CODE_DOCUMENTATION_GUIDE.md +2. Update relevant documentation when changing code +3. Add examples for new features +4. Test documentation builds locally +5. Submit pull requests with documentation updates + +See [CONTRIBUTING.md](CONTRIBUTING.md) for details. + +## Building Documentation + +### Building User Documentation +```bash +cd docs +bundle install +bundle exec jekyll serve +# View at http://localhost:4000 +``` + +### Building API Documentation +```bash +# Scala API docs +./build/mvn scala:doc + +# Java API docs +./build/mvn javadoc:javadoc + +# Python API docs +cd python/docs +make html +``` + +## Getting Help + +If you can't find what you're looking for: + +1. Check the [Documentation Index](https://spark.apache.org/documentation.html) +2. Search [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark) +3. Ask on the [user mailing list](mailto:user@spark.apache.org) +4. Check [Spark JIRA](https://issues.apache.org/jira/projects/SPARK) for known issues + +## Last Updated + +This index was last updated: 2025-10-19 + +For the most up-to-date documentation, visit [spark.apache.org/docs/latest](https://spark.apache.org/docs/latest/).