From 47b1306b7b5fb8f043a29310dd3b23713d8eef4f Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Sun, 19 Oct 2025 15:31:08 +0000 Subject: [PATCH 1/6] Bump black in /dev in the pip group across 1 directory Bumps the pip group with 1 update in the /dev directory: [black](https://github.com/psf/black). Updates `black` from 23.12.1 to 24.3.0 - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/compare/23.12.1...24.3.0) --- updated-dependencies: - dependency-name: black dependency-version: 24.3.0 dependency-type: direct:production dependency-group: pip ... Signed-off-by: dependabot[bot] --- dev/requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/requirements.txt b/dev/requirements.txt index 40e7fa46cf14b..8dec699878933 100644 --- a/dev/requirements.txt +++ b/dev/requirements.txt @@ -57,7 +57,7 @@ jira>=3.5.2 PyGithub # pandas API on Spark Code formatter. -black==23.12.1 +black==24.3.0 py # Spark Connect (required) From 3db5efb1bd13cba354551cf1f6b46e5f9d52c53d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:08:22 +0000 Subject: [PATCH 2/6] Initial plan From 87a1a2206b13d27f4dc4efdc16d4957ebed44c17 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:22:55 +0000 Subject: [PATCH 3/6] Add comprehensive documentation files for Spark architecture and modules Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- ARCHITECTURE.md | 280 ++++++++++++++++++++++ DEVELOPMENT.md | 462 +++++++++++++++++++++++++++++++++++++ bin/README.md | 453 ++++++++++++++++++++++++++++++++++++ common/README.md | 472 +++++++++++++++++++++++++++++++++++++ core/README.md | 360 +++++++++++++++++++++++++++++ examples/README.md | 432 ++++++++++++++++++++++++++++++++++ graphx/README.md | 549 ++++++++++++++++++++++++++++++++++++++++++++ mllib/README.md | 514 +++++++++++++++++++++++++++++++++++++++++ streaming/README.md | 430 ++++++++++++++++++++++++++++++++++ 9 files changed, 3952 insertions(+) create mode 100644 ARCHITECTURE.md create mode 100644 DEVELOPMENT.md create mode 100644 bin/README.md create mode 100644 common/README.md create mode 100644 core/README.md create mode 100644 examples/README.md create mode 100644 graphx/README.md create mode 100644 mllib/README.md create mode 100644 streaming/README.md diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 0000000000000..ca8e58920596d --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,280 @@ +# Apache Spark Architecture + +This document provides an overview of the Apache Spark architecture and its key components. + +## Table of Contents + +- [Overview](#overview) +- [Core Components](#core-components) +- [Execution Model](#execution-model) +- [Key Subsystems](#key-subsystems) +- [Data Flow](#data-flow) +- [Module Structure](#module-structure) + +## Overview + +Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. + +### Design Principles + +1. **Unified Engine**: Single system for batch processing, streaming, machine learning, and graph processing +2. 
**In-Memory Computing**: Leverages RAM for fast iterative algorithms and interactive queries +3. **Lazy Evaluation**: Operations are not executed until an action is called +4. **Fault Tolerance**: Resilient Distributed Datasets (RDDs) provide automatic fault recovery +5. **Scalability**: Scales from a single machine to thousands of nodes + +## Core Components + +### 1. Spark Core + +The foundation of the Spark platform, providing: + +- **Task scheduling and dispatch** +- **Memory management** +- **Fault recovery** +- **Interaction with storage systems** +- **RDD API** - The fundamental data abstraction + +Location: `core/` directory + +Key classes: +- `SparkContext`: Main entry point for Spark functionality +- `RDD`: Resilient Distributed Dataset, the fundamental data structure +- `DAGScheduler`: Schedules stages based on DAG of operations +- `TaskScheduler`: Launches tasks on executors + +### 2. Spark SQL + +Module for structured data processing with: + +- **DataFrame and Dataset APIs** +- **SQL query engine** +- **Data source connectors** (Parquet, JSON, JDBC, etc.) +- **Catalyst optimizer** for query optimization + +Location: `sql/` directory + +Key components: +- Query parsing and analysis +- Logical and physical query planning +- Code generation for efficient execution +- Catalog management + +### 3. Spark Streaming + +Framework for scalable, high-throughput, fault-tolerant stream processing: + +- **DStreams** (Discretized Streams) - Legacy API +- **Structured Streaming** - Modern streaming API built on Spark SQL + +Location: `streaming/` directory + +Key features: +- Micro-batch processing model +- Exactly-once semantics +- Integration with Kafka, Flume, Kinesis, and more + +### 4. MLlib (Machine Learning Library) + +Scalable machine learning library providing: + +- **Classification and regression** +- **Clustering** +- **Collaborative filtering** +- **Dimensionality reduction** +- **Feature extraction and transformation** +- **ML Pipelines** for building workflows + +Location: `mllib/` and `mllib-local/` directories + +### 5. GraphX + +Graph processing framework with: + +- **Graph abstraction** built on top of RDDs +- **Graph algorithms** (PageRank, connected components, triangle counting, etc.) +- **Pregel-like API** for iterative graph computations + +Location: `graphx/` directory + +## Execution Model + +### Spark Application Lifecycle + +1. **Initialization**: User creates a `SparkContext` or `SparkSession` +2. **Job Submission**: Actions trigger job submission to the DAG scheduler +3. **Stage Creation**: DAG scheduler breaks jobs into stages based on shuffle boundaries +4. **Task Scheduling**: Task scheduler assigns tasks to executors +5. **Execution**: Executors run tasks and return results +6. 
**Result Collection**: Results are collected back to the driver or written to storage + +### Driver and Executors + +- **Driver Program**: Runs the main() function and creates SparkContext + - Converts user program into tasks + - Schedules tasks on executors + - Maintains metadata about the application + +- **Executors**: Processes that run on worker nodes + - Run tasks assigned by the driver + - Store data in memory or disk + - Return results to the driver + +### Cluster Managers + +Spark supports multiple cluster managers: + +- **Standalone**: Built-in cluster manager +- **Apache YARN**: Hadoop's resource manager +- **Apache Mesos**: General-purpose cluster manager +- **Kubernetes**: Container orchestration platform + +Location: `resource-managers/` directory + +## Key Subsystems + +### Memory Management + +Spark manages memory in several regions: + +1. **Execution Memory**: For shuffles, joins, sorts, and aggregations +2. **Storage Memory**: For caching and broadcasting data +3. **User Memory**: For user data structures and metadata +4. **Reserved Memory**: System reserved memory + +Configuration: Unified memory management allows dynamic allocation between execution and storage. + +### Shuffle Subsystem + +Handles data redistribution across partitions: + +- **Shuffle Write**: Map tasks write data to local disk +- **Shuffle Read**: Reduce tasks fetch data from map outputs +- **Shuffle Service**: External shuffle service for improved reliability + +Location: `core/src/main/scala/org/apache/spark/shuffle/` + +### Storage Subsystem + +Manages cached data and intermediate results: + +- **Block Manager**: Manages storage of data blocks +- **Memory Store**: In-memory cache +- **Disk Store**: Disk-based storage +- **Off-Heap Storage**: Direct memory storage + +Location: `core/src/main/scala/org/apache/spark/storage/` + +### Serialization + +Efficient serialization is critical for performance: + +- **Java Serialization**: Default, but slower +- **Kryo Serialization**: Faster and more compact (recommended) +- **Custom Serializers**: For specific data types + +Location: `core/src/main/scala/org/apache/spark/serializer/` + +## Data Flow + +### Transformation and Action Pipeline + +1. **Transformations**: Lazy operations that define a new RDD/DataFrame + - Examples: `map`, `filter`, `join`, `groupBy` + - Build up a DAG of operations + +2. **Actions**: Operations that trigger computation + - Examples: `count`, `collect`, `save`, `reduce` + - Cause DAG execution + +3. **Stages**: Groups of tasks that can be executed together + - Separated by shuffle operations + - Pipeline operations within a stage + +4. 
**Tasks**: Unit of work sent to executors + - One task per partition + - Execute transformations and return results + +## Module Structure + +### Project Organization + +``` +spark/ +├── assembly/ # Builds the final Spark assembly JAR +├── bin/ # User-facing command-line scripts +├── build/ # Build-related scripts +├── common/ # Common utilities shared across modules +├── conf/ # Configuration file templates +├── connector/ # External data source connectors +├── core/ # Spark Core engine +├── data/ # Sample data for examples +├── dev/ # Development scripts and tools +├── docs/ # Documentation source files +├── examples/ # Example programs +├── graphx/ # Graph processing library +├── hadoop-cloud/ # Cloud storage integration +├── launcher/ # Application launcher +├── mllib/ # Machine learning library (RDD-based) +├── mllib-local/ # Local ML algorithms +├── python/ # PySpark - Python API +├── R/ # SparkR - R API +├── repl/ # Interactive Scala shell +├── resource-managers/ # Cluster manager integrations +├── sbin/ # Admin scripts for cluster management +├── sql/ # Spark SQL and DataFrames +├── streaming/ # Streaming processing +└── tools/ # Various utility tools +``` + +### Module Dependencies + +- **Core**: Foundation for all other modules +- **SQL**: Depends on Core, used by Streaming, MLlib +- **Streaming**: Depends on Core and SQL +- **MLlib**: Depends on Core and SQL +- **GraphX**: Depends on Core +- **Python/R**: Language bindings to Core APIs + +## Building and Testing + +For detailed build instructions, see [building-spark.md](docs/building-spark.md). + +Quick start: +```bash +# Build Spark +./build/mvn -DskipTests clean package + +# Run tests +./dev/run-tests + +# Run specific module tests +./build/mvn test -pl core +``` + +## Performance Tuning + +Key areas for optimization: + +1. **Memory Configuration**: Adjust executor memory and memory fractions +2. **Parallelism**: Set appropriate partition counts +3. **Serialization**: Use Kryo for better performance +4. **Caching**: Cache frequently accessed data +5. **Broadcast Variables**: Efficiently distribute large read-only data +6. **Data Locality**: Ensure tasks run close to their data + +See [tuning.md](docs/tuning.md) for detailed tuning guidelines. + +## Contributing + +See [CONTRIBUTING.md](CONTRIBUTING.md) and the [contributing guide](https://spark.apache.org/contributing.html) for information on how to contribute to Apache Spark. + +## Further Reading + +- [Programming Guide](docs/programming-guide.md) +- [SQL Programming Guide](docs/sql-programming-guide.md) +- [Structured Streaming Guide](docs/structured-streaming-programming-guide.md) +- [MLlib Guide](docs/ml-guide.md) +- [GraphX Guide](docs/graphx-programming-guide.md) +- [Cluster Overview](docs/cluster-overview.md) +- [Configuration](docs/configuration.md) diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md new file mode 100644 index 0000000000000..2e5baeb6e0d36 --- /dev/null +++ b/DEVELOPMENT.md @@ -0,0 +1,462 @@ +# Spark Development Guide + +This guide provides information for developers working on Apache Spark. 
+ +## Table of Contents + +- [Getting Started](#getting-started) +- [Development Environment](#development-environment) +- [Building Spark](#building-spark) +- [Testing](#testing) +- [Code Style](#code-style) +- [IDE Setup](#ide-setup) +- [Debugging](#debugging) +- [Working with Git](#working-with-git) +- [Common Development Tasks](#common-development-tasks) + +## Getting Started + +### Prerequisites + +- Java 17 or Java 21 (for Spark 4.x) +- Maven 3.9.9 or later +- Python 3.9+ (for PySpark development) +- R 4.0+ (for SparkR development) +- Git + +### Initial Setup + +1. **Clone the repository:** + ```bash + git clone https://github.com/apache/spark.git + cd spark + ``` + +2. **Build Spark:** + ```bash + ./build/mvn -DskipTests clean package + ``` + +3. **Verify the build:** + ```bash + ./bin/spark-shell + ``` + +## Development Environment + +### Directory Structure + +``` +spark/ +├── assembly/ # Final assembly JAR creation +├── bin/ # User command scripts (spark-submit, spark-shell, etc.) +├── build/ # Build scripts and Maven wrapper +├── common/ # Common utilities and modules +├── conf/ # Configuration templates +├── core/ # Spark Core +├── dev/ # Development tools (run-tests, lint, etc.) +├── docs/ # Documentation (Jekyll-based) +├── examples/ # Example programs +├── python/ # PySpark implementation +├── R/ # SparkR implementation +├── sbin/ # Admin scripts (start-all.sh, stop-all.sh, etc.) +├── sql/ # Spark SQL +└── [other modules] +``` + +### Key Development Directories + +- `dev/`: Contains scripts for testing, linting, and releasing +- `dev/run-tests`: Main test runner +- `dev/lint-*`: Various linting tools +- `build/mvn`: Maven wrapper script + +## Building Spark + +### Full Build + +```bash +# Build all modules, skip tests +./build/mvn -DskipTests clean package + +# Build with specific Hadoop version +./build/mvn -Phadoop-3.4 -DskipTests clean package + +# Build with Hive support +./build/mvn -Phive -Phive-thriftserver -DskipTests package +``` + +### Module-Specific Builds + +```bash +# Build only core module +./build/mvn -pl core -DskipTests package + +# Build core and its dependencies +./build/mvn -pl core -am -DskipTests package + +# Build SQL module +./build/mvn -pl sql/core -am -DskipTests package +``` + +### Build Profiles + +Common Maven profiles: + +- `-Phadoop-3.4`: Build with Hadoop 3.4 +- `-Pyarn`: Include YARN support +- `-Pkubernetes`: Include Kubernetes support +- `-Phive`: Include Hive support +- `-Phive-thriftserver`: Include Hive Thrift Server +- `-Pscala-2.13`: Build with Scala 2.13 + +### Fast Development Builds + +For faster iteration during development: + +```bash +# Skip Scala and Java style checks +./build/mvn -DskipTests -Dcheckstyle.skip package + +# Build specific module quickly +./build/mvn -pl sql/core -am -DskipTests -Dcheckstyle.skip package +``` + +## Testing + +### Running All Tests + +```bash +# Run all tests (takes several hours) +./dev/run-tests + +# Run tests for specific modules +./dev/run-tests --modules sql +``` + +### Running Specific Test Suites + +#### Scala/Java Tests + +```bash +# Run all tests in a module +./build/mvn test -pl core + +# Run a specific test suite +./build/mvn test -pl core -Dtest=SparkContextSuite + +# Run specific test methods +./build/mvn test -pl core -Dtest=SparkContextSuite#testJobInterruption +``` + +#### Python Tests + +```bash +# Run all PySpark tests +cd python && python run-tests.py + +# Run specific test file +cd python && python -m pytest pyspark/tests/test_context.py + +# Run specific test method +cd python 
&& python -m pytest pyspark/tests/test_context.py::SparkContextTests::test_stop +``` + +#### R Tests + +```bash +cd R +R CMD check --no-manual --no-build-vignettes spark +``` + +### Test Coverage + +```bash +# Generate coverage report +./build/mvn clean install -DskipTests +./dev/run-tests --coverage +``` + +## Code Style + +### Scala Code Style + +Spark uses Scalastyle for Scala code checking: + +```bash +# Check Scala style +./dev/lint-scala + +# Auto-format (if scalafmt is configured) +./build/mvn scala:format +``` + +Key style guidelines: +- 2-space indentation +- Max line length: 100 characters +- Follow [Scala style guide](https://docs.scala-lang.org/style/) + +### Java Code Style + +Java code follows Google Java Style: + +```bash +# Check Java style +./dev/lint-java +``` + +Key guidelines: +- 2-space indentation +- Max line length: 100 characters +- Use Java 17+ features appropriately + +### Python Code Style + +PySpark follows PEP 8: + +```bash +# Check Python style +./dev/lint-python + +# Auto-format with black (if available) +cd python && black pyspark/ +``` + +Key guidelines: +- 4-space indentation +- Max line length: 100 characters +- Type hints encouraged for new code + +## IDE Setup + +### IntelliJ IDEA + +1. **Import Project:** + - File → Open → Select `pom.xml` + - Choose "Open as Project" + - Import Maven projects automatically + +2. **Configure JDK:** + - File → Project Structure → Project SDK → Select Java 17 or 21 + +3. **Recommended Plugins:** + - Scala plugin + - Python plugin + - Maven plugin + +4. **Code Style:** + - Import Spark code style from `dev/scalastyle-config.xml` + +### Visual Studio Code + +1. **Recommended Extensions:** + - Scala (Metals) + - Python + - Maven for Java + +2. **Workspace Settings:** + ```json + { + "java.configuration.maven.userSettings": ".mvn/settings.xml", + "python.linting.enabled": true, + "python.linting.pylintEnabled": true + } + ``` + +### Eclipse + +1. **Import Project:** + - File → Import → Maven → Existing Maven Projects + +2. **Install Plugins:** + - Scala IDE + - Maven Integration + +## Debugging + +### Debugging Scala/Java Code + +#### Using IDE Debugger + +1. Run tests with debugging enabled in your IDE +2. Set breakpoints in source code +3. Run test in debug mode + +#### Command Line Debugging + +```bash +# Enable remote debugging +export SPARK_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" +./bin/spark-shell +``` + +Then attach your IDE debugger to port 5005. + +### Debugging PySpark + +```bash +# Enable Python debugging +export PYSPARK_PYTHON=python +export PYSPARK_DRIVER_PYTHON=python + +# Run with pdb +python -m pdb your_spark_script.py +``` + +### Logging + +Adjust log levels in `conf/log4j2.properties`: + +```properties +# Set root logger level +rootLogger.level = info + +# Set specific logger +logger.spark.name = org.apache.spark +logger.spark.level = debug +``` + +## Working with Git + +### Branch Naming + +- Feature branches: `feature/description` +- Bug fixes: `fix/issue-number-description` +- Documentation: `docs/description` + +### Commit Messages + +Follow conventional commit format: + +``` +[SPARK-XXXXX] Brief description (max 72 chars) + +Detailed description of the change, motivation, and impact. + +- Bullet points for specific changes +- Reference related issues + +Closes #XXXXX +``` + +### Creating Pull Requests + +1. **Fork the repository** on GitHub +2. **Create a feature branch** from master +3. **Make your changes** with clear commits +4. **Push to your fork** +5. 
**Open a Pull Request** with: + - Clear title and description + - Link to JIRA issue if applicable + - Unit tests for new functionality + - Documentation updates if needed + +### Code Review + +- Address review comments promptly +- Keep discussions professional and constructive +- Be open to suggestions and improvements + +## Common Development Tasks + +### Adding a New Configuration + +1. Define config in appropriate config file (e.g., `sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala`) +2. Document the configuration +3. Add tests +4. Update documentation in `docs/configuration.md` + +### Adding a New API + +1. Implement the API with proper documentation +2. Add comprehensive unit tests +3. Update relevant documentation +4. Consider backward compatibility +5. Add deprecation notices if replacing old APIs + +### Adding a New Data Source + +1. Implement `DataSourceV2` interface +2. Add read/write support +3. Include integration tests +4. Document usage in `docs/sql-data-sources-*.md` + +### Performance Optimization + +1. Identify bottleneck with profiling +2. Create benchmark to measure improvement +3. Implement optimization +4. Verify performance gain +5. Ensure no functionality regression + +### Updating Dependencies + +1. Check for security vulnerabilities +2. Test compatibility +3. Update version in `pom.xml` +4. Update `LICENSE` and `NOTICE` files if needed +5. Run full test suite + +## Useful Commands + +```bash +# Clean build artifacts +./build/mvn clean + +# Skip Scalastyle checks +./build/mvn -Dscalastyle.skip package + +# Generate API documentation +./build/mvn scala:doc + +# Check for dependency updates +./build/mvn versions:display-dependency-updates + +# Profile a build +./build/mvn clean package -Dprofile + +# Run Spark locally with different memory +./bin/spark-shell --driver-memory 4g --executor-memory 4g +``` + +## Troubleshooting + +### Build Issues + +- **Out of Memory**: Increase Maven memory with `export MAVEN_OPTS="-Xmx4g"` +- **Compilation errors**: Clean build with `./build/mvn clean` +- **Version conflicts**: Update local Maven repo: `./build/mvn -U package` + +### Test Failures + +- Run single test to isolate issue +- Check for environment-specific problems +- Review logs in `target/` directories +- Enable debug logging for more detail + +### IDE Issues + +- Reimport Maven project +- Invalidate caches and restart +- Check SDK and language level settings + +## Resources + +- [Apache Spark Website](https://spark.apache.org/) +- [Spark Developer Tools](https://spark.apache.org/developer-tools.html) +- [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK) +- [Spark Mailing Lists](https://spark.apache.org/community.html#mailing-lists) +- [Spark JIRA](https://issues.apache.org/jira/projects/SPARK) + +## Getting Help + +- Ask questions on [user@spark.apache.org](mailto:user@spark.apache.org) +- Report bugs on [JIRA](https://issues.apache.org/jira/projects/SPARK) +- Discuss on [dev@spark.apache.org](mailto:dev@spark.apache.org) +- Chat on the [Spark Slack](https://spark.apache.org/community.html) + +## Contributing Back + +See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed contribution guidelines. + +Remember: Quality over quantity. Well-tested, documented changes are more valuable than large, poorly understood patches. 
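As a companion to the "Adding a New Configuration" task described above, the sketch below shows the builder pattern used for entries in `SQLConf.scala`. The config key, default value, version, and description here are hypothetical placeholders; check the current `SQLConf` source for the exact builder API before copying it.

```scala
// Hypothetical config entry following the SQLConf builder pattern
// (the key, doc string, version, and default below are placeholders).
val MY_FEATURE_ENABLED = buildConf("spark.sql.myFeature.enabled")
  .doc("Enables the hypothetical my-feature code path.")
  .version("4.1.0")
  .booleanConf
  .createWithDefault(false)
```

Once an entry like this is defined, it still needs a unit test and a matching entry in `docs/configuration.md`, as outlined in the checklist above.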
diff --git a/bin/README.md b/bin/README.md new file mode 100644 index 0000000000000..e83fbf583746c --- /dev/null +++ b/bin/README.md @@ -0,0 +1,453 @@ +# Spark Binary Scripts + +This directory contains user-facing command-line scripts for running Spark applications and interactive shells. + +## Overview + +These scripts provide convenient entry points for: +- Running Spark applications +- Starting interactive shells (Scala, Python, R, SQL) +- Managing Spark clusters +- Utility operations + +## Main Scripts + +### spark-submit + +Submit Spark applications to a cluster. + +**Usage:** +```bash +./bin/spark-submit \ + --class \ + --master \ + --deploy-mode \ + --conf = \ + ... # other options + \ + [application-arguments] +``` + +**Examples:** +```bash +# Run on local mode with 4 cores +./bin/spark-submit --class org.example.App --master local[4] app.jar + +# Run on YARN cluster +./bin/spark-submit --class org.example.App --master yarn --deploy-mode cluster app.jar + +# Run Python application +./bin/spark-submit --master local[2] script.py + +# Run with specific memory and executor settings +./bin/spark-submit \ + --master spark://master:7077 \ + --executor-memory 4G \ + --total-executor-cores 8 \ + --class org.example.App \ + app.jar +``` + +**Key Options:** +- `--master`: Master URL (local, spark://, yarn, k8s://, mesos://) +- `--deploy-mode`: client or cluster +- `--class`: Application main class (for Java/Scala) +- `--name`: Application name +- `--jars`: Additional JARs to include +- `--packages`: Maven coordinates of packages +- `--conf`: Spark configuration property +- `--driver-memory`: Driver memory (e.g., 1g, 2g) +- `--executor-memory`: Executor memory +- `--executor-cores`: Cores per executor +- `--num-executors`: Number of executors (YARN only) + +See [submitting-applications.md](../docs/submitting-applications.md) for complete documentation. + +### spark-shell + +Interactive Scala shell with Spark support. + +**Usage:** +```bash +./bin/spark-shell [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/spark-shell + +# Connect to remote cluster +./bin/spark-shell --master spark://master:7077 + +# With specific memory +./bin/spark-shell --driver-memory 4g + +# With additional packages +./bin/spark-shell --packages org.apache.spark:spark-avro_2.13:3.5.0 +``` + +**In the shell:** +```scala +scala> val data = spark.range(1000) +scala> data.count() +res0: Long = 1000 + +scala> spark.read.json("data.json").show() +``` + +### pyspark + +Interactive Python shell with PySpark support. + +**Usage:** +```bash +./bin/pyspark [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/pyspark + +# Connect to remote cluster +./bin/pyspark --master spark://master:7077 + +# With specific Python version +PYSPARK_PYTHON=python3.11 ./bin/pyspark +``` + +**In the shell:** +```python +>>> df = spark.range(1000) +>>> df.count() +1000 + +>>> spark.read.json("data.json").show() +``` + +### sparkR + +Interactive R shell with SparkR support. + +**Usage:** +```bash +./bin/sparkR [options] +``` + +**Examples:** +```bash +# Start local shell +./bin/sparkR + +# Connect to remote cluster +./bin/sparkR --master spark://master:7077 +``` + +**In the shell:** +```r +> df <- createDataFrame(iris) +> head(df) +> count(df) +``` + +### spark-sql + +Interactive SQL shell for running SQL queries. 
+ +**Usage:** +```bash +./bin/spark-sql [options] +``` + +**Examples:** +```bash +# Start SQL shell +./bin/spark-sql + +# Connect to Hive metastore +./bin/spark-sql --conf spark.sql.warehouse.dir=/path/to/warehouse + +# Run SQL file +./bin/spark-sql -f query.sql + +# Execute inline query +./bin/spark-sql -e "SELECT * FROM table" +``` + +**In the shell:** +```sql +spark-sql> CREATE TABLE test (id INT, name STRING); +spark-sql> INSERT INTO test VALUES (1, 'Alice'), (2, 'Bob'); +spark-sql> SELECT * FROM test; +``` + +### run-example + +Run Spark example programs. + +**Usage:** +```bash +./bin/run-example [params] +``` + +**Examples:** +```bash +# Run SparkPi example +./bin/run-example SparkPi 100 + +# Run with specific master +MASTER=spark://master:7077 ./bin/run-example SparkPi + +# Run SQL example +./bin/run-example sql.SparkSQLExample +``` + +## Utility Scripts + +### spark-class + +Internal script to run Spark classes. Usually not called directly by users. + +**Usage:** +```bash +./bin/spark-class [options] +``` + +### load-spark-env.sh + +Loads Spark environment variables from conf/spark-env.sh. Sourced by other scripts. + +## Configuration + +Scripts read configuration from: + +1. **Environment variables**: Set in shell or `conf/spark-env.sh` +2. **Command-line options**: Passed via `--conf` or specific flags +3. **Configuration files**: `conf/spark-defaults.conf` + +### Common Environment Variables + +```bash +# Java +export JAVA_HOME=/path/to/java + +# Spark +export SPARK_HOME=/path/to/spark +export SPARK_MASTER_HOST=master-hostname +export SPARK_MASTER_PORT=7077 + +# Python +export PYSPARK_PYTHON=python3 +export PYSPARK_DRIVER_PYTHON=python3 + +# Memory +export SPARK_DRIVER_MEMORY=2g +export SPARK_EXECUTOR_MEMORY=4g + +# Logging +export SPARK_LOG_DIR=/var/log/spark +``` + +Set these in `conf/spark-env.sh` for persistence. + +## Master URLs + +Scripts accept various master URL formats: + +- **local**: Run locally with one worker thread +- **local[K]**: Run locally with K worker threads +- **local[*]**: Run locally with as many worker threads as cores +- **spark://HOST:PORT**: Connect to Spark standalone cluster +- **yarn**: Connect to YARN cluster +- **k8s://HOST:PORT**: Connect to Kubernetes cluster +- **mesos://HOST:PORT**: Connect to Mesos cluster + +## Advanced Usage + +### Configuring Logging + +Create `conf/log4j2.properties`: +```properties +rootLogger.level = info +logger.spark.name = org.apache.spark +logger.spark.level = warn +``` + +### Using with Jupyter Notebook + +```bash +# Set environment variables +export PYSPARK_DRIVER_PYTHON=jupyter +export PYSPARK_DRIVER_PYTHON_OPTS='notebook' + +# Start PySpark (opens Jupyter) +./bin/pyspark +``` + +### Connecting to Remote Clusters + +```bash +# Standalone cluster +./bin/spark-submit --master spark://master:7077 app.jar + +# YARN +./bin/spark-submit --master yarn --deploy-mode cluster app.jar + +# Kubernetes +./bin/spark-submit --master k8s://https://k8s-api:6443 \ + --deploy-mode cluster \ + --conf spark.kubernetes.container.image=spark:3.5.0 \ + app.jar +``` + +### Dynamic Resource Allocation + +```bash +./bin/spark-submit \ + --conf spark.dynamicAllocation.enabled=true \ + --conf spark.dynamicAllocation.minExecutors=1 \ + --conf spark.dynamicAllocation.maxExecutors=10 \ + app.jar +``` + +## Debugging + +### Enable Verbose Output + +```bash +./bin/spark-submit --verbose ... +``` + +### Check Spark Configuration + +```bash +./bin/spark-submit --class org.example.App app.jar 2>&1 | grep -i "spark\." 
+``` + +### Remote Debugging + +```bash +# Driver debugging +./bin/spark-submit \ + --conf spark.driver.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \ + app.jar + +# Executor debugging +./bin/spark-submit \ + --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006" \ + app.jar +``` + +## Security + +### Kerberos Authentication + +```bash +./bin/spark-submit \ + --principal user@REALM \ + --keytab /path/to/user.keytab \ + --master yarn \ + app.jar +``` + +### SSL Configuration + +```bash +./bin/spark-submit \ + --conf spark.ssl.enabled=true \ + --conf spark.ssl.keyStore=/path/to/keystore \ + --conf spark.ssl.keyStorePassword=password \ + app.jar +``` + +## Performance Tuning + +### Memory Configuration + +```bash +./bin/spark-submit \ + --driver-memory 4g \ + --executor-memory 8g \ + --conf spark.memory.fraction=0.8 \ + app.jar +``` + +### Parallelism + +```bash +./bin/spark-submit \ + --conf spark.default.parallelism=100 \ + --conf spark.sql.shuffle.partitions=200 \ + app.jar +``` + +### Serialization + +```bash +./bin/spark-submit \ + --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ + app.jar +``` + +## Troubleshooting + +### Common Issues + +**Java not found:** +```bash +export JAVA_HOME=/path/to/java +``` + +**Class not found:** +```bash +# Add dependencies +./bin/spark-submit --jars dependency.jar app.jar +``` + +**Out of memory:** +```bash +# Increase memory +./bin/spark-submit --driver-memory 8g --executor-memory 16g app.jar +``` + +**Connection refused:** +```bash +# Check master URL and firewall settings +# Verify master is running with: jps | grep Master +``` + +## Script Internals + +### Script Hierarchy + +``` +spark-submit +├── spark-class +│ └── load-spark-env.sh +└── Actual Java/Python execution +``` + +### How spark-submit Works + +1. Parse command-line arguments +2. Load configuration from `spark-defaults.conf` +3. Set up classpath and Java options +4. Call `spark-class` with appropriate arguments +5. Launch JVM with Spark application + +## Related Scripts + +For cluster management scripts, see [../sbin/README.md](../sbin/README.md). + +## Further Reading + +- [Submitting Applications](../docs/submitting-applications.md) +- [Spark Configuration](../docs/configuration.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Running on YARN](../docs/running-on-yarn.md) +- [Running on Kubernetes](../docs/running-on-kubernetes.md) + +## Examples + +More examples in [../examples/](../examples/). diff --git a/common/README.md b/common/README.md new file mode 100644 index 0000000000000..1d2890b14e6c2 --- /dev/null +++ b/common/README.md @@ -0,0 +1,472 @@ +# Spark Common Modules + +This directory contains common utilities and libraries shared across all Spark modules. + +## Overview + +The common modules provide foundational functionality used throughout Spark: + +- Network communication +- Memory management utilities +- Serialization helpers +- Configuration management +- Logging infrastructure +- Testing utilities + +These modules have no dependencies on Spark Core, allowing them to be used by any Spark component. + +## Modules + +### common/kvstore + +Key-value store abstraction for metadata storage. 
+ +**Purpose:** +- Store application metadata +- Track job and stage information +- Persist UI data + +**Location**: `kvstore/` + +**Key classes:** +- `KVStore`: Interface for key-value storage +- `LevelDB`: LevelDB-based implementation +- `InMemoryStore`: In-memory implementation for testing + +**Usage:** +```scala +val store = new LevelDB(path) +store.write(new StoreKey(id), value) +val data = store.read(classOf[ValueType], id) +``` + +### common/network-common + +Core networking abstractions and utilities. + +**Purpose:** +- RPC framework +- Block transfer protocol +- Network servers and clients + +**Location**: `network-common/` + +**Key components:** +- `TransportContext`: Network communication setup +- `TransportClient`: Network client +- `TransportServer`: Network server +- `MessageHandler`: Message processing +- `StreamManager`: Stream data management + +**Features:** +- Netty-based implementation +- Zero-copy transfers +- SSL/TLS support +- Flow control + +### common/network-shuffle + +Network shuffle service for serving shuffle data. + +**Purpose:** +- External shuffle service +- Serves shuffle blocks to executors +- Improves executor reliability + +**Location**: `network-shuffle/` + +**Key classes:** +- `ExternalShuffleService`: Standalone shuffle service +- `ExternalShuffleClient`: Client for fetching shuffle data +- `ShuffleBlockResolver`: Resolves shuffle block locations + +**Benefits:** +- Executors can be killed without losing shuffle data +- Better resource utilization +- Improved fault tolerance + +**Configuration:** +```properties +spark.shuffle.service.enabled=true +spark.shuffle.service.port=7337 +``` + +### common/network-yarn + +YARN-specific network integration. + +**Purpose:** +- Integration with YARN shuffle service +- YARN auxiliary service implementation + +**Location**: `network-yarn/` + +**Usage:** Automatically used when running on YARN with shuffle service enabled. + +### common/sketch + +Data sketching and approximate algorithms. + +**Purpose:** +- Memory-efficient approximate computations +- Probabilistic data structures + +**Location**: `sketch/` + +**Algorithms:** +- Count-Min Sketch: Frequency estimation +- Bloom Filter: Set membership testing +- HyperLogLog: Cardinality estimation + +**Usage:** +```scala +import org.apache.spark.util.sketch._ + +// Create bloom filter +val bf = BloomFilter.create(expectedItems, falsePositiveRate) +bf.put("item1") +bf.mightContain("item1") // true + +// Create count-min sketch +val cms = CountMinSketch.create(depth, width, seed) +cms.add("item", count) +val estimate = cms.estimateCount("item") +``` + +### common/tags + +Test tags for categorizing tests. + +**Purpose:** +- Tag tests for selective execution +- Categorize slow/flaky tests +- Enable/disable test groups + +**Location**: `tags/` + +**Example tags:** +- `@SlowTest`: Long-running tests +- `@ExtendedTest`: Extended test suite +- `@DockerTest`: Tests requiring Docker + +### common/unsafe + +Unsafe operations for performance-critical code. + +**Purpose:** +- Direct memory access +- Serialization without reflection +- Performance optimizations + +**Location**: `unsafe/` + +**Key classes:** +- `Platform`: Platform-specific operations +- `UnsafeAlignedOffset`: Aligned memory access +- Memory utilities for sorting and hashing + +**Warning:** These APIs are internal and subject to change. + +## Architecture + +### Layering + +``` +Spark Core / SQL / Streaming / MLlib + ↓ + Common Modules (network, kvstore, etc.) 
+ ↓ + JVM / Netty / OS +``` + +### Design Principles + +1. **No Spark Core dependencies**: Can be used independently +2. **Minimal external dependencies**: Reduce classpath conflicts +3. **High performance**: Optimized for throughput and latency +4. **Reusability**: Shared across all Spark components + +## Networking Architecture + +### Transport Layer + +The network-common module provides the foundation for all network communication in Spark. + +**Components:** + +1. **TransportContext**: Sets up network infrastructure +2. **TransportClient**: Sends requests and receives responses +3. **TransportServer**: Accepts connections and handles requests +4. **MessageHandler**: Processes incoming messages + +**Flow:** +``` +Client Server + | | + |------ Request Message ------->| + | | (Process in MessageHandler) + |<----- Response Message -------| + | | +``` + +### RPC Framework + +Built on top of the transport layer: + +```scala +// Server side +val rpcEnv = RpcEnv.create("name", host, port, conf) +val endpoint = new MyEndpoint(rpcEnv) +rpcEnv.setupEndpoint("my-endpoint", endpoint) + +// Client side +val ref = rpcEnv.setupEndpointRef("spark://host:port/my-endpoint") +val response = ref.askSync[Response](request) +``` + +### Block Transfer + +Optimized for transferring large data blocks: + +```scala +val blockTransferService = new NettyBlockTransferService(conf) +blockTransferService.fetchBlocks( + host, port, execId, blockIds, + blockFetchingListener, tempFileManager +) +``` + +## Building and Testing + +### Build Common Modules + +```bash +# Build all common modules +./build/mvn -pl 'common/*' -am package + +# Build specific module +./build/mvn -pl common/network-common -am package +``` + +### Run Tests + +```bash +# Run all common tests +./build/mvn test -pl 'common/*' + +# Run specific module tests +./build/mvn test -pl common/network-common + +# Run specific test +./build/mvn test -pl common/network-common -Dtest=TransportClientSuite +``` + +## Module Dependencies + +``` +common/unsafe (no dependencies) + ↓ +common/network-common + ↓ +common/network-shuffle + ↓ +common/network-yarn + +common/sketch (independent) +common/tags (independent) +common/kvstore (independent) +``` + +## Source Code Organization + +``` +common/ +├── kvstore/ # Key-value store +│ └── src/main/java/org/apache/spark/util/kvstore/ +├── network-common/ # Core networking +│ └── src/main/java/org/apache/spark/network/ +│ ├── client/ # Client implementation +│ ├── server/ # Server implementation +│ ├── buffer/ # Buffer management +│ ├── crypto/ # Encryption +│ ├── protocol/ # Protocol messages +│ └── util/ # Utilities +├── network-shuffle/ # Shuffle service +│ └── src/main/java/org/apache/spark/network/shuffle/ +├── network-yarn/ # YARN integration +│ └── src/main/java/org/apache/spark/network/yarn/ +├── sketch/ # Sketching algorithms +│ └── src/main/java/org/apache/spark/util/sketch/ +├── tags/ # Test tags +│ └── src/main/java/org/apache/spark/tags/ +└── unsafe/ # Unsafe operations + └── src/main/java/org/apache/spark/unsafe/ +``` + +## Performance Considerations + +### Zero-Copy Transfer + +Network modules use zero-copy techniques: +- FileRegion for file-based transfers +- Direct buffers to avoid copying +- Netty's native transport when available + +### Memory Management + +```java +// Use pooled buffers +ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT; +ByteBuf buffer = allocator.directBuffer(size); +try { + // Use buffer +} finally { + buffer.release(); +} +``` + +### Connection Pooling + +Clients reuse 
connections: +```java +TransportClientFactory factory = context.createClientFactory(); +TransportClient client = factory.createClient(host, port); +// Client is cached and reused +``` + +## Security + +### SSL/TLS Support + +Enable encryption in network communication: + +```properties +spark.ssl.enabled=true +spark.ssl.protocol=TLSv1.2 +spark.ssl.keyStore=/path/to/keystore +spark.ssl.keyStorePassword=password +spark.ssl.trustStore=/path/to/truststore +spark.ssl.trustStorePassword=password +``` + +### SASL Authentication + +Support for SASL-based authentication: + +```properties +spark.authenticate=true +spark.authenticate.secret=shared-secret +``` + +## Monitoring + +### Network Metrics + +Key metrics tracked: +- Active connections +- Bytes sent/received +- Request latency +- Connection failures + +**Access via Spark UI**: `http://:4040/metrics/` + +### Logging + +Enable detailed network logging: + +```properties +log4j.logger.org.apache.spark.network=DEBUG +log4j.logger.io.netty=DEBUG +``` + +## Configuration + +### Network Settings + +```properties +# Connection timeout +spark.network.timeout=120s + +# I/O threads +spark.network.io.numConnectionsPerPeer=1 + +# Buffer sizes +spark.network.io.preferDirectBufs=true + +# Maximum retries +spark.network.io.maxRetries=3 + +# Connection pooling +spark.rpc.numRetries=3 +spark.rpc.retry.wait=3s +``` + +### Shuffle Service + +```properties +spark.shuffle.service.enabled=true +spark.shuffle.service.port=7337 +spark.shuffle.service.index.cache.size=100m +``` + +## Best Practices + +1. **Reuse connections**: Don't create new clients unnecessarily +2. **Release buffers**: Always release ByteBuf instances +3. **Handle backpressure**: Implement flow control in handlers +4. **Enable encryption**: Use SSL for sensitive data +5. **Monitor metrics**: Track network performance +6. **Configure timeouts**: Set appropriate timeout values +7. **Use external shuffle service**: For production deployments + +## Troubleshooting + +### Connection Issues + +**Problem**: Connection refused or timeout + +**Solutions:** +- Check firewall settings +- Verify host and port +- Increase timeout values +- Check network connectivity + +### Memory Leaks + +**Problem**: Growing memory usage in network layer + +**Solutions:** +- Ensure ByteBuf.release() is called +- Check for unclosed connections +- Monitor Netty buffer pool metrics + +### Slow Performance + +**Problem**: High network latency + +**Solutions:** +- Enable native transport +- Increase I/O threads +- Adjust buffer sizes +- Check network bandwidth + +## Internal APIs + +**Note**: All classes in common modules are internal APIs and may change between versions. They are not part of the public Spark API. + +## Further Reading + +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) + +## Contributing + +For contributing to common modules, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +When adding functionality: +- Keep dependencies minimal +- Write comprehensive tests +- Document public methods +- Consider performance implications +- Maintain backward compatibility where possible diff --git a/core/README.md b/core/README.md new file mode 100644 index 0000000000000..4a5be68b0342e --- /dev/null +++ b/core/README.md @@ -0,0 +1,360 @@ +# Spark Core + +Spark Core is the foundation of the Apache Spark platform. It provides the basic functionality for distributed task dispatching, scheduling, and I/O operations. 
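As a quick orientation before the component overview, here is a minimal sketch of the Core entry point; the application name and the `local[*]` master are placeholders for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: create a SparkContext, run a parallel computation, and shut down.
object CoreQuickStart {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoreQuickStart").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Distribute a local range, transform it, and aggregate the result on the driver.
    val doubledSum = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"Sum of doubled values: $doubledSum")
    sc.stop()
  }
}
```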
+ +## Overview + +Spark Core contains the fundamental abstractions and components that all other Spark modules build upon: + +- **Resilient Distributed Datasets (RDDs)**: The fundamental data abstraction in Spark +- **SparkContext**: The main entry point for Spark functionality +- **Task Scheduling**: DAG scheduler and task scheduler for distributed execution +- **Memory Management**: Unified memory management for execution and storage +- **Shuffle System**: Data redistribution across partitions +- **Storage System**: In-memory and disk-based storage for cached data +- **Network Communication**: RPC and data transfer between driver and executors + +## Key Components + +### RDD (Resilient Distributed Dataset) + +The core abstraction in Spark - an immutable, distributed collection of objects that can be processed in parallel. + +**Key characteristics:** +- **Resilient**: Fault-tolerant through lineage information +- **Distributed**: Data is partitioned across cluster nodes +- **Immutable**: Cannot be changed once created + +**Location**: `src/main/scala/org/apache/spark/rdd/` + +**Main classes:** +- `RDD.scala`: Base RDD class with transformations and actions +- `HadoopRDD.scala`: RDD for reading from Hadoop +- `ParallelCollectionRDD.scala`: RDD created from a local collection +- `MapPartitionsRDD.scala`: Result of map-like transformations + +### SparkContext + +The main entry point for Spark functionality. Creates RDDs, accumulators, and broadcast variables. + +**Location**: `src/main/scala/org/apache/spark/SparkContext.scala` + +**Key responsibilities:** +- Connects to cluster manager +- Acquires executors +- Sends application code to executors +- Creates and manages RDDs +- Schedules and executes jobs + +### Scheduling + +#### DAGScheduler + +Computes a DAG of stages for each job and submits them to the TaskScheduler. + +**Location**: `src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala` + +**Responsibilities:** +- Determines preferred locations for tasks based on cache status +- Handles task failures and stage retries +- Identifies shuffle boundaries to split stages +- Manages job completion and failure + +#### TaskScheduler + +Submits task sets to the cluster, manages task execution, and retries failed tasks. + +**Location**: `src/main/scala/org/apache/spark/scheduler/TaskScheduler.scala` + +**Implementations:** +- `TaskSchedulerImpl`: Default implementation +- `YarnScheduler`: YARN-specific implementation +- Cluster manager-specific schedulers + +### Memory Management + +Unified memory management system that dynamically allocates memory between execution and storage. + +**Location**: `src/main/scala/org/apache/spark/memory/` + +**Components:** +- `MemoryManager`: Base memory management interface +- `UnifiedMemoryManager`: Dynamic allocation between execution and storage +- `StorageMemoryPool`: Memory pool for caching +- `ExecutionMemoryPool`: Memory pool for shuffles and joins + +**Memory regions:** +1. **Execution Memory**: Shuffles, joins, sorts, aggregations +2. **Storage Memory**: Caching and broadcasting +3. **User Memory**: User data structures +4. **Reserved Memory**: System overhead + +### Shuffle System + +Handles data redistribution between stages. + +**Location**: `src/main/scala/org/apache/spark/shuffle/` + +**Key classes:** +- `ShuffleManager`: Interface for shuffle implementations +- `SortShuffleManager`: Default shuffle implementation +- `ShuffleWriter`: Writes shuffle data +- `ShuffleReader`: Reads shuffle data + +**Shuffle process:** +1. 
**Shuffle Write**: Map tasks write partitioned data to disk +2. **Shuffle Fetch**: Reduce tasks fetch data from map outputs +3. **Shuffle Service**: External service for serving shuffle data + +### Storage System + +Block-based storage abstraction for cached data and shuffle outputs. + +**Location**: `src/main/scala/org/apache/spark/storage/` + +**Components:** +- `BlockManager`: Manages data blocks in memory and disk +- `MemoryStore`: In-memory block storage +- `DiskStore`: Disk-based block storage +- `BlockManagerMaster`: Master for coordinating block managers + +**Storage levels:** +- `MEMORY_ONLY`: Store in memory only +- `MEMORY_AND_DISK`: Spill to disk if memory is full +- `DISK_ONLY`: Store on disk only +- `OFF_HEAP`: Store in off-heap memory + +### Network Layer + +Communication infrastructure for driver-executor and executor-executor communication. + +**Location**: `src/main/scala/org/apache/spark/network/` and `common/network-*/` + +**Components:** +- `NettyRpcEnv`: Netty-based RPC implementation +- `TransportContext`: Network communication setup +- `BlockTransferService`: Block data transfer + +### Serialization + +Efficient serialization for data and closures. + +**Location**: `src/main/scala/org/apache/spark/serializer/` + +**Serializers:** +- `JavaSerializer`: Default Java serialization (slower) +- `KryoSerializer`: Faster, more compact serialization (recommended) + +**Configuration:** +```scala +conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") +``` + +## API Overview + +### Creating RDDs + +```scala +// From a local collection +val data = Array(1, 2, 3, 4, 5) +val rdd = sc.parallelize(data) + +// From external storage +val textFile = sc.textFile("hdfs://path/to/file") + +// From another RDD +val mapped = rdd.map(_ * 2) +``` + +### Transformations + +Lazy operations that define a new RDD: + +```scala +val mapped = rdd.map(x => x * 2) +val filtered = rdd.filter(x => x > 10) +val flatMapped = rdd.flatMap(x => x.toString.split(" ")) +``` + +### Actions + +Operations that trigger computation: + +```scala +val count = rdd.count() +val collected = rdd.collect() +val reduced = rdd.reduce(_ + _) +rdd.saveAsTextFile("hdfs://path/to/output") +``` + +### Caching + +```scala +// Cache in memory +rdd.cache() + +// Cache with specific storage level +rdd.persist(StorageLevel.MEMORY_AND_DISK) + +// Remove from cache +rdd.unpersist() +``` + +## Configuration + +Key configuration parameters (set via `SparkConf`): + +### Memory +- `spark.executor.memory`: Executor memory (default: 1g) +- `spark.memory.fraction`: Fraction for execution and storage (default: 0.6) +- `spark.memory.storageFraction`: Fraction of spark.memory.fraction for storage (default: 0.5) + +### Parallelism +- `spark.default.parallelism`: Default number of partitions (default: number of cores) +- `spark.sql.shuffle.partitions`: Partitions for shuffle operations (default: 200) + +### Scheduling +- `spark.scheduler.mode`: FIFO or FAIR (default: FIFO) +- `spark.locality.wait`: Wait time for data-local tasks (default: 3s) + +### Shuffle +- `spark.shuffle.compress`: Compress shuffle output (default: true) +- `spark.shuffle.spill.compress`: Compress shuffle spills (default: true) + +See [configuration.md](../docs/configuration.md) for complete list. + +## Architecture + +### Job Execution Flow + +1. **Action called** → Triggers job submission +2. **DAG construction** → DAGScheduler creates stages +3. **Task creation** → Each stage becomes a task set +4. 
**Task scheduling** → TaskScheduler assigns tasks to executors +5. **Task execution** → Executors run tasks +6. **Result collection** → Results returned to driver + +### Fault Tolerance + +Spark achieves fault tolerance through: + +1. **RDD Lineage**: Each RDD knows how to recompute from its parent RDDs +2. **Task Retry**: Failed tasks are automatically retried +3. **Stage Retry**: Failed stages are re-executed +4. **Checkpoint**: Optionally save RDD to stable storage + +## Building and Testing + +### Build Core Module + +```bash +# Build core only +./build/mvn -pl core -DskipTests package + +# Build core with dependencies +./build/mvn -pl core -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all core tests +./build/mvn test -pl core + +# Run specific test suite +./build/mvn test -pl core -Dtest=SparkContextSuite + +# Run specific test +./build/mvn test -pl core -Dtest=SparkContextSuite#testJobCancellation +``` + +## Source Code Organization + +``` +core/src/main/ +├── java/ # Java sources +│ └── org/apache/spark/ +│ ├── api/ # Java API +│ ├── shuffle/ # Shuffle implementation +│ └── unsafe/ # Unsafe operations +├── scala/ # Scala sources +│ └── org/apache/spark/ +│ ├── rdd/ # RDD implementations +│ ├── scheduler/ # Scheduling components +│ ├── storage/ # Storage system +│ ├── memory/ # Memory management +│ ├── shuffle/ # Shuffle system +│ ├── broadcast/ # Broadcast variables +│ ├── deploy/ # Deployment components +│ ├── executor/ # Executor implementation +│ ├── io/ # I/O utilities +│ ├── network/ # Network layer +│ ├── serializer/ # Serialization +│ └── util/ # Utilities +└── resources/ # Resource files +``` + +## Performance Tuning + +### Memory Optimization + +1. Adjust memory fractions based on workload +2. Use off-heap memory for large datasets +3. Choose appropriate storage levels +4. Avoid excessive caching + +### Shuffle Optimization + +1. Minimize shuffle operations +2. Use `reduceByKey` instead of `groupByKey` +3. Increase shuffle parallelism +4. Enable compression + +### Serialization Optimization + +1. Use Kryo serialization +2. Register custom classes with Kryo +3. Avoid closures with large objects + +### Data Locality + +1. Ensure data and compute are co-located +2. Increase `spark.locality.wait` if needed +3. Use appropriate storage levels + +## Common Issues and Solutions + +### OutOfMemoryError + +- Increase executor memory +- Reduce parallelism +- Use disk-based storage levels +- Enable off-heap memory + +### Shuffle Failures + +- Increase shuffle memory +- Increase shuffle parallelism +- Enable external shuffle service + +### Slow Performance + +- Check data skew +- Optimize shuffle operations +- Increase parallelism +- Enable speculation + +## Further Reading + +- [RDD Programming Guide](../docs/rdd-programming-guide.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Tuning Guide](../docs/tuning.md) +- [Job Scheduling](../docs/job-scheduling.md) +- [Hardware Provisioning](../docs/hardware-provisioning.md) + +## Related Modules + +- [common/](../common/) - Common utilities shared across modules +- [launcher/](../launcher/) - Application launcher +- [sql/](../sql/) - Spark SQL and DataFrames +- [streaming/](../streaming/) - Spark Streaming diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000000000..964dfaf3393c3 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,432 @@ +# Spark Examples + +This directory contains example programs for Apache Spark in Scala, Java, Python, and R. 
+ +## Overview + +The examples demonstrate various Spark features and APIs: + +- **Core Examples**: Basic RDD operations and transformations +- **SQL Examples**: DataFrame and SQL operations +- **Streaming Examples**: Stream processing with DStreams and Structured Streaming +- **MLlib Examples**: Machine learning algorithms and pipelines +- **GraphX Examples**: Graph processing algorithms + +## Running Examples + +### Using spark-submit + +The recommended way to run examples: + +```bash +# Run a Scala/Java example +./bin/run-example [params] + +# Example: Run SparkPi +./bin/run-example SparkPi 100 + +# Example: Run with specific master +MASTER=spark://host:7077 ./bin/run-example SparkPi 100 +``` + +### Direct spark-submit + +```bash +# Scala/Java examples +./bin/spark-submit \ + --class org.apache.spark.examples.SparkPi \ + --master local[4] \ + examples/target/scala-2.13/jars/spark-examples*.jar \ + 100 + +# Python examples +./bin/spark-submit examples/src/main/python/pi.py 100 + +# R examples +./bin/spark-submit examples/src/main/r/dataframe.R +``` + +### Interactive Shells + +```bash +# Scala shell with examples on classpath +./bin/spark-shell --jars examples/target/scala-2.13/jars/spark-examples*.jar + +# Python shell +./bin/pyspark +# Then run: exec(open('examples/src/main/python/pi.py').read()) + +# R shell +./bin/sparkR +# Then: source('examples/src/main/r/dataframe.R') +``` + +## Example Categories + +### Core Examples + +**Basic RDD Operations** + +- `SparkPi`: Estimates π using Monte Carlo method +- `SparkLR`: Logistic regression using gradient descent +- `SparkKMeans`: K-means clustering +- `SparkPageRank`: PageRank algorithm implementation +- `GroupByTest`: Tests groupBy performance + +**Locations:** +- Scala: `src/main/scala/org/apache/spark/examples/` +- Java: `src/main/java/org/apache/spark/examples/` +- Python: `src/main/python/` +- R: `src/main/r/` + +### SQL Examples + +**DataFrame and SQL Operations** + +- `SparkSQLExample`: Basic DataFrame operations +- `SQLDataSourceExample`: Working with various data sources +- `RDDRelation`: Converting between RDDs and DataFrames +- `UserDefinedFunction`: Creating and using UDFs +- `CsvDataSource`: Reading and writing CSV files + +**Running:** +```bash +# Scala +./bin/run-example sql.SparkSQLExample + +# Python +./bin/spark-submit examples/src/main/python/sql/basic.py + +# R +./bin/spark-submit examples/src/main/r/RSparkSQLExample.R +``` + +### Streaming Examples + +**DStream Examples (Legacy)** + +- `NetworkWordCount`: Count words from network stream +- `StatefulNetworkWordCount`: Stateful word count +- `RecoverableNetworkWordCount`: Checkpoint and recovery +- `KafkaWordCount`: Read from Apache Kafka +- `QueueStream`: Create DStream from queue + +**Structured Streaming Examples** + +- `StructuredNetworkWordCount`: Word count using Structured Streaming +- `StructuredKafkaWordCount`: Kafka integration +- `StructuredSessionization`: Session window operations + +**Running:** +```bash +# DStream example +./bin/run-example streaming.NetworkWordCount localhost 9999 + +# Structured Streaming +./bin/run-example sql.streaming.StructuredNetworkWordCount localhost 9999 + +# Python Structured Streaming +./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999 +``` + +### MLlib Examples + +**Classification** +- `LogisticRegressionExample`: Binary and multiclass classification +- `DecisionTreeClassificationExample`: Decision tree classifier +- `RandomForestClassificationExample`: Random forest 
classifier +- `GradientBoostedTreeClassifierExample`: GBT classifier +- `NaiveBayesExample`: Naive Bayes classifier + +**Regression** +- `LinearRegressionExample`: Linear regression +- `DecisionTreeRegressionExample`: Decision tree regressor +- `RandomForestRegressionExample`: Random forest regressor +- `AFTSurvivalRegressionExample`: Survival regression + +**Clustering** +- `KMeansExample`: K-means clustering +- `BisectingKMeansExample`: Bisecting K-means +- `GaussianMixtureExample`: Gaussian mixture model +- `LDAExample`: Latent Dirichlet Allocation + +**Pipelines** +- `PipelineExample`: ML Pipeline with multiple stages +- `CrossValidatorExample`: Model selection with cross-validation +- `TrainValidationSplitExample`: Model selection with train/validation split + +**Running:** +```bash +# Scala +./bin/run-example ml.LogisticRegressionExample + +# Java +./bin/run-example ml.JavaLogisticRegressionExample + +# Python +./bin/spark-submit examples/src/main/python/ml/logistic_regression.py +``` + +### GraphX Examples + +**Graph Algorithms** + +- `PageRankExample`: PageRank algorithm +- `ConnectedComponentsExample`: Finding connected components +- `TriangleCountExample`: Counting triangles +- `SocialNetworkExample`: Social network analysis + +**Running:** +```bash +./bin/run-example graphx.PageRankExample +``` + +## Example Datasets + +Many examples use sample data from the `data/` directory: + +- `data/mllib/`: MLlib sample datasets + - `sample_libsvm_data.txt`: LibSVM format data + - `sample_binary_classification_data.txt`: Binary classification + - `sample_multiclass_classification_data.txt`: Multiclass classification + +- `data/graphx/`: GraphX sample data + - `followers.txt`: Social network follower data + - `users.txt`: User information + +## Building Examples + +### Build All Examples + +```bash +# Build examples module +./build/mvn -pl examples -am package + +# Skip tests +./build/mvn -pl examples -am -DskipTests package +``` + +### Build Specific Language Examples + +The examples are compiled together, but you can run them separately by language. 
+ +## Creating Your Own Examples + +### Scala Example Template + +```scala +package org.apache.spark.examples + +import org.apache.spark.sql.SparkSession + +object MyExample { + def main(args: Array[String]): Unit = { + val spark = SparkSession + .builder() + .appName("My Example") + .getOrCreate() + + try { + // Your Spark code here + import spark.implicits._ + val df = spark.range(100).toDF("number") + df.show() + } finally { + spark.stop() + } + } +} +``` + +### Python Example Template + +```python +from pyspark.sql import SparkSession + +def main(): + spark = SparkSession \ + .builder \ + .appName("My Example") \ + .getOrCreate() + + try: + # Your Spark code here + df = spark.range(100) + df.show() + finally: + spark.stop() + +if __name__ == "__main__": + main() +``` + +### Java Example Template + +```java +package org.apache.spark.examples; + +import org.apache.spark.sql.SparkSession; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; + +public class MyExample { + public static void main(String[] args) { + SparkSession spark = SparkSession + .builder() + .appName("My Example") + .getOrCreate(); + + try { + // Your Spark code here + Dataset df = spark.range(100); + df.show(); + } finally { + spark.stop(); + } + } +} +``` + +### R Example Template + +```r +library(SparkR) + +sparkR.session(appName = "My Example") + +# Your Spark code here +df <- createDataFrame(data.frame(number = 1:100)) +head(df) + +sparkR.session.stop() +``` + +## Example Directory Structure + +``` +examples/src/main/ +├── java/org/apache/spark/examples/ # Java examples +│ ├── JavaSparkPi.java +│ ├── JavaWordCount.java +│ ├── ml/ # ML examples +│ ├── sql/ # SQL examples +│ └── streaming/ # Streaming examples +├── python/ # Python examples +│ ├── pi.py +│ ├── wordcount.py +│ ├── ml/ # ML examples +│ ├── sql/ # SQL examples +│ └── streaming/ # Streaming examples +├── r/ # R examples +│ ├── RSparkSQLExample.R +│ ├── ml.R +│ └── dataframe.R +└── scala/org/apache/spark/examples/ # Scala examples + ├── SparkPi.scala + ├── SparkLR.scala + ├── ml/ # ML examples + ├── sql/ # SQL examples + ├── streaming/ # Streaming examples + └── graphx/ # GraphX examples +``` + +## Common Patterns + +### Reading Data + +```scala +// Text file +val textData = spark.read.textFile("path/to/file.txt") + +// CSV +val csvData = spark.read.option("header", "true").csv("path/to/file.csv") + +// JSON +val jsonData = spark.read.json("path/to/file.json") + +// Parquet +val parquetData = spark.read.parquet("path/to/file.parquet") +``` + +### Writing Data + +```scala +// Save as text +df.write.text("output/path") + +// Save as CSV +df.write.option("header", "true").csv("output/path") + +// Save as Parquet +df.write.parquet("output/path") + +// Save as JSON +df.write.json("output/path") +``` + +### Working with Partitions + +```scala +// Repartition for more parallelism +val repartitioned = df.repartition(10) + +// Coalesce to reduce partitions +val coalesced = df.coalesce(2) + +// Partition by column when writing +df.write.partitionBy("year", "month").parquet("output/path") +``` + +## Performance Tips for Examples + +1. **Use Local Mode for Testing**: Start with `local[*]` for development +2. **Adjust Partitions**: Use appropriate partition counts for your data size +3. **Cache When Reusing**: Cache DataFrames/RDDs that are accessed multiple times +4. 
**Monitor Jobs**: Use Spark UI at http://localhost:4040 to monitor execution + +## Troubleshooting + +### Common Issues + +**OutOfMemoryError** +```bash +# Increase driver memory +./bin/spark-submit --driver-memory 4g examples/... + +# Increase executor memory +./bin/spark-submit --executor-memory 4g examples/... +``` + +**Class Not Found** +```bash +# Make sure examples JAR is built +./build/mvn -pl examples -am package +``` + +**File Not Found** +```bash +# Use absolute paths or ensure working directory is spark root +./bin/run-example SparkPi # Run from spark root directory +``` + +## Additional Resources + +- [Quick Start Guide](../docs/quick-start.md) +- [Programming Guide](../docs/programming-guide.md) +- [SQL Programming Guide](../docs/sql-programming-guide.md) +- [MLlib Guide](../docs/ml-guide.md) +- [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) +- [GraphX Guide](../docs/graphx-programming-guide.md) + +## Contributing Examples + +When adding new examples: + +1. Follow existing code style and structure +2. Include clear comments explaining the example +3. Add appropriate documentation +4. Test the example with various inputs +5. Add to the appropriate category +6. Update this README + +For more information, see [CONTRIBUTING.md](../CONTRIBUTING.md). diff --git a/graphx/README.md b/graphx/README.md new file mode 100644 index 0000000000000..08c841b6c04d5 --- /dev/null +++ b/graphx/README.md @@ -0,0 +1,549 @@ +# GraphX + +GraphX is Apache Spark's API for graphs and graph-parallel computation. + +## Overview + +GraphX unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system. It provides: + +- **Graph Abstraction**: Efficient representation of property graphs +- **Graph Algorithms**: PageRank, Connected Components, Triangle Counting, and more +- **Pregel API**: For iterative graph computations +- **Graph Builders**: Tools to construct graphs from RDDs or files +- **Graph Operators**: Transformations and structural operations + +## Key Concepts + +### Property Graph + +A directed multigraph with properties attached to each vertex and edge. + +**Components:** +- **Vertices**: Nodes with unique IDs and properties +- **Edges**: Directed connections between vertices with properties +- **Triplets**: A view joining vertices and edges + +```scala +import org.apache.spark.graphx._ + +// Create vertices RDD +val vertices: RDD[(VertexId, String)] = sc.parallelize(Array( + (1L, "Alice"), + (2L, "Bob"), + (3L, "Charlie") +)) + +// Create edges RDD +val edges: RDD[Edge[String]] = sc.parallelize(Array( + Edge(1L, 2L, "friend"), + Edge(2L, 3L, "follow") +)) + +// Build the graph +val graph: Graph[String, String] = Graph(vertices, edges) +``` + +### Graph Structure + +``` +Graph[VD, ED] + - vertices: VertexRDD[VD] // Vertices with properties of type VD + - edges: EdgeRDD[ED] // Edges with properties of type ED + - triplets: RDD[EdgeTriplet[VD, ED]] // Combined view +``` + +## Core Components + +### Graph Class + +The main graph abstraction. 
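+
+Before the class details below, here is a minimal sketch of the triplet view this class exposes, reusing the small `graph: Graph[String, String]` built in the Key Concepts section above (so `graph` and its string properties are carried over from that snippet, not new API):
+
+```scala
+// Reuses `graph: Graph[String, String]` from the Key Concepts snippet above.
+// Each EdgeTriplet combines an edge with its source and destination vertex properties.
+val relations: Array[String] = graph.triplets
+  .map(t => s"${t.srcAttr} --[${t.attr}]--> ${t.dstAttr}")
+  .collect()
+
+relations.foreach(println)
+// e.g. Alice --[friend]--> Bob
+//      Bob --[follow]--> Charlie
+```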
+ +**Location**: `src/main/scala/org/apache/spark/graphx/Graph.scala` + +**Key methods:** +- `vertices: VertexRDD[VD]`: Access vertices +- `edges: EdgeRDD[ED]`: Access edges +- `triplets: RDD[EdgeTriplet[VD, ED]]`: Access triplets +- `mapVertices[VD2](map: (VertexId, VD) => VD2)`: Transform vertex properties +- `mapEdges[ED2](map: Edge[ED] => ED2)`: Transform edge properties +- `subgraph(epred, vpred)`: Create subgraph based on predicates + +### VertexRDD + +Optimized RDD for vertex data. + +**Location**: `src/main/scala/org/apache/spark/graphx/VertexRDD.scala` + +**Features:** +- Fast lookups by vertex ID +- Efficient joins with edge data +- Reuse of vertex indices + +### EdgeRDD + +Optimized RDD for edge data. + +**Location**: `src/main/scala/org/apache/spark/graphx/EdgeRDD.scala` + +**Features:** +- Compact edge storage +- Fast filtering and mapping +- Efficient partitioning + +### EdgeTriplet + +Represents a edge with its source and destination vertex properties. + +**Structure:** +```scala +class EdgeTriplet[VD, ED] extends Edge[ED] { + var srcAttr: VD // Source vertex property + var dstAttr: VD // Destination vertex property + var attr: ED // Edge property +} +``` + +## Graph Operators + +### Property Operators + +```scala +// Map vertex properties +val newGraph = graph.mapVertices((id, attr) => attr.toUpperCase) + +// Map edge properties +val newGraph = graph.mapEdges(e => e.attr + "relationship") + +// Map triplets (access to src and dst properties) +val newGraph = graph.mapTriplets(triplet => + (triplet.srcAttr, triplet.attr, triplet.dstAttr) +) +``` + +### Structural Operators + +```scala +// Reverse edge directions +val reversedGraph = graph.reverse + +// Create subgraph +val subgraph = graph.subgraph( + epred = e => e.srcId != e.dstId, // No self-loops + vpred = (id, attr) => attr.length > 0 // Non-empty names +) + +// Mask graph (keep only edges/vertices in another graph) +val maskedGraph = graph.mask(subgraph) + +// Group edges +val groupedGraph = graph.groupEdges((e1, e2) => e1 + e2) +``` + +### Join Operators + +```scala +// Join vertices with external data +val newData: RDD[(VertexId, NewType)] = ... +val newGraph = graph.joinVertices(newData) { + (id, oldAttr, newAttr) => (oldAttr, newAttr) +} + +// Outer join vertices +val newGraph = graph.outerJoinVertices(newData) { + (id, oldAttr, newAttr) => newAttr.getOrElse(oldAttr) +} +``` + +## Graph Algorithms + +GraphX includes several common graph algorithms. + +**Location**: `src/main/scala/org/apache/spark/graphx/lib/` + +### PageRank + +Measures the importance of each vertex based on link structure. + +```scala +import org.apache.spark.graphx.lib.PageRank + +// Static PageRank (fixed iterations) +val ranks = graph.staticPageRank(numIter = 10) + +// Dynamic PageRank (convergence-based) +val ranks = graph.pageRank(tol = 0.001) + +// Get top ranked vertices +val topRanked = ranks.vertices.top(10)(Ordering.by(_._2)) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/PageRank.scala` + +### Connected Components + +Finds connected components in the graph. + +```scala +import org.apache.spark.graphx.lib.ConnectedComponents + +// Find connected components +val cc = graph.connectedComponents() + +// Count vertices in each component +val componentCounts = cc.vertices + .map { case (id, component) => (component, 1) } + .reduceByKey(_ + _) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala` + +### Triangle Counting + +Counts triangles (3-cliques) in the graph. 
+ +```scala +import org.apache.spark.graphx.lib.TriangleCount + +// Count triangles +val triCounts = graph.triangleCount() + +// Get vertices with most triangles +val topTriangles = triCounts.vertices.top(10)(Ordering.by(_._2)) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala` + +### Label Propagation + +Community detection algorithm. + +```scala +import org.apache.spark.graphx.lib.LabelPropagation + +// Run label propagation +val communities = graph.labelPropagation(maxSteps = 5) + +// Group vertices by community +val communityGroups = communities.vertices + .map { case (id, label) => (label, Set(id)) } + .reduceByKey(_ ++ _) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala` + +### Strongly Connected Components + +Finds strongly connected components in a directed graph. + +```scala +import org.apache.spark.graphx.lib.StronglyConnectedComponents + +// Find strongly connected components +val scc = graph.stronglyConnectedComponents(numIter = 10) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala` + +### Shortest Paths + +Computes shortest paths from source vertices to all reachable vertices. + +```scala +import org.apache.spark.graphx.lib.ShortestPaths + +// Compute shortest paths from vertices 1 and 2 +val landmarks = Seq(1L, 2L) +val results = graph.shortestPaths(landmarks) + +// Results contain distance to each landmark +results.vertices.foreach { case (id, distances) => + println(s"Vertex $id: $distances") +} +``` + +**File**: `src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala` + +## Pregel API + +Bulk-synchronous parallel messaging abstraction for iterative graph algorithms. + +```scala +def pregel[A: ClassTag]( + initialMsg: A, + maxIterations: Int = Int.MaxValue, + activeDirection: EdgeDirection = EdgeDirection.Either +)( + vprog: (VertexId, VD, A) => VD, + sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], + mergeMsg: (A, A) => A +): Graph[VD, ED] +``` + +**Example: Single-Source Shortest Path** + +```scala +val sourceId: VertexId = 1L + +// Initialize distances +val initialGraph = graph.mapVertices((id, _) => + if (id == sourceId) 0.0 else Double.PositiveInfinity +) + +// Run Pregel +val sssp = initialGraph.pregel(Double.PositiveInfinity)( + // Vertex program: update vertex value with minimum distance + (id, dist, newDist) => math.min(dist, newDist), + + // Send message: send distance + edge weight to neighbors + triplet => { + if (triplet.srcAttr + triplet.attr < triplet.dstAttr) { + Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) + } else { + Iterator.empty + } + }, + + // Merge messages: take minimum distance + (a, b) => math.min(a, b) +) +``` + +**File**: `src/main/scala/org/apache/spark/graphx/Pregel.scala` + +## Graph Builders + +### From Edge List + +```scala +// Load edge list from file +val graph = GraphLoader.edgeListFile(sc, "path/to/edges.txt") + +// Edge file format: source destination +// Example: +// 1 2 +// 2 3 +// 3 1 +``` + +### From RDDs + +```scala +val vertices: RDD[(VertexId, VD)] = ... +val edges: RDD[Edge[ED]] = ... + +val graph = Graph(vertices, edges) + +// With default vertex property +val graph = Graph.fromEdges(edges, defaultValue = "Unknown") + +// From edge tuples +val edgeTuples: RDD[(VertexId, VertexId)] = ... +val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1) +``` + +## Partitioning Strategies + +Efficient graph partitioning is crucial for performance. 
+ +**Available strategies:** +- `EdgePartition1D`: Partition edges by source vertex +- `EdgePartition2D`: 2D matrix partitioning +- `RandomVertexCut`: Random edge partitioning (default) +- `CanonicalRandomVertexCut`: Similar to RandomVertexCut but canonical + +```scala +import org.apache.spark.graphx.PartitionStrategy + +val graph = Graph(vertices, edges) + .partitionBy(PartitionStrategy.EdgePartition2D) +``` + +**Location**: `src/main/scala/org/apache/spark/graphx/PartitionStrategy.scala` + +## Performance Optimization + +### Caching + +```scala +// Cache graph in memory +graph.cache() + +// Or persist with storage level +graph.persist(StorageLevel.MEMORY_AND_DISK) + +// Unpersist when done +graph.unpersist() +``` + +### Partitioning + +```scala +// Repartition for better balance +val partitionedGraph = graph + .partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 100) + .cache() +``` + +### Checkpointing + +For iterative algorithms, checkpoint periodically: + +```scala +sc.setCheckpointDir("hdfs://checkpoint") + +var graph = initialGraph +for (i <- 1 to maxIterations) { + // Perform iteration + graph = performIteration(graph) + + // Checkpoint every 10 iterations + if (i % 10 == 0) { + graph.checkpoint() + } +} +``` + +## Building and Testing + +### Build GraphX Module + +```bash +# Build graphx module +./build/mvn -pl graphx -am package + +# Skip tests +./build/mvn -pl graphx -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all graphx tests +./build/mvn test -pl graphx + +# Run specific test suite +./build/mvn test -pl graphx -Dtest=GraphSuite +``` + +## Source Code Organization + +``` +graphx/src/main/ +├── scala/org/apache/spark/graphx/ +│ ├── Graph.scala # Main graph class +│ ├── GraphOps.scala # Graph operations +│ ├── VertexRDD.scala # Vertex RDD +│ ├── EdgeRDD.scala # Edge RDD +│ ├── Edge.scala # Edge class +│ ├── EdgeTriplet.scala # Edge triplet +│ ├── Pregel.scala # Pregel API +│ ├── GraphLoader.scala # Graph loading utilities +│ ├── PartitionStrategy.scala # Partitioning strategies +│ ├── impl/ # Implementation details +│ │ ├── GraphImpl.scala # Graph implementation +│ │ ├── VertexRDDImpl.scala # VertexRDD implementation +│ │ ├── EdgeRDDImpl.scala # EdgeRDD implementation +│ │ └── ReplicatedVertexView.scala # Vertex replication +│ ├── lib/ # Graph algorithms +│ │ ├── PageRank.scala +│ │ ├── ConnectedComponents.scala +│ │ ├── TriangleCount.scala +│ │ ├── LabelPropagation.scala +│ │ ├── StronglyConnectedComponents.scala +│ │ └── ShortestPaths.scala +│ └── util/ # Utilities +│ ├── BytecodeUtils.scala +│ └── GraphGenerators.scala # Test graph generation +└── resources/ +``` + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/graphx/](../examples/src/main/scala/org/apache/spark/examples/graphx/) for complete examples. 
+ +**Key examples:** +- `PageRankExample.scala`: PageRank on social network +- `ConnectedComponentsExample.scala`: Finding connected components +- `SocialNetworkExample.scala`: Complete social network analysis + +## Common Use Cases + +### Social Network Analysis + +```scala +// Load social network +val users: RDD[(VertexId, String)] = sc.textFile("users.txt") + .map(line => (line.split(",")(0).toLong, line.split(",")(1))) + +val relationships: RDD[Edge[String]] = sc.textFile("relationships.txt") + .map { line => + val fields = line.split(",") + Edge(fields(0).toLong, fields(1).toLong, fields(2)) + } + +val graph = Graph(users, relationships) + +// Find influential users (PageRank) +val ranks = graph.pageRank(0.001).vertices + +// Find communities +val communities = graph.labelPropagation(5) + +// Count mutual friends (triangles) +val triangles = graph.triangleCount() +``` + +### Web Graph Analysis + +```scala +// Load web graph +val graph = GraphLoader.edgeListFile(sc, "web-graph.txt") + +// Compute PageRank +val ranks = graph.pageRank(0.001) + +// Find authoritative pages +val topPages = ranks.vertices.top(100)(Ordering.by(_._2)) +``` + +### Road Network Analysis + +```scala +// Vertices are intersections, edges are roads +val roadNetwork: Graph[String, Double] = ... + +// Find shortest paths from landmarks +val landmarks = Seq(1L, 2L, 3L) +val distances = roadNetwork.shortestPaths(landmarks) + +// Find highly connected intersections +val degrees = roadNetwork.degrees +val busyIntersections = degrees.top(10)(Ordering.by(_._2)) +``` + +## Best Practices + +1. **Partition carefully**: Use appropriate partitioning strategy for your workload +2. **Cache graphs**: Cache graphs that are accessed multiple times +3. **Avoid unnecessary materialization**: GraphX uses lazy evaluation +4. **Use GraphLoader**: For simple edge lists, use GraphLoader +5. **Monitor memory**: Graph algorithms can be memory-intensive +6. **Checkpoint long lineages**: Checkpoint periodically in iterative algorithms +7. **Consider edge direction**: Many operations respect edge direction + +## Limitations and Considerations + +- **No mutable graphs**: Graphs are immutable; modifications create new graphs +- **Memory overhead**: Vertex replication can increase memory usage +- **Edge direction**: Operations may behave differently on directed vs undirected graphs +- **Single-machine graphs**: For small graphs (< 1M edges), NetworkX or igraph may be faster + +## Further Reading + +- [GraphX Programming Guide](../docs/graphx-programming-guide.md) +- [GraphX Paper](http://www.vldb.org/pvldb/vol7/p1673-xin.pdf) +- [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf) + +## Contributing + +For contributing to GraphX, see [CONTRIBUTING.md](../CONTRIBUTING.md). diff --git a/mllib/README.md b/mllib/README.md new file mode 100644 index 0000000000000..dd62159f84fef --- /dev/null +++ b/mllib/README.md @@ -0,0 +1,514 @@ +# MLlib - Machine Learning Library + +MLlib is Apache Spark's scalable machine learning library. + +## Overview + +MLlib provides: + +- **ML Algorithms**: Classification, regression, clustering, collaborative filtering +- **Featurization**: Feature extraction, transformation, dimensionality reduction, selection +- **Pipelines**: Tools for constructing, evaluating, and tuning ML workflows +- **Utilities**: Linear algebra, statistics, data handling + +## Important Note + +MLlib includes two packages: + +1. 
**`spark.ml`** (DataFrame-based API) - **Primary API** (Recommended) +2. **`spark.mllib`** (RDD-based API) - **Maintenance mode only** + +The RDD-based API (`spark.mllib`) is in maintenance mode. The DataFrame-based API (`spark.ml`) is the primary API and is recommended for all new applications. + +## Package Structure + +### spark.ml (Primary API) + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/` + +DataFrame-based API with: +- **ML Pipeline API**: For building ML workflows +- **Transformers**: Feature transformers +- **Estimators**: Learning algorithms +- **Models**: Fitted models + +```scala +import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.feature.VectorAssembler + +// Create pipeline +val assembler = new VectorAssembler() + .setInputCols(Array("feature1", "feature2")) + .setOutputCol("features") + +val lr = new LogisticRegression() + .setMaxIter(10) + +val pipeline = new Pipeline().setStages(Array(assembler, lr)) + +// Fit model +val model = pipeline.fit(trainingData) + +// Make predictions +val predictions = model.transform(testData) +``` + +### spark.mllib (RDD-based API - Maintenance Mode) + +**Location**: `src/main/scala/org/apache/spark/mllib/` + +RDD-based API with: +- Classic algorithms using RDDs +- Maintained for backward compatibility +- No new features added + +```scala +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.regression.LabeledPoint + +// Train model (old API) +val data: RDD[LabeledPoint] = ... +val model = LogisticRegressionWithLBFGS.train(data) + +// Make predictions +val predictions = data.map { point => model.predict(point.features) } +``` + +## Key Concepts + +### Pipeline API (spark.ml) + +Machine learning pipelines provide: + +1. **DataFrame**: Unified data representation +2. **Transformer**: Algorithms that transform DataFrames +3. **Estimator**: Algorithms that fit on DataFrames to produce Transformers +4. **Pipeline**: Chains multiple Transformers and Estimators +5. **Parameter**: Common API for specifying parameters + +**Example Pipeline:** +```scala +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.feature.{HashingTF, Tokenizer} + +// Configure pipeline stages +val tokenizer = new Tokenizer() + .setInputCol("text") + .setOutputCol("words") + +val hashingTF = new HashingTF() + .setInputCol("words") + .setOutputCol("features") + +val lr = new LogisticRegression() + .setMaxIter(10) + +val pipeline = new Pipeline() + .setStages(Array(tokenizer, hashingTF, lr)) + +// Fit the pipeline +val model = pipeline.fit(trainingData) + +// Make predictions +model.transform(testData) +``` + +### Transformers + +Algorithms that transform one DataFrame into another. + +**Examples:** +- `Tokenizer`: Splits text into words +- `HashingTF`: Maps word sequences to feature vectors +- `StandardScaler`: Normalizes features +- `VectorAssembler`: Combines multiple columns into a vector +- `PCA`: Dimensionality reduction + +### Estimators + +Algorithms that fit on a DataFrame to produce a Transformer. 
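+
+For instance, the Estimator/Transformer split can be seen with `StringIndexer` from the list below: `fit()` learns a label-to-index mapping and returns a `StringIndexerModel`, which is itself a Transformer. A minimal sketch (the `spark` session and the toy DataFrame are assumptions, not code from this module):
+
+```scala
+import org.apache.spark.ml.feature.StringIndexer
+
+// Toy input; `spark` is an existing SparkSession (assumption).
+val df = spark.createDataFrame(Seq(
+  (0, "a"), (1, "b"), (2, "a"), (3, "c")
+)).toDF("id", "category")
+
+// Estimator: fit() learns the mapping from the data...
+val indexer = new StringIndexer()
+  .setInputCol("category")
+  .setOutputCol("categoryIndex")
+
+// ...and produces a Transformer (StringIndexerModel) that can be reused.
+val indexerModel = indexer.fit(df)
+indexerModel.transform(df).show()
+```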
+ +**Examples:** +- `LogisticRegression`: Produces LogisticRegressionModel +- `DecisionTreeClassifier`: Produces DecisionTreeClassificationModel +- `KMeans`: Produces KMeansModel +- `StringIndexer`: Produces StringIndexerModel + +## ML Algorithms + +### Classification + +**Binary and Multiclass:** +- Logistic Regression +- Decision Tree Classifier +- Random Forest Classifier +- Gradient-Boosted Tree Classifier +- Naive Bayes +- Linear Support Vector Machine + +**Multilabel:** +- OneVsRest + +**Example:** +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val lr = new LogisticRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + +val model = lr.fit(trainingData) +val predictions = model.transform(testData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/classification/` + +### Regression + +- Linear Regression +- Generalized Linear Regression +- Decision Tree Regression +- Random Forest Regression +- Gradient-Boosted Tree Regression +- Survival Regression (AFT) +- Isotonic Regression + +**Example:** +```scala +import org.apache.spark.ml.regression.LinearRegression + +val lr = new LinearRegression() + .setMaxIter(10) + .setRegParam(0.3) + .setElasticNetParam(0.8) + +val model = lr.fit(trainingData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/regression/` + +### Clustering + +- K-means +- Latent Dirichlet Allocation (LDA) +- Bisecting K-means +- Gaussian Mixture Model (GMM) + +**Example:** +```scala +import org.apache.spark.ml.clustering.KMeans + +val kmeans = new KMeans() + .setK(3) + .setSeed(1L) + +val model = kmeans.fit(dataset) +val predictions = model.transform(dataset) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/clustering/` + +### Collaborative Filtering + +Alternating Least Squares (ALS) for recommendation systems. 
+ +**Example:** +```scala +import org.apache.spark.ml.recommendation.ALS + +val als = new ALS() + .setMaxIter(10) + .setRegParam(0.01) + .setUserCol("userId") + .setItemCol("movieId") + .setRatingCol("rating") + +val model = als.fit(ratings) +val predictions = model.transform(testData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/recommendation/` + +## Feature Engineering + +### Feature Extractors + +- `TF-IDF`: Text feature extraction +- `Word2Vec`: Word embeddings +- `CountVectorizer`: Converts text to vectors of token counts + +### Feature Transformers + +- `Tokenizer`: Text tokenization +- `StopWordsRemover`: Removes stop words +- `StringIndexer`: Encodes string labels to indices +- `IndexToString`: Converts indices back to strings +- `OneHotEncoder`: One-hot encoding +- `VectorAssembler`: Combines columns into feature vector +- `StandardScaler`: Standardizes features +- `MinMaxScaler`: Scales features to a range +- `Normalizer`: Normalizes vectors to unit norm +- `Binarizer`: Binarizes based on threshold + +### Feature Selectors + +- `VectorSlicer`: Extracts subset of features +- `RFormula`: R model formula for feature specification +- `ChiSqSelector`: Chi-square feature selection + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/feature/` + +## Model Selection and Tuning + +### Cross-Validation + +```scala +import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder} +import org.apache.spark.ml.evaluation.RegressionEvaluator + +val paramGrid = new ParamGridBuilder() + .addGrid(lr.regParam, Array(0.1, 0.01)) + .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)) + .build() + +val cv = new CrossValidator() + .setEstimator(lr) + .setEvaluator(new RegressionEvaluator()) + .setEstimatorParamMaps(paramGrid) + .setNumFolds(3) + +val cvModel = cv.fit(trainingData) +``` + +### Train-Validation Split + +```scala +import org.apache.spark.ml.tuning.TrainValidationSplit + +val trainValidationSplit = new TrainValidationSplit() + .setEstimator(lr) + .setEvaluator(new RegressionEvaluator()) + .setEstimatorParamMaps(paramGrid) + .setTrainRatio(0.8) + +val model = trainValidationSplit.fit(trainingData) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/tuning/` + +## Evaluation Metrics + +### Classification + +```scala +import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator + +val evaluator = new MulticlassClassificationEvaluator() + .setLabelCol("label") + .setPredictionCol("prediction") + .setMetricName("accuracy") + +val accuracy = evaluator.evaluate(predictions) +``` + +### Regression + +```scala +import org.apache.spark.ml.evaluation.RegressionEvaluator + +val evaluator = new RegressionEvaluator() + .setLabelCol("label") + .setPredictionCol("prediction") + .setMetricName("rmse") + +val rmse = evaluator.evaluate(predictions) +``` + +**Location**: `../sql/core/src/main/scala/org/apache/spark/ml/evaluation/` + +## Linear Algebra + +MLlib provides distributed linear algebra through Breeze. 
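+
+Local types are shown below; as a quick taste of the distributed side (see the distributed matrix types listed further down), a `RowMatrix` wraps an RDD of vectors. A minimal sketch, assuming an existing SparkContext `sc`:
+
+```scala
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+// `sc` is an existing SparkContext (assumption).
+val rows = sc.parallelize(Seq(
+  Vectors.dense(1.0, 2.0, 3.0),
+  Vectors.dense(4.0, 5.0, 6.0),
+  Vectors.dense(7.0, 8.0, 9.0)
+))
+
+// Rows are distributed across the cluster; statistics are computed in parallel.
+val mat = new RowMatrix(rows)
+val colStats = mat.computeColumnSummaryStatistics()
+println(s"size: ${mat.numRows()} x ${mat.numCols()}, column means: ${colStats.mean}")
+```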
+ +**Location**: `src/main/scala/org/apache/spark/mllib/linalg/` + +**Local vectors and matrices:** +```scala +import org.apache.spark.ml.linalg.{Vector, Vectors, Matrix, Matrices} + +// Dense vector +val dv: Vector = Vectors.dense(1.0, 0.0, 3.0) + +// Sparse vector +val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) + +// Dense matrix +val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)) +``` + +**Distributed matrices:** +- `RowMatrix`: Distributed row-oriented matrix +- `IndexedRowMatrix`: Indexed rows +- `CoordinateMatrix`: Coordinate list format +- `BlockMatrix`: Block-partitioned matrix + +## Statistics + +Basic statistics and hypothesis testing. + +**Location**: `src/main/scala/org/apache/spark/mllib/stat/` + +**Examples:** +- Summary statistics +- Correlations +- Stratified sampling +- Hypothesis testing +- Random data generation + +## Building and Testing + +### Build MLlib Module + +```bash +# Build mllib module (RDD-based) +./build/mvn -pl mllib -am package + +# The DataFrame-based ml package is in sql/core +./build/mvn -pl sql/core -am package +``` + +### Run Tests + +```bash +# Run mllib tests +./build/mvn test -pl mllib + +# Run specific test +./build/mvn test -pl mllib -Dtest=LinearRegressionSuite +``` + +## Source Code Organization + +``` +mllib/src/main/ +├── scala/org/apache/spark/mllib/ +│ ├── classification/ # Classification algorithms (RDD-based) +│ ├── clustering/ # Clustering algorithms (RDD-based) +│ ├── evaluation/ # Evaluation metrics (RDD-based) +│ ├── feature/ # Feature engineering (RDD-based) +│ ├── fpm/ # Frequent pattern mining +│ ├── linalg/ # Linear algebra +│ ├── optimization/ # Optimization algorithms +│ ├── recommendation/ # Collaborative filtering (RDD-based) +│ ├── regression/ # Regression algorithms (RDD-based) +│ ├── stat/ # Statistics +│ ├── tree/ # Decision trees (RDD-based) +│ └── util/ # Utilities +└── resources/ +``` + +## Performance Considerations + +### Caching + +Cache datasets that are used multiple times: +```scala +val trainingData = data.cache() +``` + +### Parallelism + +Adjust parallelism for better performance: +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val lr = new LogisticRegression() + .setMaxIter(10) + .setParallelism(4) // Parallel model fitting +``` + +### Data Format + +Use Parquet format for efficient storage and reading: +```scala +df.write.parquet("training_data.parquet") +val data = spark.read.parquet("training_data.parquet") +``` + +### Feature Scaling + +Normalize features for better convergence: +```scala +import org.apache.spark.ml.feature.StandardScaler + +val scaler = new StandardScaler() + .setInputCol("features") + .setOutputCol("scaledFeatures") + .setWithStd(true) + .setWithMean(false) +``` + +## Best Practices + +1. **Use spark.ml**: Prefer DataFrame-based API over RDD-based API +2. **Build pipelines**: Use Pipeline API for reproducible workflows +3. **Cache data**: Cache datasets used in iterative algorithms +4. **Scale features**: Normalize features for better performance +5. **Cross-validate**: Use cross-validation for model selection +6. **Monitor convergence**: Check convergence for iterative algorithms +7. **Save models**: Persist trained models for reuse +8. 
**Use appropriate algorithms**: Choose algorithms based on data characteristics + +## Model Persistence + +Save and load models: + +```scala +// Save model +model.write.overwrite().save("path/to/model") + +// Load model +val loadedModel = PipelineModel.load("path/to/model") +``` + +## Migration Guide + +### From RDD-based API to DataFrame-based API + +**Old (RDD-based):** +```scala +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.regression.LabeledPoint + +val data: RDD[LabeledPoint] = ... +val model = LogisticRegressionWithLBFGS.train(data) +``` + +**New (DataFrame-based):** +```scala +import org.apache.spark.ml.classification.LogisticRegression + +val data: DataFrame = ... +val lr = new LogisticRegression() +val model = lr.fit(data) +``` + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/ml/](../examples/src/main/scala/org/apache/spark/examples/ml/) for complete examples. + +## Further Reading + +- [ML Programming Guide](../docs/ml-guide.md) (DataFrame-based API) +- [MLlib Programming Guide](../docs/mllib-guide.md) (RDD-based API - legacy) +- [ML Pipelines](../docs/ml-pipeline.md) +- [ML Tuning](../docs/ml-tuning.md) +- [Feature Extraction](../docs/ml-features.md) + +## Contributing + +For contributing to MLlib, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +New features should use the DataFrame-based API (`spark.ml`). diff --git a/streaming/README.md b/streaming/README.md new file mode 100644 index 0000000000000..4e16b8f12b11e --- /dev/null +++ b/streaming/README.md @@ -0,0 +1,430 @@ +# Spark Streaming + +Spark Streaming provides scalable, high-throughput, fault-tolerant stream processing of live data streams. + +## Overview + +Spark Streaming supports two APIs: + +1. **DStreams (Discretized Streams)** - Legacy API (Deprecated as of Spark 3.4) +2. **Structured Streaming** - Modern API built on Spark SQL (Recommended) + +**Note**: DStreams are deprecated. For new applications, use **Structured Streaming** which is located in the `sql/core` module. + +## DStreams (Legacy API) + +### What are DStreams? + +DStreams represent a continuous stream of data, internally represented as a sequence of RDDs. + +**Key characteristics:** +- Micro-batch processing model +- Integration with Kafka, Flume, Kinesis, TCP sockets, and more +- Windowing operations for time-based aggregations +- Stateful transformations with updateStateByKey +- Fault tolerance through checkpointing + +### Location + +- Scala/Java: `src/main/scala/org/apache/spark/streaming/` +- Python: `../python/pyspark/streaming/` + +### Basic Example + +```scala +import org.apache.spark.streaming._ +import org.apache.spark.SparkConf + +val conf = new SparkConf().setAppName("NetworkWordCount") +val ssc = new StreamingContext(conf, Seconds(1)) + +// Create DStream from TCP source +val lines = ssc.socketTextStream("localhost", 9999) + +// Process the stream +val words = lines.flatMap(_.split(" ")) +val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) + +// Print results +wordCounts.print() + +// Start the computation +ssc.start() +ssc.awaitTermination() +``` + +### Key Components + +#### StreamingContext + +The main entry point for streaming functionality. 
+ +**File**: `src/main/scala/org/apache/spark/streaming/StreamingContext.scala` + +**Usage:** +```scala +val ssc = new StreamingContext(sparkContext, Seconds(batchInterval)) +// or +val ssc = new StreamingContext(conf, Seconds(batchInterval)) +``` + +#### DStream + +The fundamental abstraction for a continuous data stream. + +**File**: `src/main/scala/org/apache/spark/streaming/dstream/DStream.scala` + +**Operations:** +- **Transformations**: map, flatMap, filter, reduce, join, window +- **Output Operations**: print, saveAsTextFiles, foreachRDD + +#### Input Sources + +**Built-in sources:** +- `socketTextStream`: TCP socket source +- `textFileStream`: File system monitoring +- `queueStream`: Queue-based testing source + +**Advanced sources** (require external libraries): +- Kafka: `KafkaUtils.createStream` +- Flume: `FlumeUtils.createStream` +- Kinesis: `KinesisUtils.createStream` + +**Location**: `src/main/scala/org/apache/spark/streaming/dstream/` + +### Windowing Operations + +Process data over sliding windows: + +```scala +val windowedStream = lines + .window(Seconds(30), Seconds(10)) // 30s window, 10s slide + +val windowedWordCounts = words + .map(x => (x, 1)) + .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) +``` + +### Stateful Operations + +Maintain state across batches: + +```scala +def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = { + val newCount = runningCount.getOrElse(0) + newValues.sum + Some(newCount) +} + +val runningCounts = pairs.updateStateByKey(updateFunction) +``` + +### Checkpointing + +Essential for stateful operations and fault tolerance: + +```scala +ssc.checkpoint("hdfs://checkpoint/directory") +``` + +**What gets checkpointed:** +- Configuration +- DStream operations +- Incomplete batches +- State data (for stateful operations) + +### Performance Tuning + +**Batch Interval** +- Set based on processing time and latency requirements +- Too small: overhead increases +- Too large: latency increases + +**Parallelism** +```scala +// Increase receiver parallelism +val numStreams = 5 +val streams = (1 to numStreams).map(_ => ssc.socketTextStream(...)) +val unifiedStream = ssc.union(streams) + +// Repartition for processing +val repartitioned = dstream.repartition(10) +``` + +**Memory Management** +```scala +conf.set("spark.streaming.receiver.maxRate", "10000") +conf.set("spark.streaming.kafka.maxRatePerPartition", "1000") +``` + +## Structured Streaming (Recommended) + +For new applications, use Structured Streaming instead of DStreams. 
+ +**Location**: `../sql/core/src/main/scala/org/apache/spark/sql/streaming/` + +**Example:** +```scala +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.streaming._ + +val spark = SparkSession.builder() + .appName("StructuredNetworkWordCount") + .getOrCreate() + +import spark.implicits._ + +// Create DataFrame from stream source +val lines = spark + .readStream + .format("socket") + .option("host", "localhost") + .option("port", 9999) + .load() + +// Process the stream +val words = lines.as[String].flatMap(_.split(" ")) +val wordCounts = words.groupBy("value").count() + +// Output the stream +val query = wordCounts + .writeStream + .outputMode("complete") + .format("console") + .start() + +query.awaitTermination() +``` + +**Advantages over DStreams:** +- Unified API with batch processing +- Better performance with Catalyst optimizer +- Exactly-once semantics +- Event time processing +- Watermarking for late data +- Easier to reason about + +See [Structured Streaming Guide](../docs/structured-streaming-programming-guide.md) for details. + +## Building and Testing + +### Build Streaming Module + +```bash +# Build streaming module +./build/mvn -pl streaming -am package + +# Skip tests +./build/mvn -pl streaming -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all streaming tests +./build/mvn test -pl streaming + +# Run specific test suite +./build/mvn test -pl streaming -Dtest=BasicOperationsSuite +``` + +## Source Code Organization + +``` +streaming/src/main/ +├── scala/org/apache/spark/streaming/ +│ ├── StreamingContext.scala # Main entry point +│ ├── Time.scala # Time utilities +│ ├── Checkpoint.scala # Checkpointing +│ ├── dstream/ +│ │ ├── DStream.scala # Base DStream +│ │ ├── InputDStream.scala # Input sources +│ │ ├── ReceiverInputDStream.scala # Receiver-based input +│ │ ├── WindowedDStream.scala # Windowing operations +│ │ ├── StateDStream.scala # Stateful operations +│ │ └── PairDStreamFunctions.scala # Key-value operations +│ ├── receiver/ +│ │ ├── Receiver.scala # Base receiver class +│ │ ├── ReceiverSupervisor.scala # Receiver management +│ │ └── BlockGenerator.scala # Block generation +│ ├── scheduler/ +│ │ ├── JobScheduler.scala # Job scheduling +│ │ ├── JobGenerator.scala # Job generation +│ │ └── ReceiverTracker.scala # Receiver tracking +│ └── ui/ +│ └── StreamingTab.scala # Web UI +└── resources/ +``` + +## Integration with External Systems + +### Apache Kafka + +**Deprecated DStreams approach:** +```scala +import org.apache.spark.streaming.kafka010._ + +val kafkaParams = Map[String, Object]( + "bootstrap.servers" -> "localhost:9092", + "key.deserializer" -> classOf[StringDeserializer], + "value.deserializer" -> classOf[StringDeserializer], + "group.id" -> "test-group" +) + +val stream = KafkaUtils.createDirectStream[String, String]( + ssc, + PreferConsistent, + Subscribe[String, String](topics, kafkaParams) +) +``` + +**Recommended Structured Streaming approach:** +```scala +val df = spark + .readStream + .format("kafka") + .option("kafka.bootstrap.servers", "localhost:9092") + .option("subscribe", "topic1") + .load() +``` + +See [Kafka Integration Guide](../docs/streaming-kafka-integration.md). 
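+
+As a small continuation of the Structured Streaming snippet above (a sketch only; `df` is the DataFrame read from Kafka there, and the console sink is just for illustration): Kafka rows carry binary `key` and `value` columns, so a typical first step is casting `value` to a string before applying ordinary DataFrame operations.
+
+```scala
+// `df` is the streaming DataFrame read from Kafka in the snippet above.
+// Kafka delivers key/value as binary, so cast before processing.
+val messages = df.selectExpr("CAST(value AS STRING) AS message")
+
+// A simple streaming aggregation over the decoded messages.
+val counts = messages.groupBy("message").count()
+
+val query = counts.writeStream
+  .outputMode("complete")   // aggregations use complete or update mode
+  .format("console")
+  .start()
+
+query.awaitTermination()
+```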
+ +### Amazon Kinesis + +```scala +import org.apache.spark.streaming.kinesis._ + +val stream = KinesisInputDStream.builder + .streamingContext(ssc) + .endpointUrl("https://kinesis.us-east-1.amazonaws.com") + .regionName("us-east-1") + .streamName("myStream") + .build() +``` + +See [Kinesis Integration Guide](../docs/streaming-kinesis-integration.md). + +## Monitoring and Debugging + +### Streaming UI + +Access at: `http://:4040/streaming/` + +**Metrics:** +- Batch processing times +- Input rates +- Scheduling delays +- Active batches + +### Logs + +Enable detailed logging: +```properties +log4j.logger.org.apache.spark.streaming=DEBUG +``` + +### Metrics + +Key metrics to monitor: +- **Batch Processing Time**: Should be < batch interval +- **Scheduling Delay**: Should be minimal +- **Total Delay**: End-to-end delay +- **Input Rate**: Records per second + +## Common Issues + +### Batch Processing Time > Batch Interval + +**Symptoms**: Scheduling delay increases over time + +**Solutions:** +- Increase parallelism +- Optimize transformations +- Increase resources (executors, memory) +- Reduce batch interval data volume + +### Out of Memory Errors + +**Solutions:** +- Increase executor memory +- Enable compression +- Reduce window/batch size +- Persist less data + +### Receiver Failures + +**Solutions:** +- Enable WAL (Write-Ahead Logs) +- Increase receiver memory +- Add multiple receivers +- Use Structured Streaming with better fault tolerance + +## Migration from DStreams to Structured Streaming + +**Why migrate:** +- DStreams are deprecated +- Better performance and semantics +- Unified API with batch processing +- Active development and support + +**Key differences:** +- DataFrame/Dataset API instead of RDDs +- Declarative operations +- Built-in support for event time +- Exactly-once semantics by default + +**Migration guide**: See [Structured Streaming Migration Guide](../docs/ss-migration-guide.md) + +## Examples + +See [examples/src/main/scala/org/apache/spark/examples/streaming/](../examples/src/main/scala/org/apache/spark/examples/streaming/) for more examples. + +**Key examples:** +- `NetworkWordCount.scala`: Basic word count +- `StatefulNetworkWordCount.scala`: Stateful processing +- `WindowedNetworkWordCount.scala`: Window operations +- `KafkaWordCount.scala`: Kafka integration + +## Configuration + +Key configuration parameters: + +```properties +# Batch interval (set in code) +# StreamingContext(conf, Seconds(batchInterval)) + +# Backpressure (rate limiting) +spark.streaming.backpressure.enabled=true + +# Receiver memory +spark.streaming.receiver.maxRate=10000 + +# Checkpoint interval +spark.streaming.checkpoint.interval=10s + +# Graceful shutdown +spark.streaming.stopGracefullyOnShutdown=true +``` + +## Best Practices + +1. **Use Structured Streaming for new applications** +2. **Set appropriate batch intervals** based on latency requirements +3. **Enable checkpointing** for stateful operations +4. **Monitor batch processing times** to ensure they're less than batch interval +5. **Use backpressure** to handle variable input rates +6. **Test failure scenarios** with checkpointing +7. 
**Consider using Kafka** for reliable message delivery + +## Further Reading + +- [Structured Streaming Programming Guide](../docs/structured-streaming-programming-guide.md) (Recommended) +- [DStreams Programming Guide](../docs/streaming-programming-guide.md) (Legacy) +- [Kafka Integration](../docs/streaming-kafka-integration.md) +- [Kinesis Integration](../docs/streaming-kinesis-integration.md) + +## Contributing + +For contributing to Spark Streaming, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +Note: New features should focus on Structured Streaming rather than DStreams. From 6293b49cde89c561316e97515f21e9f380d051cc Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:26:03 +0000 Subject: [PATCH 4/6] Add README files for launcher, resource-managers, and sbin directories Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- launcher/README.md | 475 +++++++++++++++++++++++++++++++++ resource-managers/README.md | 514 ++++++++++++++++++++++++++++++++++++ sbin/README.md | 514 ++++++++++++++++++++++++++++++++++++ 3 files changed, 1503 insertions(+) create mode 100644 launcher/README.md create mode 100644 resource-managers/README.md create mode 100644 sbin/README.md diff --git a/launcher/README.md b/launcher/README.md new file mode 100644 index 0000000000000..7b49bcba4cfae --- /dev/null +++ b/launcher/README.md @@ -0,0 +1,475 @@ +# Spark Launcher + +The Spark Launcher library provides a programmatic interface for launching Spark applications. + +## Overview + +The Launcher module allows you to: +- Launch Spark applications programmatically from Java/Scala code +- Monitor application state and output +- Manage Spark processes +- Build command-line arguments programmatically + +This is an alternative to invoking `spark-submit` via shell commands. + +## Key Components + +### SparkLauncher + +The main class for launching Spark applications. + +**Location**: `src/main/java/org/apache/spark/launcher/SparkLauncher.java` + +**Basic Usage:** +```java +import org.apache.spark.launcher.SparkLauncher; + +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.MyApp") + .setMaster("spark://master:7077") + .setConf(SparkLauncher.DRIVER_MEMORY, "2g") + .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") + .addAppArgs("arg1", "arg2"); + +Process spark = launcher.launch(); +spark.waitFor(); +``` + +### SparkAppHandle + +Interface for monitoring launched applications. 
+ +**Location**: `src/main/java/org/apache/spark/launcher/SparkAppHandle.java` + +**Usage:** +```java +import org.apache.spark.launcher.SparkAppHandle; + +SparkAppHandle handle = launcher.startApplication(); + +// Add listener for state changes +handle.addListener(new SparkAppHandle.Listener() { + @Override + public void stateChanged(SparkAppHandle handle) { + System.out.println("State: " + handle.getState()); + } + + @Override + public void infoChanged(SparkAppHandle handle) { + System.out.println("App ID: " + handle.getAppId()); + } +}); + +// Wait for completion +while (!handle.getState().isFinal()) { + Thread.sleep(1000); +} +``` + +## API Reference + +### Configuration Methods + +```java +SparkLauncher launcher = new SparkLauncher(); + +// Application settings +launcher.setAppResource("/path/to/app.jar"); +launcher.setMainClass("com.example.MainClass"); +launcher.setAppName("MyApplication"); + +// Cluster settings +launcher.setMaster("spark://master:7077"); +launcher.setDeployMode("cluster"); + +// Resource settings +launcher.setConf(SparkLauncher.DRIVER_MEMORY, "2g"); +launcher.setConf(SparkLauncher.EXECUTOR_MEMORY, "4g"); +launcher.setConf(SparkLauncher.EXECUTOR_CORES, "2"); + +// Additional configurations +launcher.setConf("spark.executor.instances", "5"); +launcher.setConf("spark.sql.shuffle.partitions", "200"); + +// Dependencies +launcher.addJar("/path/to/dependency.jar"); +launcher.addFile("/path/to/file.txt"); +launcher.addPyFile("/path/to/module.py"); + +// Application arguments +launcher.addAppArgs("arg1", "arg2", "arg3"); + +// Environment +launcher.setSparkHome("/path/to/spark"); +launcher.setPropertiesFile("/path/to/spark-defaults.conf"); +launcher.setVerbose(true); +``` + +### Launch Methods + +```java +// Launch and return Process handle +Process process = launcher.launch(); + +// Launch and return SparkAppHandle for monitoring +SparkAppHandle handle = launcher.startApplication(); + +// For child process mode (rare) +SparkAppHandle handle = launcher.startApplication( + new SparkAppHandle.Listener() { + // Listener implementation + } +); +``` + +### Constants + +Common configuration keys are available as constants: + +```java +SparkLauncher.SPARK_MASTER // "spark.master" +SparkLauncher.APP_RESOURCE // "spark.app.resource" +SparkLauncher.APP_NAME // "spark.app.name" +SparkLauncher.DRIVER_MEMORY // "spark.driver.memory" +SparkLauncher.DRIVER_EXTRA_CLASSPATH // "spark.driver.extraClassPath" +SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS // "spark.driver.extraJavaOptions" +SparkLauncher.DRIVER_EXTRA_LIBRARY_PATH // "spark.driver.extraLibraryPath" +SparkLauncher.EXECUTOR_MEMORY // "spark.executor.memory" +SparkLauncher.EXECUTOR_CORES // "spark.executor.cores" +SparkLauncher.EXECUTOR_EXTRA_CLASSPATH // "spark.executor.extraClassPath" +SparkLauncher.EXECUTOR_EXTRA_JAVA_OPTIONS // "spark.executor.extraJavaOptions" +SparkLauncher.EXECUTOR_EXTRA_LIBRARY_PATH // "spark.executor.extraLibraryPath" +``` + +## Application States + +The `SparkAppHandle.State` enum represents application lifecycle states: + +- `UNKNOWN`: Initial state +- `CONNECTED`: Connected to Spark +- `SUBMITTED`: Application submitted +- `RUNNING`: Application running +- `FINISHED`: Completed successfully +- `FAILED`: Failed with error +- `KILLED`: Killed by user +- `LOST`: Connection lost + +**Check if final:** +```java +if (handle.getState().isFinal()) { + // Application has completed +} +``` + +## Examples + +### Launch Scala Application + +```java +import org.apache.spark.launcher.SparkLauncher; + +public class 
LaunchSparkApp { + public static void main(String[] args) throws Exception { + Process spark = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.SparkApp") + .setMaster("local[2]") + .setConf(SparkLauncher.DRIVER_MEMORY, "2g") + .launch(); + + spark.waitFor(); + System.exit(spark.exitValue()); + } +} +``` + +### Launch Python Application + +```java +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.py") + .setMaster("yarn") + .setDeployMode("cluster") + .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") + .addPyFile("/path/to/dependency.py") + .addAppArgs("--input", "/data/input", "--output", "/data/output"); + +SparkAppHandle handle = launcher.startApplication(); +``` + +### Monitor Application with Listener + +```java +import org.apache.spark.launcher.SparkAppHandle; + +class MyListener implements SparkAppHandle.Listener { + @Override + public void stateChanged(SparkAppHandle handle) { + SparkAppHandle.State state = handle.getState(); + System.out.println("Application state changed to: " + state); + + if (state.isFinal()) { + if (state == SparkAppHandle.State.FINISHED) { + System.out.println("Application completed successfully"); + } else { + System.out.println("Application failed: " + state); + } + } + } + + @Override + public void infoChanged(SparkAppHandle handle) { + System.out.println("Application ID: " + handle.getAppId()); + } +} + +// Use the listener +SparkAppHandle handle = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("spark://master:7077") + .startApplication(new MyListener()); +``` + +### Capture Output + +```java +import java.io.*; + +Process spark = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("local") + .redirectOutput(ProcessBuilder.Redirect.PIPE) + .redirectError(ProcessBuilder.Redirect.PIPE) + .launch(); + +// Read output +BufferedReader reader = new BufferedReader( + new InputStreamReader(spark.getInputStream()) +); +String line; +while ((line = reader.readLine()) != null) { + System.out.println(line); +} + +spark.waitFor(); +``` + +### Kill Running Application + +```java +SparkAppHandle handle = launcher.startApplication(); + +// Later, kill the application +handle.kill(); + +// Or stop gracefully +handle.stop(); +``` + +## In-Process Launcher + +For testing or special cases, launch Spark in the same JVM: + +```java +import org.apache.spark.launcher.InProcessLauncher; + +InProcessLauncher launcher = new InProcessLauncher(); +// Configure launcher... +SparkAppHandle handle = launcher.startApplication(); +``` + +**Note**: This is primarily for testing. Production code should use `SparkLauncher`. 
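+
+Since the launcher is a plain Java library, it can also be driven from Scala. A minimal sketch (the application path, main class, and master URL are placeholders, not values from this repository):
+
+```scala
+import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
+
+object LaunchFromScala {
+  def main(args: Array[String]): Unit = {
+    // All paths and names below are placeholders (assumptions).
+    val handle: SparkAppHandle = new SparkLauncher()
+      .setAppResource("/path/to/app.jar")
+      .setMainClass("com.example.MyApp")
+      .setMaster("local[2]")
+      .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
+      .startApplication()
+
+    // Poll until the application reaches a final state (FINISHED, FAILED, KILLED).
+    while (!handle.getState.isFinal) {
+      Thread.sleep(1000)
+    }
+    println(s"Final state: ${handle.getState}")
+  }
+}
+```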
+ +## Building and Testing + +### Build Launcher Module + +```bash +# Build launcher module +./build/mvn -pl launcher -am package + +# Skip tests +./build/mvn -pl launcher -am -DskipTests package +``` + +### Run Tests + +```bash +# Run all launcher tests +./build/mvn test -pl launcher + +# Run specific test +./build/mvn test -pl launcher -Dtest=SparkLauncherSuite +``` + +## Source Code Organization + +``` +launcher/src/main/java/org/apache/spark/launcher/ +├── SparkLauncher.java # Main launcher class +├── SparkAppHandle.java # Application handle interface +├── AbstractLauncher.java # Base launcher implementation +├── InProcessLauncher.java # In-process launcher (testing) +├── Main.java # Entry point for spark-submit +├── SparkSubmitCommandBuilder.java # Builds spark-submit commands +├── CommandBuilderUtils.java # Command building utilities +└── LauncherBackend.java # Backend communication +``` + +## Integration with spark-submit + +The Launcher library is used internally by `spark-submit`: + +``` +spark-submit script + ↓ +Main.main() + ↓ +SparkSubmitCommandBuilder + ↓ +Launch JVM with SparkSubmit +``` + +## Configuration Priority + +Configuration values are resolved in this order (highest priority first): + +1. Values set via `setConf()` or specific setters +2. Properties file specified with `setPropertiesFile()` +3. `conf/spark-defaults.conf` in `SPARK_HOME` +4. Environment variables + +## Environment Variables + +The launcher respects these environment variables: + +- `SPARK_HOME`: Spark installation directory +- `JAVA_HOME`: Java installation directory +- `SPARK_CONF_DIR`: Configuration directory +- `HADOOP_CONF_DIR`: Hadoop configuration directory +- `YARN_CONF_DIR`: YARN configuration directory + +## Security Considerations + +When launching applications programmatically: + +1. **Validate inputs**: Sanitize application arguments +2. **Secure credentials**: Don't hardcode secrets +3. **Limit permissions**: Run with minimal required privileges +4. **Monitor processes**: Track launched applications +5. 
**Clean up resources**: Always close handles and processes + +## Common Use Cases + +### Workflow Orchestration + +Launch Spark jobs as part of data pipelines: + +```java +public class DataPipeline { + public void runStage(String stageName, String mainClass) throws Exception { + SparkAppHandle handle = new SparkLauncher() + .setAppResource("/path/to/pipeline.jar") + .setMainClass(mainClass) + .setMaster("yarn") + .setAppName("Pipeline-" + stageName) + .startApplication(); + + // Wait for completion + while (!handle.getState().isFinal()) { + Thread.sleep(1000); + } + + if (handle.getState() != SparkAppHandle.State.FINISHED) { + throw new RuntimeException("Stage " + stageName + " failed"); + } + } +} +``` + +### Testing + +Launch Spark applications in integration tests: + +```java +@Test +public void testSparkApp() throws Exception { + SparkAppHandle handle = new SparkLauncher() + .setAppResource("target/test-app.jar") + .setMainClass("com.example.TestApp") + .setMaster("local[2]") + .startApplication(); + + // Wait for completion + handle.waitFor(60000); // 60 second timeout + + assertEquals(SparkAppHandle.State.FINISHED, handle.getState()); +} +``` + +### Resource Management + +Launch applications with dynamic resource allocation: + +```java +int executors = calculateRequiredExecutors(dataSize); +String memory = calculateMemory(dataSize); + +SparkLauncher launcher = new SparkLauncher() + .setAppResource("/path/to/app.jar") + .setMainClass("com.example.App") + .setMaster("yarn") + .setConf("spark.executor.instances", String.valueOf(executors)) + .setConf(SparkLauncher.EXECUTOR_MEMORY, memory) + .setConf("spark.dynamicAllocation.enabled", "true"); +``` + +## Best Practices + +1. **Use SparkAppHandle**: Monitor application state +2. **Add listeners**: Track state changes and failures +3. **Set timeouts**: Don't wait indefinitely +4. **Handle errors**: Check exit codes and states +5. **Clean up**: Stop handles and processes +6. **Log everything**: Record launches and outcomes +7. **Use constants**: Use SparkLauncher constants for config keys + +## Troubleshooting + +### Application Not Starting + +**Check:** +- SPARK_HOME is set correctly +- Application JAR path is correct +- Master URL is valid +- Required resources are available + +### Process Hangs + +**Solutions:** +- Add timeout: `handle.waitFor(timeout)` +- Check for deadlocks in application +- Verify cluster has capacity +- Check logs for issues + +### Cannot Monitor Application + +**Solutions:** +- Use `startApplication()` instead of `launch()` +- Add listener before starting +- Check for connection issues +- Verify cluster is accessible + +## Further Reading + +- [Submitting Applications](../docs/submitting-applications.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) + +## API Documentation + +Full JavaDoc available in the built JAR or online at: +https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/package-summary.html diff --git a/resource-managers/README.md b/resource-managers/README.md new file mode 100644 index 0000000000000..f87ed0f06ad98 --- /dev/null +++ b/resource-managers/README.md @@ -0,0 +1,514 @@ +# Spark Resource Managers + +This directory contains integrations with various cluster resource managers. 
+ +## Overview + +Spark can run on different cluster managers: +- **YARN** (Hadoop YARN) +- **Kubernetes** (Container orchestration) +- **Mesos** (General-purpose cluster manager) +- **Standalone** (Spark's built-in cluster manager) + +Each integration provides Spark-specific implementation for: +- Resource allocation +- Task scheduling +- Application lifecycle management +- Security integration + +## Modules + +### kubernetes/ + +Integration with Kubernetes for container-based deployments. + +**Location**: `kubernetes/` + +**Key Features:** +- Native Kubernetes resource management +- Dynamic executor allocation +- Volume mounting support +- Kerberos integration +- Custom resource definitions + +**Running on Kubernetes:** +```bash +./bin/spark-submit \ + --master k8s://https://: \ + --deploy-mode cluster \ + --name spark-pi \ + --class org.apache.spark.examples.SparkPi \ + --conf spark.executor.instances=2 \ + --conf spark.kubernetes.container.image=spark:3.5.0 \ + local:///opt/spark/examples/jars/spark-examples.jar +``` + +**Documentation**: See [running-on-kubernetes.md](../docs/running-on-kubernetes.md) + +### mesos/ + +Integration with Apache Mesos cluster manager. + +**Location**: `mesos/` + +**Key Features:** +- Fine-grained mode (one task per Mesos task) +- Coarse-grained mode (dedicated executors) +- Dynamic allocation +- Mesos frameworks integration + +**Running on Mesos:** +```bash +./bin/spark-submit \ + --master mesos://mesos-master:5050 \ + --deploy-mode cluster \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Documentation**: Check Apache Mesos documentation + +### yarn/ + +Integration with Hadoop YARN (Yet Another Resource Negotiator). + +**Location**: `yarn/` + +**Key Features:** +- Client and cluster deploy modes +- Dynamic resource allocation +- YARN container management +- Security integration (Kerberos) +- External shuffle service +- Application timeline service integration + +**Running on YARN:** +```bash +# Client mode (driver runs locally) +./bin/spark-submit \ + --master yarn \ + --deploy-mode client \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar + +# Cluster mode (driver runs on YARN) +./bin/spark-submit \ + --master yarn \ + --deploy-mode cluster \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Documentation**: See [running-on-yarn.md](../docs/running-on-yarn.md) + +## Comparison + +### YARN + +**Best for:** +- Existing Hadoop deployments +- Enterprise environments with Hadoop ecosystem +- Multi-tenancy with resource queues +- Organizations standardized on YARN + +**Pros:** +- Mature and stable +- Rich security features +- Queue-based resource management +- Good tooling and monitoring + +**Cons:** +- Requires Hadoop installation +- More complex setup +- Higher overhead + +### Kubernetes + +**Best for:** +- Cloud-native deployments +- Containerized applications +- Modern microservices architectures +- Multi-cloud environments + +**Pros:** +- Container isolation +- Modern orchestration features +- Cloud provider integration +- Active development community + +**Cons:** +- Newer integration (less mature) +- Requires Kubernetes cluster +- Learning curve for K8s + +### Mesos + +**Best for:** +- General-purpose cluster management +- Mixed workload environments (not just Spark) +- Large-scale deployments + +**Pros:** +- Fine-grained resource allocation +- Flexible framework support +- Good for mixed workloads + +**Cons:** +- Less common than YARN/K8s +- Setup complexity +- Smaller community 
+ +### Standalone + +**Best for:** +- Quick start and development +- Small clusters +- Dedicated Spark clusters + +**Pros:** +- Simple setup +- No dependencies +- Fast deployment + +**Cons:** +- Limited resource management +- No multi-tenancy +- Basic scheduling + +## Architecture + +### Resource Manager Integration + +``` +Spark Application + ↓ +SparkContext + ↓ +Cluster Manager Client + ↓ +Resource Manager (YARN/K8s/Mesos) + ↓ +Container/Pod/Task Launch + ↓ +Executor Processes +``` + +### Common Components + +Each integration implements: + +1. **SchedulerBackend**: Launches executors and schedules tasks +2. **ApplicationMaster/Driver**: Manages application lifecycle +3. **ExecutorBackend**: Runs tasks on executors +4. **Resource Allocation**: Requests and manages resources +5. **Security Integration**: Authentication and authorization + +## Building + +### Build All Resource Manager Modules + +```bash +# Build all resource manager integrations +./build/mvn -pl 'resource-managers/*' -am package +``` + +### Build Specific Modules + +```bash +# YARN only +./build/mvn -pl resource-managers/yarn -am package + +# Kubernetes only +./build/mvn -pl resource-managers/kubernetes/core -am package + +# Mesos only +./build/mvn -pl resource-managers/mesos -am package +``` + +### Build with Specific Profiles + +```bash +# Build with Kubernetes support +./build/mvn -Pkubernetes package + +# Build with YARN support +./build/mvn -Pyarn package + +# Build with Mesos support (requires Mesos libraries) +./build/mvn -Pmesos package +``` + +## Configuration + +### YARN Configuration + +**Key settings:** +```properties +# Resource allocation +spark.executor.instances=10 +spark.executor.memory=4g +spark.executor.cores=2 + +# YARN specific +spark.yarn.am.memory=1g +spark.yarn.am.cores=1 +spark.yarn.queue=default +spark.yarn.jars=hdfs:///spark-jars/* + +# Dynamic allocation +spark.dynamicAllocation.enabled=true +spark.dynamicAllocation.minExecutors=1 +spark.dynamicAllocation.maxExecutors=100 +spark.shuffle.service.enabled=true +``` + +### Kubernetes Configuration + +**Key settings:** +```properties +# Container image +spark.kubernetes.container.image=my-spark:latest +spark.kubernetes.container.image.pullPolicy=Always + +# Resource allocation +spark.executor.instances=5 +spark.kubernetes.executor.request.cores=1 +spark.kubernetes.executor.limit.cores=2 +spark.kubernetes.executor.request.memory=4g + +# Namespace and service account +spark.kubernetes.namespace=spark +spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa + +# Volumes +spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-pvc +spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data +``` + +### Mesos Configuration + +**Key settings:** +```properties +# Mesos master +spark.mesos.coarse=true +spark.executor.uri=hdfs://path/to/spark.tgz + +# Resource allocation +spark.executor.memory=4g +spark.cores.max=20 + +# Mesos specific +spark.mesos.role=spark +spark.mesos.constraints=rack:us-east +``` + +## Source Code Organization + +``` +resource-managers/ +├── kubernetes/ +│ ├── core/ # Core K8s integration +│ │ └── src/main/scala/org/apache/spark/ +│ │ ├── deploy/k8s/ # Deployment logic +│ │ ├── scheduler/ # K8s scheduler backend +│ │ └── executor/ # K8s executor backend +│ └── integration-tests/ # K8s integration tests +├── mesos/ +│ └── src/main/scala/org/apache/spark/ +│ ├── scheduler/ # Mesos scheduler +│ └── executor/ # Mesos executor +└── yarn/ + └── src/main/scala/org/apache/spark/ + ├── 
deploy/yarn/ # YARN deployment + ├── scheduler/ # YARN scheduler + └── executor/ # YARN executor +``` + +## Development + +### Testing Resource Manager Integrations + +```bash +# Run YARN tests +./build/mvn test -pl resource-managers/yarn + +# Run Kubernetes tests +./build/mvn test -pl resource-managers/kubernetes/core + +# Run Mesos tests +./build/mvn test -pl resource-managers/mesos +``` + +### Integration Tests + +**Kubernetes:** +```bash +cd resource-managers/kubernetes/integration-tests +./dev/dev-run-integration-tests.sh +``` + +See `kubernetes/integration-tests/README.md` for details. + +## Security + +### YARN Security + +**Kerberos authentication:** +```bash +./bin/spark-submit \ + --master yarn \ + --principal user@REALM \ + --keytab /path/to/user.keytab \ + --class org.apache.spark.examples.SparkPi \ + spark-examples.jar +``` + +**Token renewal:** +```properties +spark.yarn.principal=user@REALM +spark.yarn.keytab=/path/to/keytab +spark.yarn.token.renewal.interval=86400 +``` + +### Kubernetes Security + +**Service account:** +```properties +spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa +spark.kubernetes.authenticate.executor.serviceAccountName=spark-sa +``` + +**Secrets:** +```bash +kubectl create secret generic spark-secret --from-literal=password=mypassword +``` + +```properties +spark.kubernetes.driver.secrets.spark-secret=/etc/secrets +``` + +### Mesos Security + +**Authentication:** +```properties +spark.mesos.principal=spark-user +spark.mesos.secret=spark-secret +``` + +## Migration Guide + +### Moving from Standalone to YARN + +1. Set up Hadoop cluster +2. Configure YARN resource manager +3. Enable external shuffle service +4. Update spark-submit commands to use `--master yarn` +5. Test dynamic allocation + +### Moving from YARN to Kubernetes + +1. Build Docker image with Spark +2. Push image to container registry +3. Create Kubernetes namespace and service account +4. Update spark-submit to use `--master k8s://` +5. Configure volume mounts for data access + +## Troubleshooting + +### YARN Issues + +**Application stuck in ACCEPTED state:** +- Check YARN capacity +- Verify queue settings +- Check resource availability + +**Container allocation failures:** +- Increase memory overhead +- Check node resources +- Verify memory/core requests + +### Kubernetes Issues + +**Image pull failures:** +- Verify image name and tag +- Check image pull secrets +- Ensure registry is accessible + +**Pod failures:** +- Check pod logs: `kubectl logs ` +- Verify service account permissions +- Check resource limits + +### Mesos Issues + +**Framework registration failures:** +- Verify Mesos master URL +- Check authentication settings +- Ensure proper role configuration + +## Best Practices + +1. **Choose the right manager**: Based on infrastructure and requirements +2. **Enable dynamic allocation**: For better resource utilization +3. **Use external shuffle service**: For executor failure tolerance +4. **Configure memory overhead**: Account for non-heap memory +5. **Monitor resource usage**: Track executor and driver metrics +6. **Use appropriate deploy mode**: Client for interactive, cluster for production +7. **Implement security**: Enable authentication and encryption +8. 
**Test failure scenarios**: Verify fault tolerance + +## Performance Tuning + +### YARN Performance + +```properties +# Memory overhead +spark.yarn.executor.memoryOverhead=512m + +# Locality wait +spark.locality.wait=3s + +# Container reuse +spark.yarn.executor.launch.parallelism=10 +``` + +### Kubernetes Performance + +```properties +# Resource limits +spark.kubernetes.executor.limit.cores=2 + +# Volume performance +spark.kubernetes.driver.volumes.emptyDir.cache.medium=Memory + +# Network optimization +spark.kubernetes.executor.podNamePrefix=spark-exec +``` + +### Mesos Performance + +```properties +# Fine-grained mode for better sharing +spark.mesos.coarse=false + +# Container timeout +spark.mesos.executor.docker.pullTimeout=600 +``` + +## Further Reading + +- [Running on YARN](../docs/running-on-yarn.md) +- [Running on Kubernetes](../docs/running-on-kubernetes.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) + +## Contributing + +For contributing to resource manager integrations, see [CONTRIBUTING.md](../CONTRIBUTING.md). + +When adding features: +- Ensure cross-compatibility +- Add comprehensive tests +- Update documentation +- Consider security implications diff --git a/sbin/README.md b/sbin/README.md new file mode 100644 index 0000000000000..dbe86cc1a8aa4 --- /dev/null +++ b/sbin/README.md @@ -0,0 +1,514 @@ +# Spark Admin Scripts + +This directory contains administrative scripts for managing Spark standalone clusters. + +## Overview + +The `sbin/` scripts are used by cluster administrators to: +- Start and stop Spark standalone clusters +- Start and stop individual daemons (master, workers, history server) +- Manage cluster lifecycle +- Configure cluster nodes + +**Note**: These scripts are for **Spark Standalone** cluster mode only. For YARN, Kubernetes, or Mesos, use their respective cluster management tools. + +## Cluster Management Scripts + +### start-all.sh / stop-all.sh + +Start or stop all Spark daemons on the cluster. + +**Usage:** +```bash +# Start master and all workers +./sbin/start-all.sh + +# Stop all daemons +./sbin/stop-all.sh +``` + +**What they do:** +- `start-all.sh`: Starts master on the current machine and workers on machines listed in `conf/workers` +- `stop-all.sh`: Stops all master and worker daemons + +**Prerequisites:** +- SSH key-based authentication configured +- `conf/workers` file with worker hostnames +- Spark installed at same location on all machines + +**Configuration files:** +- `conf/workers`: List of worker hostnames (one per line) +- `conf/spark-env.sh`: Environment variables + +### start-master.sh / stop-master.sh + +Start or stop the Spark master daemon on the current machine. + +**Usage:** +```bash +# Start master +./sbin/start-master.sh + +# Stop master +./sbin/stop-master.sh +``` + +**Master Web UI**: Access at `http://:8080/` + +**Configuration:** +```bash +# In conf/spark-env.sh +export SPARK_MASTER_HOST=master-hostname +export SPARK_MASTER_PORT=7077 +export SPARK_MASTER_WEBUI_PORT=8080 +``` + +### start-worker.sh / stop-worker.sh + +Start or stop a Spark worker daemon on the current machine. 
+ +**Usage:** +```bash +# Start worker connecting to master +./sbin/start-worker.sh spark://master:7077 + +# Stop worker +./sbin/stop-worker.sh +``` + +**Worker Web UI**: Access at `http://:8081/` + +**Configuration:** +```bash +# In conf/spark-env.sh +export SPARK_WORKER_CORES=8 # Number of cores to use +export SPARK_WORKER_MEMORY=16g # Memory to allocate +export SPARK_WORKER_PORT=7078 # Worker port +export SPARK_WORKER_WEBUI_PORT=8081 +export SPARK_WORKER_DIR=/var/spark/work # Work directory +``` + +### start-workers.sh / stop-workers.sh + +Start or stop workers on all machines listed in `conf/workers`. + +**Usage:** +```bash +# Start all workers +./sbin/start-workers.sh spark://master:7077 + +# Stop all workers +./sbin/stop-workers.sh +``` + +**Requirements:** +- `conf/workers` file configured +- SSH access to all worker machines +- Master URL (for starting) + +## History Server Scripts + +### start-history-server.sh / stop-history-server.sh + +Start or stop the Spark History Server for viewing completed application logs. + +**Usage:** +```bash +# Start history server +./sbin/start-history-server.sh + +# Stop history server +./sbin/stop-history-server.sh +``` + +**History Server UI**: Access at `http://:18080/` + +**Configuration:** +```properties +# In conf/spark-defaults.conf +spark.history.fs.logDirectory=hdfs://namenode/spark-logs +spark.history.ui.port=18080 +spark.eventLog.enabled=true +spark.eventLog.dir=hdfs://namenode/spark-logs +``` + +**Requirements:** +- Applications must have event logging enabled +- Log directory must be accessible + +## Shuffle Service Scripts + +### start-shuffle-service.sh / stop-shuffle-service.sh + +Start or stop the external shuffle service (for YARN). + +**Usage:** +```bash +# Start shuffle service +./sbin/start-shuffle-service.sh + +# Stop shuffle service +./sbin/stop-shuffle-service.sh +``` + +**Note**: Typically used only when running on YARN without the YARN auxiliary service. + +## Configuration Files + +### conf/workers + +Lists worker hostnames, one per line. + +**Example:** +``` +worker1.example.com +worker2.example.com +worker3.example.com +``` + +**Usage:** +- Used by `start-all.sh` and `start-workers.sh` +- Each line should contain a hostname or IP address +- Blank lines and lines starting with `#` are ignored + +### conf/spark-env.sh + +Environment variables for Spark daemons. 
+ +**Example:** +```bash +#!/usr/bin/env bash + +# Java +export JAVA_HOME=/usr/lib/jvm/java-17 + +# Master settings +export SPARK_MASTER_HOST=master.example.com +export SPARK_MASTER_PORT=7077 +export SPARK_MASTER_WEBUI_PORT=8080 + +# Worker settings +export SPARK_WORKER_CORES=8 +export SPARK_WORKER_MEMORY=16g +export SPARK_WORKER_PORT=7078 +export SPARK_WORKER_WEBUI_PORT=8081 +export SPARK_WORKER_DIR=/var/spark/work + +# Directories +export SPARK_LOG_DIR=/var/log/spark +export SPARK_PID_DIR=/var/run/spark + +# History Server +export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://namenode/spark-logs" + +# Additional Java options +export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181" +``` + +**Key Variables:** + +**Master:** +- `SPARK_MASTER_HOST`: Master hostname +- `SPARK_MASTER_PORT`: Master port (default: 7077) +- `SPARK_MASTER_WEBUI_PORT`: Web UI port (default: 8080) + +**Worker:** +- `SPARK_WORKER_CORES`: Number of cores per worker +- `SPARK_WORKER_MEMORY`: Memory per worker (e.g., 16g) +- `SPARK_WORKER_PORT`: Worker communication port +- `SPARK_WORKER_WEBUI_PORT`: Worker web UI port (default: 8081) +- `SPARK_WORKER_DIR`: Directory for scratch space and logs +- `SPARK_WORKER_INSTANCES`: Number of worker instances per machine + +**General:** +- `SPARK_LOG_DIR`: Directory for daemon logs +- `SPARK_PID_DIR`: Directory for PID files +- `SPARK_IDENT_STRING`: Identifier for daemons (default: username) +- `SPARK_NICENESS`: Nice value for daemons +- `SPARK_DAEMON_MEMORY`: Memory for daemon processes + +## Setting Up a Standalone Cluster + +### Step 1: Install Spark on All Nodes + +```bash +# Download and extract Spark on each machine +tar xzf spark-X.Y.Z-bin-hadoopX.tgz +cd spark-X.Y.Z-bin-hadoopX +``` + +### Step 2: Configure spark-env.sh + +Create `conf/spark-env.sh` from template: +```bash +cp conf/spark-env.sh.template conf/spark-env.sh +# Edit conf/spark-env.sh with appropriate settings +``` + +### Step 3: Configure Workers File + +Create `conf/workers`: +```bash +cp conf/workers.template conf/workers +# Add worker hostnames, one per line +``` + +### Step 4: Configure SSH Access + +Set up password-less SSH from master to all workers: +```bash +ssh-keygen -t rsa +ssh-copy-id user@worker1 +ssh-copy-id user@worker2 +# ... for each worker +``` + +### Step 5: Synchronize Configuration + +Copy configuration to all workers: +```bash +for host in $(cat conf/workers); do + rsync -av conf/ user@$host:spark/conf/ +done +``` + +### Step 6: Start the Cluster + +```bash +./sbin/start-all.sh +``` + +### Step 7: Verify + +- Check master UI: `http://master:8080` +- Check worker UIs: `http://worker1:8081`, etc. +- Look for workers registered with master + +## High Availability + +For production deployments, configure high availability with ZooKeeper. + +### ZooKeeper-based HA Configuration + +**In conf/spark-env.sh:** +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.deploy.recoveryMode=ZOOKEEPER + -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 + -Dspark.deploy.zookeeper.dir=/spark +" +``` + +### Start Multiple Masters + +```bash +# On master1 +./sbin/start-master.sh + +# On master2 +./sbin/start-master.sh + +# On master3 +./sbin/start-master.sh +``` + +### Connect Workers to All Masters + +```bash +./sbin/start-worker.sh spark://master1:7077,master2:7077,master3:7077 +``` + +**Automatic failover:** If active master fails, standby masters detect the failure and one becomes active. 
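+
+One way to verify which master is currently active is to query each master's web UI JSON endpoint (this assumes the default web UI port 8080; the exact JSON layout can vary between Spark versions):
+
+```bash
+# Report each master's status: ALIVE for the active master, STANDBY for the others
+for m in master1 master2 master3; do
+  printf '%s: ' "$m"
+  curl -s "http://$m:8080/json/" | grep -o '"status" *: *"[A-Z]*"'
+done
+```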
+ +## Monitoring and Logs + +### Log Files + +Daemon logs are written to `$SPARK_LOG_DIR` (default: `logs/`): + +```bash +# Master log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.master.Master-*.out + +# Worker log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.worker.Worker-*.out + +# History Server log +$SPARK_LOG_DIR/spark-$USER-org.apache.spark.deploy.history.HistoryServer-*.out +``` + +### View Logs + +```bash +# Tail master log +tail -f logs/spark-*-master-*.out + +# Tail worker log +tail -f logs/spark-*-worker-*.out + +# Search for errors +grep ERROR logs/spark-*-master-*.out +``` + +### Web UIs + +- **Master UI**: `http://:8080` - Cluster status, workers, applications +- **Worker UI**: `http://:8081` - Worker status, running executors +- **Application UI**: `http://:4040` - Running application metrics +- **History Server**: `http://:18080` - Completed applications + +## Advanced Configuration + +### Memory Overhead + +Reserve memory for system processes: +```bash +export SPARK_DAEMON_MEMORY=2g +``` + +### Multiple Workers per Machine + +Run multiple worker instances on a single machine: +```bash +export SPARK_WORKER_INSTANCES=2 +export SPARK_WORKER_CORES=4 # Cores per instance +export SPARK_WORKER_MEMORY=8g # Memory per instance +``` + +### Work Directory + +Change worker scratch space: +```bash +export SPARK_WORKER_DIR=/mnt/fast-disk/spark-work +``` + +### Port Configuration + +Use non-default ports: +```bash +export SPARK_MASTER_PORT=9077 +export SPARK_MASTER_WEBUI_PORT=9080 +export SPARK_WORKER_PORT=9078 +export SPARK_WORKER_WEBUI_PORT=9081 +``` + +## Security + +### Enable Authentication + +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.authenticate=true + -Dspark.authenticate.secret=your-secret-key +" +``` + +### Enable SSL + +```bash +export SPARK_DAEMON_JAVA_OPTS=" + -Dspark.ssl.enabled=true + -Dspark.ssl.keyStore=/path/to/keystore + -Dspark.ssl.keyStorePassword=password + -Dspark.ssl.trustStore=/path/to/truststore + -Dspark.ssl.trustStorePassword=password +" +``` + +## Troubleshooting + +### Master Won't Start + +**Check:** +1. Port 7077 is available: `netstat -an | grep 7077` +2. Hostname is resolvable: `ping $SPARK_MASTER_HOST` +3. Logs for errors: `cat logs/spark-*-master-*.out` + +### Workers Not Connecting + +**Check:** +1. Master URL is correct +2. Network connectivity: `telnet master 7077` +3. Firewall allows connections +4. Worker logs: `cat logs/spark-*-worker-*.out` + +### SSH Connection Issues + +**Solutions:** +1. Verify SSH key: `ssh worker1 echo test` +2. Check SSH config: `~/.ssh/config` +3. Use SSH agent: `eval $(ssh-agent); ssh-add` + +### Insufficient Resources + +**Check:** +- Worker has enough memory: `free -h` +- Enough cores available: `nproc` +- Disk space: `df -h` + +## Cluster Shutdown + +### Graceful Shutdown + +```bash +# Stop all workers first +./sbin/stop-workers.sh + +# Stop master +./sbin/stop-master.sh + +# Or stop everything +./sbin/stop-all.sh +``` + +### Check All Stopped + +```bash +# Check for running Java processes +jps | grep -E "(Master|Worker)" +``` + +### Force Kill if Needed + +```bash +# Kill any remaining Spark processes +pkill -f org.apache.spark.deploy +``` + +## Best Practices + +1. **Use HA in production**: Configure ZooKeeper-based HA +2. **Monitor resources**: Watch CPU, memory, disk usage +3. **Separate log directories**: Use dedicated disk for logs +4. **Regular maintenance**: Clean old logs and application data +5. **Automate startup**: Use systemd or init scripts +6. 
**Configure limits**: Set file descriptor and process limits +7. **Use external shuffle service**: For better fault tolerance +8. **Back up metadata**: Regularly back up ZooKeeper data + +## Scripts Reference + +| Script | Purpose | +|--------|---------| +| `start-all.sh` | Start master and all workers | +| `stop-all.sh` | Stop master and all workers | +| `start-master.sh` | Start master on current machine | +| `stop-master.sh` | Stop master | +| `start-worker.sh` | Start worker on current machine | +| `stop-worker.sh` | Stop worker | +| `start-workers.sh` | Start workers on all machines in `conf/workers` | +| `stop-workers.sh` | Stop all workers | +| `start-history-server.sh` | Start history server | +| `stop-history-server.sh` | Stop history server | + +## Further Reading + +- [Spark Standalone Mode](../docs/spark-standalone.md) +- [Cluster Mode Overview](../docs/cluster-overview.md) +- [Configuration Guide](../docs/configuration.md) +- [Security Guide](../docs/security.md) +- [Monitoring Guide](../docs/monitoring.md) + +## User-Facing Scripts + +For user-facing scripts (spark-submit, spark-shell, etc.), see [../bin/README.md](../bin/README.md). From 8c06e81657319a50f64a0ef9a7ce18e38971b68d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:27:58 +0000 Subject: [PATCH 5/6] Update main README with documentation links and add code documentation guide Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- CODE_DOCUMENTATION_GUIDE.md | 612 ++++++++++++++++++++++++++++++++++++ README.md | 26 +- 2 files changed, 636 insertions(+), 2 deletions(-) create mode 100644 CODE_DOCUMENTATION_GUIDE.md diff --git a/CODE_DOCUMENTATION_GUIDE.md b/CODE_DOCUMENTATION_GUIDE.md new file mode 100644 index 0000000000000..1229bc447140b --- /dev/null +++ b/CODE_DOCUMENTATION_GUIDE.md @@ -0,0 +1,612 @@ +# Code Documentation Guide + +This guide describes documentation standards for Apache Spark source code. + +## Overview + +Good documentation helps developers understand and maintain code. Spark follows industry-standard documentation practices for each language it supports. + +## Scala Documentation (Scaladoc) + +Scala code uses Scaladoc for API documentation. + +### Basic Format + +```scala +/** + * Brief one-line description. + * + * Detailed description that can span multiple lines. + * Explain what this class/method does, important behavior, + * and any constraints or assumptions. + * + * @param paramName description of parameter + * @param anotherParam description of another parameter + * @return description of return value + * @throws ExceptionType when this exception is thrown + * @since 3.5.0 + * @note Important note about usage or behavior + */ +def methodName(paramName: String, anotherParam: Int): ReturnType = { + // Implementation +} +``` + +### Class Documentation + +```scala +/** + * Brief description of the class purpose. + * + * Detailed explanation of the class functionality, usage patterns, + * and important considerations. 
+ * + * Example usage: + * {{{ + * val example = new MyClass(param1, param2) + * example.doSomething() + * }}} + * + * @constructor Creates a new instance with the given parameters + * @param config Configuration object + * @param isLocal Whether running in local mode + * @since 3.0.0 + */ +class MyClass(config: SparkConf, isLocal: Boolean) extends Logging { + // Class implementation +} +``` + +### Code Examples + +Use triple braces for code examples: + +```scala +/** + * Transforms the RDD by applying a function to each element. + * + * Example: + * {{{ + * val rdd = sc.parallelize(1 to 10) + * val doubled = rdd.map(_ * 2) + * doubled.collect() // Array(2, 4, 6, ..., 20) + * }}} + * + * @param f function to apply to each element + * @return transformed RDD + */ +def map[U: ClassTag](f: T => U): RDD[U] +``` + +### Annotations + +Use Spark annotations for API stability: + +```scala +/** + * :: Experimental :: + * This feature is experimental and may change in future releases. + */ +@Experimental +class ExperimentalFeature + +/** + * :: DeveloperApi :: + * This is a developer API and may change between minor versions. + */ +@DeveloperApi +class DeveloperFeature + +/** + * :: Unstable :: + * This API is unstable and may change in patch releases. + */ +@Unstable +class UnstableFeature +``` + +### Internal APIs + +Mark internal classes and methods: + +```scala +/** + * Internal utility class for XYZ. + * + * @note This is an internal API and may change without notice. + */ +private[spark] class InternalUtil + +/** + * Internal method used by scheduler. + */ +private[scheduler] def internalMethod(): Unit +``` + +## Java Documentation (Javadoc) + +Java code uses Javadoc for API documentation. + +### Basic Format + +```java +/** + * Brief one-line description. + *

+ * Detailed description that can span multiple paragraphs.
+ * Explain what this class/method does and important behavior.
+ *
+ * @param paramName description of parameter
+ * @param anotherParam description of another parameter
+ * @return description of return value
+ * @throws ExceptionType when this exception is thrown
+ * @since 3.5.0
+ */
+public ReturnType methodName(String paramName, int anotherParam)
+    throws ExceptionType {
+  // Implementation
+}
+```
+
+### Class Documentation
+
+```java
+/**
+ * Brief description of the class purpose.
+ * <p>
+ * Detailed explanation of functionality, usage patterns,
+ * and important considerations.
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * MyClass example = new MyClass(param1, param2);
+ * example.doSomething();
+ * }</pre>
+ *
+ * @param <T> type parameter description
+ * @since 3.0.0
+ */
+public class MyClass<T> implements Serializable {
+  // Class implementation
+}
+```
+
+### Interface Documentation
+
+```java
+/**
+ * Interface for shuffle block resolution.
+ * <p>
+ * Implementations of this interface are responsible for
+ * resolving shuffle block locations and reading shuffle data.
+ * + * @since 2.3.0 + */ +public interface ShuffleBlockResolver { + /** + * Gets the data for a shuffle block. + * + * @param blockId the block identifier + * @return managed buffer containing the block data + */ + ManagedBuffer getBlockData(BlockId blockId); +} +``` + +## Python Documentation (Docstrings) + +Python code uses docstrings following PEP 257 and Google style. + +### Function Documentation + +```python +def function_name(param1: str, param2: int) -> bool: + """ + Brief one-line description. + + Detailed description that can span multiple lines. + Explain what this function does, important behavior, + and any constraints. + + Parameters + ---------- + param1 : str + Description of param1 + param2 : int + Description of param2 + + Returns + ------- + bool + Description of return value + + Raises + ------ + ValueError + When input is invalid + + Examples + -------- + >>> result = function_name("test", 42) + >>> print(result) + True + + Notes + ----- + Important notes about usage or behavior. + + .. versionadded:: 3.5.0 + """ + # Implementation + pass +``` + +### Class Documentation + +```python +class MyClass: + """ + Brief description of the class. + + Detailed explanation of the class functionality, + usage patterns, and important considerations. + + Parameters + ---------- + config : dict + Configuration dictionary + is_local : bool, optional + Whether running in local mode (default is False) + + Attributes + ---------- + config : dict + Stored configuration + state : str + Current state of the object + + Examples + -------- + >>> obj = MyClass({'key': 'value'}, is_local=True) + >>> obj.do_something() + + Notes + ----- + This class is thread-safe. + + .. versionadded:: 3.0.0 + """ + + def __init__(self, config: dict, is_local: bool = False): + self.config = config + self.is_local = is_local + self.state = "initialized" +``` + +### Type Hints + +Use type hints consistently: + +```python +from typing import List, Optional, Dict, Any, Union +from pyspark.sql import DataFrame + +def process_data( + df: DataFrame, + columns: List[str], + options: Optional[Dict[str, Any]] = None +) -> Union[DataFrame, None]: + """ + Process DataFrame with specified columns. + + Parameters + ---------- + df : DataFrame + Input DataFrame to process + columns : list of str + Column names to include + options : dict, optional + Processing options + + Returns + ------- + DataFrame or None + Processed DataFrame, or None if processing fails + """ + pass +``` + +## R Documentation (Roxygen2) + +R code uses Roxygen2-style documentation. + +### Function Documentation + +```r +#' Brief one-line description +#' +#' Detailed description that can span multiple lines. +#' Explain what this function does and important behavior. +#' +#' @param param1 description of param1 +#' @param param2 description of param2 +#' @return description of return value +#' @examples +#' \dontrun{ +#' result <- myFunction(param1 = "test", param2 = 42) +#' print(result) +#' } +#' @note Important note about usage +#' @rdname function-name +#' @since 3.0.0 +#' @export +myFunction <- function(param1, param2) { + # Implementation +} +``` + +### Class Documentation + +```r +#' MyClass: A class for doing XYZ +#' +#' Detailed description of the class functionality +#' and usage patterns. +#' +#' @slot field1 description of field1 +#' @slot field2 description of field2 +#' @export +#' @since 3.0.0 +setClass("MyClass", + slots = c( + field1 = "character", + field2 = "numeric" + ) +) +``` + +## Documentation Best Practices + +### 1. 
Write Clear, Concise Descriptions + +**Good:** +```scala +/** + * Computes the mean of values in the RDD. + * + * @return the arithmetic mean, or NaN if the RDD is empty + */ +def mean(): Double +``` + +**Bad:** +```scala +/** + * This method calculates and returns the mean. + */ +def mean(): Double +``` + +### 2. Document Edge Cases + +```scala +/** + * Divides two numbers. + * + * @param a numerator + * @param b denominator + * @return result of a / b + * @throws ArithmeticException if b is zero + * @note Returns Double.PositiveInfinity if a > 0 and b = 0+ + */ +def divide(a: Double, b: Double): Double +``` + +### 3. Provide Examples + +Always include examples for public APIs: + +```scala +/** + * Filters elements using the given predicate. + * + * Example: + * {{{ + * val rdd = sc.parallelize(1 to 10) + * val evens = rdd.filter(_ % 2 == 0) + * evens.collect() // Array(2, 4, 6, 8, 10) + * }}} + */ +def filter(f: T => Boolean): RDD[T] +``` + +### 4. Document Thread Safety + +```scala +/** + * Thread-safe cache implementation. + * + * @note This class uses internal synchronization and is safe + * for concurrent access from multiple threads. + */ +class ConcurrentCache[K, V] extends Cache[K, V] +``` + +### 5. Document Performance Characteristics + +```scala +/** + * Sorts the RDD by key. + * + * @note This operation triggers a shuffle and is expensive. + * The time complexity is O(n log n) where n is the + * number of elements. + */ +def sortByKey(): RDD[(K, V)] +``` + +### 6. Link to Related APIs + +```scala +/** + * Maps elements to key-value pairs. + * + * @see [[groupByKey]] for grouping by keys + * @see [[reduceByKey]] for aggregating by keys + */ +def keyBy[K](f: T => K): RDD[(K, T)] +``` + +### 7. Version Information + +```scala +/** + * New feature introduced in 3.5.0. + * + * @since 3.5.0 + */ +def newMethod(): Unit + +/** + * Deprecated method, use [[newMethod]] instead. + * + * @deprecated Use newMethod() instead, since 3.5.0 + */ +@deprecated("Use newMethod() instead", "3.5.0") +def oldMethod(): Unit +``` + +## Internal Documentation + +### Code Comments + +Use comments for complex logic: + +```scala +// Sort by key and value to ensure deterministic output +// This is critical for testing and reproducing results +val sorted = data.sortBy(x => (x._1, x._2)) + +// TODO: Optimize this for large datasets +// Current implementation loads all data into memory +val result = computeExpensiveOperation() + +// FIXME: This breaks when input size exceeds Int.MaxValue +val size = data.size.toInt +``` + +### Architecture Comments + +Document architectural decisions: + +```scala +/** + * Internal scheduler implementation. + * + * Architecture: + * 1. Jobs are submitted to DAGScheduler + * 2. DAGScheduler creates stages based on shuffle boundaries + * 3. Each stage is submitted as a TaskSet to TaskScheduler + * 4. TaskScheduler assigns tasks to executors + * 5. 
Task results are returned to the driver + * + * Thread Safety: + * - DAGScheduler runs in a single thread (event loop) + * - TaskScheduler methods are thread-safe + * - Results are collected with appropriate synchronization + */ +private[spark] class SchedulerImpl +``` + +## Generating Documentation + +### Scaladoc + +```bash +# Generate Scaladoc +./build/mvn scala:doc + +# Output in target/site/scaladocs/ +``` + +### Javadoc + +```bash +# Generate Javadoc +./build/mvn javadoc:javadoc + +# Output in target/site/apidocs/ +``` + +### Python Documentation + +```bash +# Generate Sphinx documentation +cd python/docs +make html + +# Output in _build/html/ +``` + +### R Documentation + +```bash +# Generate R documentation +cd R/pkg +R CMD Rd2pdf . +``` + +## Documentation Review Checklist + +When reviewing documentation: + +- [ ] Is the description clear and accurate? +- [ ] Are all parameters documented? +- [ ] Is the return value documented? +- [ ] Are exceptions/errors documented? +- [ ] Are examples provided for public APIs? +- [ ] Is thread safety documented if relevant? +- [ ] Are performance characteristics noted? +- [ ] Is version information included? +- [ ] Are deprecated APIs marked? +- [ ] Are there links to related APIs? +- [ ] Is internal vs. public API clearly marked? + +## Tools + +### IDE Support + +- **IntelliJ IDEA**: Auto-generates documentation templates +- **VS Code**: Extensions for Scaladoc/Javadoc +- **Eclipse**: Built-in Javadoc support + +### Linters + +- **Scalastyle**: Checks for missing Scaladoc +- **Checkstyle**: Validates Javadoc +- **Pylint**: Checks Python docstrings +- **roxygen2**: Validates R documentation + +## Resources + +- [Scaladoc Style Guide](https://docs.scala-lang.org/style/scaladoc.html) +- [Oracle Javadoc Guide](https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html) +- [PEP 257 - Docstring Conventions](https://www.python.org/dev/peps/pep-0257/) +- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) +- [Roxygen2 Documentation](https://roxygen2.r-lib.org/) + +## Contributing + +When contributing code to Spark: + +1. Follow the documentation style for your language +2. Document all public APIs +3. Include examples for new features +4. Update existing documentation when changing behavior +5. Run documentation generators to verify formatting + +For more information, see [CONTRIBUTING.md](CONTRIBUTING.md). diff --git a/README.md b/README.md index 65dfd67ac520e..0dd1f7f173bea 100644 --- a/README.md +++ b/README.md @@ -15,11 +15,33 @@ and Structured Streaming for stream processing. [![PyPI Downloads](https://static.pepy.tech/personalized-badge/pyspark?period=month&units=international_system&left_color=black&right_color=orange&left_text=PyPI%20downloads)](https://pypi.org/project/pyspark/) -## Online Documentation +## Documentation You can find the latest Spark documentation, including a programming guide, on the [project web page](https://spark.apache.org/documentation.html). -This README file only contains basic setup instructions. 
+ +### Repository Documentation + +- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Spark architecture overview and component descriptions +- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide with build instructions, testing, and IDE setup +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to Apache Spark + +### Module Documentation + +- **[core/](core/README.md)** - Spark Core: RDDs, scheduling, memory management, storage +- **[sql/](sql/README.md)** - Spark SQL: DataFrames, Datasets, SQL engine, data sources +- **[streaming/](streaming/README.md)** - Spark Streaming: DStreams and Structured Streaming +- **[mllib/](mllib/README.md)** - MLlib: Machine learning library with algorithms and pipelines +- **[graphx/](graphx/README.md)** - GraphX: Graph processing framework and algorithms +- **[examples/](examples/README.md)** - Example programs in Scala, Java, Python, and R + +### Tools and Utilities + +- **[bin/](bin/README.md)** - User-facing scripts (spark-submit, spark-shell, pyspark, etc.) +- **[sbin/](sbin/README.md)** - Admin scripts for managing Spark standalone clusters +- **[launcher/](launcher/README.md)** - Programmatic API for launching Spark applications +- **[resource-managers/](resource-managers/README.md)** - Integrations with YARN, Kubernetes, and Mesos +- **[common/](common/README.md)** - Common utilities and libraries shared across modules ## Build Pipeline Status From fa25e0c24eac4bb8108330cacf4b705bf6a57e3f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sun, 19 Oct 2025 17:29:36 +0000 Subject: [PATCH 6/6] Add comprehensive documentation index for easy navigation Co-authored-by: GizzZmo <8039975+GizzZmo@users.noreply.github.com> --- DOCUMENTATION_INDEX.md | 345 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 345 insertions(+) create mode 100644 DOCUMENTATION_INDEX.md diff --git a/DOCUMENTATION_INDEX.md b/DOCUMENTATION_INDEX.md new file mode 100644 index 0000000000000..cd4227c35df67 --- /dev/null +++ b/DOCUMENTATION_INDEX.md @@ -0,0 +1,345 @@ +# Apache Spark Documentation Index + +This document provides a complete index of all documentation available in the Apache Spark repository. 
+ +## Quick Start + +- **[README.md](README.md)** - Main project README with quick start guide +- **[docs/quick-start.md](docs/quick-start.md)** - Interactive tutorial for getting started +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - How to contribute to the project + +## Architecture and Development + +### Core Documentation +- **[ARCHITECTURE.md](ARCHITECTURE.md)** - Complete Spark architecture overview + - Core components and their responsibilities + - Execution model and data flow + - Module structure and dependencies + - Key subsystems (memory, shuffle, storage, networking) + +- **[DEVELOPMENT.md](DEVELOPMENT.md)** - Developer guide + - Setting up development environment + - Building and testing instructions + - IDE configuration + - Code style guidelines + - Debugging techniques + - Common development tasks + +- **[CODE_DOCUMENTATION_GUIDE.md](CODE_DOCUMENTATION_GUIDE.md)** - Code documentation standards + - Scaladoc guidelines + - Javadoc guidelines + - Python docstring conventions + - R documentation standards + - Best practices and examples + +## Module Documentation + +### Core Modules + +#### Spark Core +- **[core/README.md](core/README.md)** - Spark Core documentation + - RDD API and operations + - SparkContext and configuration + - Task scheduling (DAGScheduler, TaskScheduler) + - Memory management + - Shuffle system + - Storage system + - Serialization + +#### Spark SQL +- **[sql/README.md](sql/README.md)** - Spark SQL documentation (if exists) +- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming guide +- **[docs/sql-data-sources.md](docs/sql-data-sources.md)** - Data source integration +- **[docs/sql-performance-tuning.md](docs/sql-performance-tuning.md)** - Performance tuning + +#### Streaming +- **[streaming/README.md](streaming/README.md)** - Spark Streaming documentation + - DStreams API (legacy) + - Structured Streaming (recommended) + - Input sources and output sinks + - Windowing and stateful operations + - Performance tuning + +#### MLlib +- **[mllib/README.md](mllib/README.md)** - MLlib documentation + - ML Pipeline API (spark.ml) + - RDD-based API (spark.mllib - maintenance mode) + - Classification and regression algorithms + - Clustering algorithms + - Feature engineering + - Model selection and tuning + +#### GraphX +- **[graphx/README.md](graphx/README.md)** - GraphX documentation + - Property graphs + - Graph operators + - Graph algorithms (PageRank, Connected Components, etc.) 
+ - Pregel API + - Performance optimization + +### Common Modules +- **[common/README.md](common/README.md)** - Common utilities documentation + - Network communication (network-common, network-shuffle) + - Key-value store + - Sketching algorithms + - Unsafe operations + +### Tools and Utilities + +#### User-Facing Tools +- **[bin/README.md](bin/README.md)** - User scripts documentation + - spark-submit: Application submission + - spark-shell: Interactive Scala shell + - pyspark: Interactive Python shell + - sparkR: Interactive R shell + - spark-sql: SQL query shell + - run-example: Example runner + +#### Administration Tools +- **[sbin/README.md](sbin/README.md)** - Admin scripts documentation + - Cluster management scripts + - start-all.sh / stop-all.sh + - Master and worker daemon management + - History server setup + - Standalone cluster configuration + +#### Programmatic API +- **[launcher/README.md](launcher/README.md)** - Launcher API documentation + - SparkLauncher for programmatic application launching + - SparkAppHandle for monitoring + - Integration patterns + +#### Resource Managers +- **[resource-managers/README.md](resource-managers/README.md)** - Resource manager integrations + - YARN integration + - Kubernetes integration + - Mesos integration + - Comparison and configuration + +### Examples +- **[examples/README.md](examples/README.md)** - Example programs + - Core examples (RDD operations) + - SQL examples (DataFrames) + - Streaming examples + - MLlib examples + - GraphX examples + - Running examples + +## Official Documentation + +### Programming Guides +- **[docs/programming-guide.md](docs/programming-guide.md)** - General programming guide +- **[docs/rdd-programming-guide.md](docs/rdd-programming-guide.md)** - RDD programming +- **[docs/sql-programming-guide.md](docs/sql-programming-guide.md)** - SQL programming +- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming +- **[docs/streaming-programming-guide.md](docs/streaming-programming-guide.md)** - DStreams (legacy) +- **[docs/ml-guide.md](docs/ml-guide.md)** - Machine learning guide +- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - Graph processing + +### Deployment +- **[docs/cluster-overview.md](docs/cluster-overview.md)** - Cluster mode overview +- **[docs/submitting-applications.md](docs/submitting-applications.md)** - Application submission +- **[docs/spark-standalone.md](docs/spark-standalone.md)** - Standalone cluster mode +- **[docs/running-on-yarn.md](docs/running-on-yarn.md)** - Running on YARN +- **[docs/running-on-kubernetes.md](docs/running-on-kubernetes.md)** - Running on Kubernetes + +### Configuration and Tuning +- **[docs/configuration.md](docs/configuration.md)** - Configuration reference +- **[docs/tuning.md](docs/tuning.md)** - Performance tuning guide +- **[docs/hardware-provisioning.md](docs/hardware-provisioning.md)** - Hardware recommendations +- **[docs/job-scheduling.md](docs/job-scheduling.md)** - Job scheduling +- **[docs/monitoring.md](docs/monitoring.md)** - Monitoring and instrumentation + +### Advanced Topics +- **[docs/security.md](docs/security.md)** - Security guide +- **[docs/cloud-integration.md](docs/cloud-integration.md)** - Cloud storage integration +- **[docs/building-spark.md](docs/building-spark.md)** - Building from source + +### Migration Guides +- **[docs/core-migration-guide.md](docs/core-migration-guide.md)** - Core API migration +- 
**[docs/sql-migration-guide.md](docs/sql-migration-guide.md)** - SQL migration +- **[docs/ml-migration-guide.md](docs/ml-migration-guide.md)** - MLlib migration +- **[docs/pyspark-migration-guide.md](docs/pyspark-migration-guide.md)** - PySpark migration +- **[docs/ss-migration-guide.md](docs/ss-migration-guide.md)** - Structured Streaming migration + +### API References +- **[docs/sql-ref.md](docs/sql-ref.md)** - SQL reference +- **[docs/sql-ref-functions.md](docs/sql-ref-functions.md)** - SQL functions +- **[docs/sql-ref-datatypes.md](docs/sql-ref-datatypes.md)** - SQL data types +- **[docs/sql-ref-syntax.md](docs/sql-ref-syntax.md)** - SQL syntax + +## Language-Specific Documentation + +### Python (PySpark) +- **[python/README.md](python/README.md)** - PySpark overview +- **[python/docs/](python/docs/)** - PySpark documentation source +- **[docs/api/python/](docs/api/python/)** - Python API docs (generated) + +### R (SparkR) +- **[R/README.md](R/README.md)** - SparkR overview +- **[docs/sparkr.md](docs/sparkr.md)** - SparkR guide +- **[R/pkg/README.md](R/pkg/README.md)** - R package documentation + +### Scala +- **[docs/api/scala/](docs/api/scala/)** - Scala API docs (generated) + +### Java +- **[docs/api/java/](docs/api/java/)** - Java API docs (generated) + +## Data Sources + +### Built-in Sources +- **[docs/sql-data-sources-load-save-functions.md](docs/sql-data-sources-load-save-functions.md)** +- **[docs/sql-data-sources-parquet.md](docs/sql-data-sources-parquet.md)** +- **[docs/sql-data-sources-json.md](docs/sql-data-sources-json.md)** +- **[docs/sql-data-sources-csv.md](docs/sql-data-sources-csv.md)** +- **[docs/sql-data-sources-jdbc.md](docs/sql-data-sources-jdbc.md)** +- **[docs/sql-data-sources-avro.md](docs/sql-data-sources-avro.md)** +- **[docs/sql-data-sources-orc.md](docs/sql-data-sources-orc.md)** + +### External Integrations +- **[docs/streaming-kafka-integration.md](docs/streaming-kafka-integration.md)** - Kafka integration +- **[docs/streaming-kinesis-integration.md](docs/streaming-kinesis-integration.md)** - Kinesis integration +- **[docs/structured-streaming-kafka-integration.md](docs/structured-streaming-kafka-integration.md)** - Structured Streaming with Kafka + +## Special Topics + +### Machine Learning +- **[docs/ml-pipeline.md](docs/ml-pipeline.md)** - ML Pipelines +- **[docs/ml-features.md](docs/ml-features.md)** - Feature transformers +- **[docs/ml-classification-regression.md](docs/ml-classification-regression.md)** - Classification/Regression +- **[docs/ml-clustering.md](docs/ml-clustering.md)** - Clustering +- **[docs/ml-collaborative-filtering.md](docs/ml-collaborative-filtering.md)** - Recommendation +- **[docs/ml-tuning.md](docs/ml-tuning.md)** - Hyperparameter tuning + +### Streaming +- **[docs/structured-streaming-programming-guide.md](docs/structured-streaming-programming-guide.md)** - Structured Streaming guide + +### Graph Processing +- **[docs/graphx-programming-guide.md](docs/graphx-programming-guide.md)** - GraphX guide + +## Additional Resources + +### Community +- **[Apache Spark Website](https://spark.apache.org/)** - Official website +- **[Spark Documentation](https://spark.apache.org/documentation.html)** - Online docs +- **[Developer Tools](https://spark.apache.org/developer-tools.html)** - Developer resources +- **[Community](https://spark.apache.org/community.html)** - Mailing lists and chat + +### External Links +- **[Spark JIRA](https://issues.apache.org/jira/projects/SPARK)** - Issue tracker +- **[GitHub 
Repository](https://github.com/apache/spark)** - Source code +- **[Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark)** - Q&A + +## Document Organization + +### By Audience + +**For Users:** +- Quick Start Guide +- Programming Guides (SQL, Streaming, MLlib, GraphX) +- Configuration Guide +- Deployment Guides (YARN, Kubernetes) +- Examples + +**For Developers:** +- ARCHITECTURE.md +- DEVELOPMENT.md +- CODE_DOCUMENTATION_GUIDE.md +- Module READMEs +- Building Guide + +**For Administrators:** +- Cluster Overview +- Standalone Mode Guide +- Monitoring Guide +- Security Guide +- Admin Scripts (sbin/) + +### By Topic + +**Getting Started:** +1. README.md +2. docs/quick-start.md +3. docs/programming-guide.md + +**Core Concepts:** +1. ARCHITECTURE.md +2. core/README.md +3. docs/rdd-programming-guide.md + +**Data Processing:** +1. docs/sql-programming-guide.md +2. docs/structured-streaming-programming-guide.md +3. docs/ml-guide.md + +**Deployment:** +1. docs/cluster-overview.md +2. docs/submitting-applications.md +3. docs/running-on-yarn.md or docs/running-on-kubernetes.md + +**Optimization:** +1. docs/tuning.md +2. docs/sql-performance-tuning.md +3. docs/hardware-provisioning.md + +## Documentation Standards + +All documentation follows these principles: + +1. **Clarity**: Clear, concise explanations +2. **Completeness**: Comprehensive coverage of topics +3. **Examples**: Code examples for all concepts +4. **Structure**: Consistent formatting and organization +5. **Accuracy**: Up-to-date and technically correct +6. **Accessibility**: Easy to find and navigate + +## Contributing to Documentation + +To contribute to Spark documentation: + +1. Follow the style guides in CODE_DOCUMENTATION_GUIDE.md +2. Update relevant documentation when changing code +3. Add examples for new features +4. Test documentation builds locally +5. Submit pull requests with documentation updates + +See [CONTRIBUTING.md](CONTRIBUTING.md) for details. + +## Building Documentation + +### Building User Documentation +```bash +cd docs +bundle install +bundle exec jekyll serve +# View at http://localhost:4000 +``` + +### Building API Documentation +```bash +# Scala API docs +./build/mvn scala:doc + +# Java API docs +./build/mvn javadoc:javadoc + +# Python API docs +cd python/docs +make html +``` + +## Getting Help + +If you can't find what you're looking for: + +1. Check the [Documentation Index](https://spark.apache.org/documentation.html) +2. Search [Stack Overflow](https://stackoverflow.com/questions/tagged/apache-spark) +3. Ask on the [user mailing list](mailto:user@spark.apache.org) +4. Check [Spark JIRA](https://issues.apache.org/jira/projects/SPARK) for known issues + +## Last Updated + +This index was last updated: 2025-10-19 + +For the most up-to-date documentation, visit [spark.apache.org/docs/latest](https://spark.apache.org/docs/latest/).