A comprehensive, educational demonstration of Apache Spark, PySpark DataFrames, and GraphFrames for graph analytics. This project runs entirely on a local Spark instance and includes three different graph datasets with multiple graph algorithms.
- Local Spark Setup: No cluster required - runs on your laptop
- Three Graph Types: Social network, citation network, and transportation network
- Four Graph Algorithms: PageRank, Connected Components, Shortest Paths, and Triangle Counting
- Synthetic Data: Realistic, reproducible datasets for learning
- Interactive Notebook: Step-by-step Jupyter notebook with explanations and visualizations
graphframes/
├── README.md # This file
├── requirements.txt # Python dependencies
├── setup_environment.sh # Automated setup script
├── data/ # Generated datasets
│ ├── social_network/
│ ├── citation_network/
│ └── transportation_network/
├── src/ # Python modules
│ ├── data_generator.py # Data synthesis
│ ├── graph_utils.py # Graph utilities
│ └── visualization.py # Plotting functions
└── notebooks/
└── comprehensive_demo.ipynb # Main demo notebook
- Java 11 or 17 (required for PySpark/Spark)
  - macOS: brew install openjdk@11
  - Ubuntu/Debian: sudo apt install openjdk-11-jdk
  - Windows: download from Adoptium
- Python 3.8 or higher (tested with Python 3.12.2)
- 4GB+ RAM recommended
- macOS, Linux, or Windows with WSL
# Install Java (if not already installed)
# macOS:
brew install openjdk@11
# Add Java to your shell profile (~/.zshrc or ~/.bash_profile)
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc
source ~/.zshrc
# Create conda environment from YAML file
conda env create -f environment.yml
# Activate the environment
conda activate graphframes-demo
# Verify Java is accessible
java -version

# Run the setup script
./setup_environment.sh
# Activate the virtual environment
source venv/bin/activate

# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

Generate the synthetic graph datasets:
# Ensure virtual environment is activated
source venv/bin/activate
# Generate all datasets
python -c "from src.data_generator import generate_all_datasets; generate_all_datasets()"

This creates three datasets (a loading sketch follows the list):
- Social Network: 800 users, ~2500 friendships
- Citation Network: 1000 papers, ~3500 citations
- Transportation Network: 250 stations, ~500 routes
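As a quick sanity check, the generated files can be loaded straight into Spark DataFrames. The sketch below assumes the generator writes CSV files with headers under each dataset directory; the file names nodes.csv and edges.csv are hypothetical, so adjust them to whatever src/data_generator.py actually emits:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-demo")
         .getOrCreate())

# Hypothetical file names -- check data/social_network/ for the actual layout
users = spark.read.csv("data/social_network/nodes.csv",
                       header=True, inferSchema=True)
friendships = spark.read.csv("data/social_network/edges.csv",
                             header=True, inferSchema=True)

users.printSchema()
friendships.show(5)
```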
Start Jupyter Notebook:
jupyter notebook notebooks/comprehensive_demo.ipynb

The notebook includes (a starter GraphFrame snippet follows the list):
- PySpark Basics - DataFrame operations, SQL queries, joins
- GraphFrames Introduction - Graph creation and basic queries
- Social Network Analysis - Community detection and influence
- Citation Network Analysis - Academic impact and research clusters
- Transportation Network Analysis - Route optimization and hub identification
- Advanced Topics - Motif finding, subgraphs, and performance tips
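Before opening the notebook, here is a minimal, self-contained sketch of the GraphFrame basics it covers. GraphFrames requires a vertex DataFrame with an id column and an edge DataFrame with src and dst columns; the toy data and app name below are purely illustrative:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# The graphframes package must be on the classpath (see Troubleshooting below)
spark = (SparkSession.builder
         .appName("graphframes-intro")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
         .getOrCreate())

# Vertices need an `id` column; edges need `src` and `dst` columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                   # per-vertex in-degree
print(g.edges.filter("relationship = 'friend'").count())
```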
PageRank ranks nodes by importance based on their incoming edges, weighting links from highly ranked nodes more heavily (a usage sketch follows the list).
- Social Network: Find influential users
- Citation Network: Identify seminal papers
- Transportation Network: Find major transit hubs
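A minimal PageRank invocation with GraphFrames looks like the following; the graph g is assumed to be a GraphFrame as constructed above, and the iteration count and reset probability are illustrative defaults rather than the notebook's exact settings:

```python
# Run PageRank for a fixed number of iterations
results = g.pageRank(resetProbability=0.15, maxIter=10)

# Vertices gain a `pagerank` column; higher means more influential
results.vertices.select("id", "pagerank") \
       .orderBy("pagerank", ascending=False) \
       .show(10)
```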
Connected Components identifies the disconnected subgraphs in a network (sketch after the list).
- Social Network: Discover isolated communities
- Citation Network: Find research clusters
- Transportation Network: Check network connectivity
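A minimal connectedComponents() call on an assumed GraphFrame g; note that the algorithm requires a checkpoint directory (see Troubleshooting):

```python
import os
import tempfile

# connectedComponents() requires a checkpoint directory
spark.sparkContext.setCheckpointDir(
    os.path.join(tempfile.gettempdir(), "spark-checkpoint"))

# Each vertex is assigned a `component` id; vertices sharing an id are connected
components = g.connectedComponents()
components.groupBy("component").count() \
          .orderBy("count", ascending=False) \
          .show()
```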
Shortest Paths calculates the shortest paths between nodes; in GraphFrames these are measured in hops from every vertex to a set of landmark vertices (sketch after the list).
- Social Network: Degrees of separation
- Citation Network: Trace idea lineage
- Transportation Network: Route optimization
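A sketch of GraphFrames' shortestPaths on an assumed GraphFrame g, with landmark ids chosen purely for illustration:

```python
# Hop distances from every vertex to the landmark vertices "a" and "c"
paths = g.shortestPaths(landmarks=["a", "c"])

# `distances` is a map from landmark id to the number of hops
paths.select("id", "distances").show(truncate=False)
```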
Triangle Counting counts the triangles each vertex participates in (sketch after the list).
- Social Network: Measure social cohesion
- Citation Network: Cross-citation patterns
- Transportation Network: Route redundancy
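A minimal triangle-count call on an assumed GraphFrame g:

```python
# `count` holds the number of triangles each vertex participates in
triangles = g.triangleCount()
triangles.select("id", "count") \
         .orderBy("count", ascending=False) \
         .show(10)
```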
The social network simulates a social media platform with the following (a motif-finding sketch follows the list):
- Users with names, ages, cities, occupations
- Friendships with relationship types and interaction scores
- Community structure and preferential attachment
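Motif finding (covered in the notebook's advanced section) is a natural fit for this dataset. The sketch below looks for reciprocal friendships on an assumed social-network GraphFrame g; the interaction_score column name is a hypothetical stand-in for whatever attribute the generator actually produces:

```python
# Find pairs of users who are friends in both directions
mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")

# Hypothetical attribute name: keep only strongly interacting pairs
mutual.filter("e1.interaction_score > 0.5") \
      .select("a.id", "b.id") \
      .show()
```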
The citation network models academic paper citations with the following (a temporal sanity check follows the list):
- Papers with titles, authors, years, venues, and fields
- Citations with context (background, methodology, comparison)
- Temporal ordering (papers only cite older papers)
- Influential "hub" papers
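The temporal-ordering property can be verified directly with a couple of joins. In this sketch, papers and citations are assumed DataFrames (vertices and edges, where src cites dst), and the year column name is an assumption about the generated schema:

```python
from pyspark.sql import functions as F

# Attach publication years to both endpoints of each citation (src cites dst)
src_years = papers.select(F.col("id").alias("src"), F.col("year").alias("src_year"))
dst_years = papers.select(F.col("id").alias("dst"), F.col("year").alias("dst_year"))

# A citation is valid only if the cited paper (dst) is strictly older
violations = (citations.join(src_years, "src")
                       .join(dst_years, "dst")
                       .filter(F.col("dst_year") >= F.col("src_year")))
print("temporal violations:", violations.count())  # expected: 0
```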
The transportation network represents a multi-modal transit system with the following (a per-mode filtering sketch follows the list):
- Stations with coordinates and passenger volumes
- Routes with distances, travel times, and frequencies
- Hub-and-spoke topology
- Multiple transportation modes (metro, bus, train)
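Because each route carries a transportation mode, per-mode subgraphs fall out of a plain DataFrame filter. A sketch on an assumed transportation GraphFrame g, where the mode column name is an assumption:

```python
import os
import tempfile
from graphframes import GraphFrame

# connectedComponents() needs a checkpoint directory (see Troubleshooting)
spark.sparkContext.setCheckpointDir(
    os.path.join(tempfile.gettempdir(), "spark-checkpoint"))

# Build a metro-only subgraph by filtering the edge DataFrame
metro_graph = GraphFrame(g.vertices, g.edges.filter("mode = 'metro'"))

# Check whether the metro layer alone is fully connected
n_components = (metro_graph.connectedComponents()
                           .select("component").distinct().count())
print("metro components:", n_components)
```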
Modify dataset parameters in src/data_generator.py:
# Adjust network sizes
generate_social_network(n_users=500, n_friendships=1500)
generate_citation_network(n_papers=800, n_citations=2500)
generate_transportation_network(n_stations=200, n_routes=400)

All generators use seed=42 for reproducibility. Change the seed for different networks.
If you see "Unable to locate a Java Runtime" or "JAVA_GATEWAY_EXITED":
macOS:
# Install Java
brew install openjdk@11
# Set environment variables
export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
# Add to shell profile for persistence
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc
# Verify installation
java -version

Linux:
# Install Java
sudo apt install openjdk-11-jdk
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc

Important: After setting Java environment variables, restart your terminal or Jupyter server for the changes to take effect.
If you see "Checkpoint directory is not set" when using connectedComponents():
# Add this after creating SparkSession
import tempfile
import os
checkpoint_dir = os.path.join(tempfile.gettempdir(), "spark-checkpoint")
spark.sparkContext.setCheckpointDir(checkpoint_dir)

If you see "java.lang.ClassNotFoundException: org.graphframes":
# In your notebook, ensure this configuration:
spark = SparkSession.builder \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12") \
    .getOrCreate()

Increase Spark memory allocation:
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

Ensure PySpark uses the correct Python:
export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

This project is for educational purposes. Feel free to use and modify it for learning.
Suggestions and improvements are welcome! This is a learning resource designed to help others understand graph analytics with PySpark.