A comprehensive, educational demonstration of Apache Spark, PySpark DataFrames, and GraphFrames for graph analytics. This project runs entirely on a local Spark instance and includes three different graph datasets with multiple graph algorithms.
- Local Spark Setup: No cluster required - runs on your laptop
- Three Graph Types: Social network, citation network, and transportation network
- Four Graph Algorithms: PageRank, Connected Components, Shortest Paths, and Triangle Counting
- Synthetic Data: Realistic, reproducible datasets for learning
- Interactive Notebook: Step-by-step Jupyter notebook with explanations and visualizations
graphframes/
├── README.md # This file
├── requirements.txt # Python dependencies
├── setup_environment.sh # Automated setup script
├── data/ # Generated datasets
│ ├── social_network/
│ ├── citation_network/
│ └── transportation_network/
├── src/ # Python modules
│ ├── data_generator.py # Data synthesis
│ ├── graph_utils.py # Graph utilities
│ └── visualization.py # Plotting functions
└── notebooks/
└── comprehensive_demo.ipynb # Main demo notebook
- Java 11 or 17 (required for PySpark/Spark)
  - macOS: brew install openjdk@11
  - Ubuntu/Debian: sudo apt install openjdk-11-jdk
  - Windows: download from Adoptium
- Python 3.8 or higher (tested with Python 3.12.2)
- 4GB+ RAM recommended
- macOS, Linux, or Windows with WSL
# Install Java (if not already installed)
# macOS:
brew install openjdk@11
# Add Java to your shell profile (~/.zshrc or ~/.bash_profile)
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc
source ~/.zshrc
# Create conda environment from YAML file
conda env create -f environment.yml
# Activate the environment
conda activate graphframes-demo
# Verify Java is accessible
java -version

# Run the setup script
./setup_environment.sh
# Activate the virtual environment
source venv/bin/activate

# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

Generate the synthetic graph datasets:
# Ensure virtual environment is activated
source venv/bin/activate
# Generate all datasets
python -c "from src.data_generator import generate_all_datasets; generate_all_datasets()"

This creates three datasets (a loading sketch follows the list):
- Social Network: 800 users, ~2500 friendships
- Citation Network: 1000 papers, ~3500 citations
- Transportation Network: 250 stations, ~500 routes
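As a quick sanity check, the generated files can be loaded straight into Spark DataFrames. The sketch below assumes the generator writes CSV files with headers under each dataset directory; the file names nodes.csv and edges.csv are hypothetical, so adjust them to whatever src/data_generator.py actually emits:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("graphframes-demo")
         .getOrCreate())

# Hypothetical file names -- check data/social_network/ for the actual layout
users = spark.read.csv("data/social_network/nodes.csv",
                       header=True, inferSchema=True)
friendships = spark.read.csv("data/social_network/edges.csv",
                             header=True, inferSchema=True)

users.printSchema()
friendships.show(5)
```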
Start Jupyter Notebook:
jupyter notebook notebooks/comprehensive_demo.ipynb

The notebook includes (a starter GraphFrame snippet follows the list):
- PySpark Basics - DataFrame operations, SQL queries, joins
- GraphFrames Introduction - Graph creation and basic queries
- Social Network Analysis - Community detection and influence
- Citation Network Analysis - Academic impact and research clusters
- Transportation Network Analysis - Route optimization and hub identification
- Advanced Topics - Motif finding, subgraphs, and performance tips
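Before opening the notebook, here is a minimal, self-contained sketch of the GraphFrame basics it covers. GraphFrames requires a vertex DataFrame with an id column and an edge DataFrame with src and dst columns; the toy data and app name below are purely illustrative:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# The graphframes package must be on the classpath (see Troubleshooting below)
spark = (SparkSession.builder
         .appName("graphframes-intro")
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.3-spark3.5-s_2.12")
         .getOrCreate())

# Vertices need an `id` column; edges need `src` and `dst` columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                   # per-vertex in-degree
print(g.edges.filter("relationship = 'friend'").count())
```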
PageRank ranks nodes by importance based on their incoming edges, weighting links from highly ranked nodes more heavily (a usage sketch follows the list).
- Social Network: Find influential users
- Citation Network: Identify seminal papers
- Transportation Network: Find major transit hubs
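A minimal PageRank invocation with GraphFrames looks like the following; the graph g is assumed to be a GraphFrame as constructed above, and the iteration count and reset probability are illustrative defaults rather than the notebook's exact settings:

```python
# Run PageRank for a fixed number of iterations
results = g.pageRank(resetProbability=0.15, maxIter=10)

# Vertices gain a `pagerank` column; higher means more influential
results.vertices.select("id", "pagerank") \
       .orderBy("pagerank", ascending=False) \
       .show(10)
```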
Connected Components identifies the disconnected subgraphs in a network (sketch after the list).
- Social Network: Discover isolated communities
- Citation Network: Find research clusters
- Transportation Network: Check network connectivity
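A minimal connectedComponents() call on an assumed GraphFrame g; note that the algorithm requires a checkpoint directory (see Troubleshooting):

```python
import os
import tempfile

# connectedComponents() requires a checkpoint directory
spark.sparkContext.setCheckpointDir(
    os.path.join(tempfile.gettempdir(), "spark-checkpoint"))

# Each vertex is assigned a `component` id; vertices sharing an id are connected
components = g.connectedComponents()
components.groupBy("component").count() \
          .orderBy("count", ascending=False) \
          .show()
```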
Shortest Paths calculates the shortest paths between nodes; in GraphFrames these are measured in hops from every vertex to a set of landmark vertices (sketch after the list).
- Social Network: Degrees of separation
- Citation Network: Trace idea lineage
- Transportation Network: Route optimization
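A sketch of GraphFrames' shortestPaths on an assumed GraphFrame g, with landmark ids chosen purely for illustration:

```python
# Hop distances from every vertex to the landmark vertices "a" and "c"
paths = g.shortestPaths(landmarks=["a", "c"])

# `distances` is a map from landmark id to the number of hops
paths.select("id", "distances").show(truncate=False)
```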
Triangle Counting counts the triangles each vertex participates in (sketch after the list).
- Social Network: Measure social cohesion
- Citation Network: Cross-citation patterns
- Transportation Network: Route redundancy
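A minimal triangle-count call on an assumed GraphFrame g:

```python
# `count` holds the number of triangles each vertex participates in
triangles = g.triangleCount()
triangles.select("id", "count") \
         .orderBy("count", ascending=False) \
         .show(10)
```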
The social network simulates a social media platform with the following (a motif-finding sketch follows the list):
- Users with names, ages, cities, occupations
- Friendships with relationship types and interaction scores
- Community structure and preferential attachment
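Motif finding (covered in the notebook's advanced section) is a natural fit for this dataset. The sketch below looks for reciprocal friendships on an assumed social-network GraphFrame g; the interaction_score column name is a hypothetical stand-in for whatever attribute the generator actually produces:

```python
# Find pairs of users who are friends in both directions
mutual = g.find("(a)-[e1]->(b); (b)-[e2]->(a)")

# Hypothetical attribute name: keep only strongly interacting pairs
mutual.filter("e1.interaction_score > 0.5") \
      .select("a.id", "b.id") \
      .show()
```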
The citation network models academic paper citations with the following (a temporal sanity check follows the list):
- Papers with titles, authors, years, venues, and fields
- Citations with context (background, methodology, comparison)
- Temporal ordering (papers only cite older papers)
- Influential "hub" papers
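The temporal-ordering property can be verified directly with a couple of joins. In this sketch, papers and citations are assumed DataFrames (vertices and edges, where src cites dst), and the year column name is an assumption about the generated schema:

```python
from pyspark.sql import functions as F

# Attach publication years to both endpoints of each citation (src cites dst)
src_years = papers.select(F.col("id").alias("src"), F.col("year").alias("src_year"))
dst_years = papers.select(F.col("id").alias("dst"), F.col("year").alias("dst_year"))

# A citation is valid only if the cited paper (dst) is strictly older
violations = (citations.join(src_years, "src")
                       .join(dst_years, "dst")
                       .filter(F.col("dst_year") >= F.col("src_year")))
print("temporal violations:", violations.count())  # expected: 0
```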
The transportation network represents a multi-modal transit system with the following (a per-mode filtering sketch follows the list):
- Stations with coordinates and passenger volumes
- Routes with distances, travel times, and frequencies
- Hub-and-spoke topology
- Multiple transportation modes (metro, bus, train)
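Because each route carries a transportation mode, per-mode subgraphs fall out of a plain DataFrame filter. A sketch on an assumed transportation GraphFrame g, where the mode column name is an assumption:

```python
import os
import tempfile
from graphframes import GraphFrame

# connectedComponents() needs a checkpoint directory (see Troubleshooting)
spark.sparkContext.setCheckpointDir(
    os.path.join(tempfile.gettempdir(), "spark-checkpoint"))

# Build a metro-only subgraph by filtering the edge DataFrame
metro_graph = GraphFrame(g.vertices, g.edges.filter("mode = 'metro'"))

# Check whether the metro layer alone is fully connected
n_components = (metro_graph.connectedComponents()
                           .select("component").distinct().count())
print("metro components:", n_components)
```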
Modify dataset parameters in src/data_generator.py:
# Adjust network sizes
generate_social_network(n_users=500, n_friendships=1500)
generate_citation_network(n_papers=800, n_citations=2500)
generate_transportation_network(n_stations=200, n_routes=400)

All generators use seed=42 for reproducibility. Change the seed for different networks.
If you see "Unable to locate a Java Runtime" or "JAVA_GATEWAY_EXITED":
macOS:
# Install Java
brew install openjdk@11
# Set environment variables
export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
# Add to shell profile for persistence
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc
# Verify installation
java -version

Linux:
# Install Java
sudo apt install openjdk-11-jdk
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc

Important: After setting Java environment variables, restart your terminal or Jupyter server for the changes to take effect.
If you see "Checkpoint directory is not set" when using connectedComponents():
# Add this after creating SparkSession
import tempfile
import os
checkpoint_dir = os.path.join(tempfile.gettempdir(), "spark-checkpoint")
spark.sparkContext.setCheckpointDir(checkpoint_dir)

If you see "java.lang.ClassNotFoundException: org.graphframes":
# In your notebook, ensure this configuration:
spark = SparkSession.builder \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12") \
    .getOrCreate()

Increase Spark memory allocation:
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()

Ensure PySpark uses the correct Python:
export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

This project is for educational purposes. Feel free to use and modify it for learning.
Suggestions and improvements are welcome! This is a learning resource designed to help others understand graph analytics with PySpark.