
Comprehensive PySpark and GraphFrames demonstration with synthetic datasets and interactive tutorials


jdesanto/graphframes

PySpark and GraphFrames Demonstration

A comprehensive, educational demonstration of Apache Spark, PySpark DataFrames, and GraphFrames for graph analytics. This project runs entirely on a local Spark instance and applies four graph algorithms to three synthetic graph datasets.

Features

  • Local Spark Setup: No cluster required - runs on your laptop
  • Three Graph Types: Social network, citation network, and transportation network
  • Four Graph Algorithms: PageRank, Connected Components, Shortest Paths, and Triangle Counting
  • Synthetic Data: Realistic, reproducible datasets for learning
  • Interactive Notebook: Step-by-step Jupyter notebook with explanations and visualizations

Project Structure

graphframes/
├── README.md                              # This file
├── requirements.txt                       # Python dependencies
├── setup_environment.sh                   # Automated setup script
├── data/                                  # Generated datasets
│   ├── social_network/
│   ├── citation_network/
│   └── transportation_network/
├── src/                                   # Python modules
│   ├── data_generator.py                 # Data synthesis
│   ├── graph_utils.py                    # Graph utilities
│   └── visualization.py                  # Plotting functions
└── notebooks/
    └── comprehensive_demo.ipynb          # Main demo notebook

Prerequisites

  • Java 11 or 17 (required for PySpark/Spark)
    • macOS: brew install openjdk@11
    • Ubuntu/Debian: sudo apt install openjdk-11-jdk
    • Windows: Download from Adoptium
  • Python 3.8 or higher (tested with Python 3.12.2)
  • 4GB+ RAM recommended
  • macOS, Linux, or Windows with WSL

Installation

Option 1: Conda Environment (Recommended)

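# Install Java (if not already installed)
# macOS:
brew install openjdk@11

# Homebrew's openjdk@11 is keg-only; link it so /usr/libexec/java_home can find it
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk

# Add Java to your shell profile (~/.zshrc or ~/.bash_profile)
# (/opt/homebrew is the Apple Silicon prefix; on Intel Macs use /usr/local)
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc
source ~/.zshrc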

# Create conda environment from YAML file
conda env create -f environment.yml

# Activate the environment
conda activate graphframes-demo

# Verify Java is accessible
java -version

Option 2: Automated Setup with venv

# Run the setup script
./setup_environment.sh

# Activate the virtual environment
source venv/bin/activate

Option 3: Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

Generating Datasets

Generate the synthetic graph datasets:

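# Ensure your environment is activated
# (venv shown here; conda users run: conda activate graphframes-demo)
source venv/bin/activate

# Generate all datasets (run from the repository root so that src is importable)
python -c "from src.data_generator import generate_all_datasets; generate_all_datasets()"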

This creates three datasets:

  • Social Network: 800 users, ~2500 friendships
  • Citation Network: 1000 papers, ~3500 citations
  • Transportation Network: 250 stations, ~500 routes

Running the Demo

Start Jupyter Notebook:

jupyter notebook notebooks/comprehensive_demo.ipynb

The notebook includes:

  1. PySpark Basics - DataFrame operations, SQL queries, joins
  2. GraphFrames Introduction - Graph creation and basic queries
  3. Social Network Analysis - Community detection and influence
  4. Citation Network Analysis - Academic impact and research clusters
  5. Transportation Network Analysis - Route optimization and hub identification
  6. Advanced Topics - Motif finding, subgraphs, and performance tips

Graph Algorithms Overview

PageRank

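Ranks nodes by importance: a node scores higher when it has more incoming edges and when those edges come from nodes that are themselves highly ranked.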

  • Social Network: Find influential users
  • Citation Network: Identify seminal papers
  • Transportation Network: Find major transit hubs
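To make the idea concrete, here is a minimal plain-Python sketch of what PageRank computes. It is not the GraphFrames implementation: the `pagerank` function, the damping value of 0.85, and the toy edge list are all assumptions chosen for illustration.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each outgoing edge.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: three users point at "dana", who points back at "alice".
toy_edges = [("alice", "dana"), ("bob", "dana"), ("carol", "dana"), ("dana", "alice")]
ranks = pagerank(toy_edges)
top_user = max(ranks, key=ranks.get)
print(top_user, round(ranks[top_user], 3))
```

In GraphFrames the corresponding call is `g.pageRank(resetProbability=0.15, maxIter=10)`, where `resetProbability` plays the role of `1 - damping`. The result is a GraphFrame whose vertices carry a `pagerank` column; its scores may be scaled differently from the sketch above, which normalizes ranks to sum to 1.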

Connected Components

Identifies the disconnected subgraphs in the network: groups of vertices that can reach one another but share no edges with any other group.

  • Social Network: Discover isolated communities
  • Citation Network: Find research clusters
  • Transportation Network: Check network connectivity
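As an illustration of what a component label means, the following plain-Python sketch uses union-find to assign every vertex the smallest vertex id in its component. The function name and toy data are hypothetical, and this is not how GraphFrames computes components internally.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    parent = {vertex: vertex for vertex in vertices}

    def find(vertex):
        # Walk up to the root, shortening the path along the way.
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]
            vertex = parent[vertex]
        return vertex

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {vertex: find(vertex) for vertex in vertices}


# Hypothetical toy graph: vertex 6 has no edges at all.
toy_vertices = [1, 2, 3, 4, 5, 6]
toy_edges = [(1, 2), (2, 3), (4, 5)]
labels = connected_components(toy_vertices, toy_edges)
component_count = len(set(labels.values()))
print(labels, component_count)
```

In GraphFrames, `g.connectedComponents()` returns the vertices with an added `component` column that serves the same purpose as `labels` here. It requires a Spark checkpoint directory, as described under Troubleshooting.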

Shortest Paths

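Calculates shortest path lengths between nodes. The built-in GraphFrames shortestPaths algorithm counts hops from every vertex to a chosen set of landmark vertices and does not take edge weights into account.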

  • Social Network: Degrees of separation
  • Citation Network: Trace idea lineage
  • Transportation Network: Route optimization
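The hop-count idea can be sketched in plain Python with a breadth-first search that runs backwards from a single landmark. The function name and toy routes below are assumptions for illustration only.

```python
from collections import defaultdict, deque


def hops_to_landmark(edges, landmark):
    """Return {vertex: hop count to landmark} along directed (src, dst) edges."""
    # Search backwards from the landmark, so index edges by destination.
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(src)

    distance = {landmark: 0}
    queue = deque([landmark])
    while queue:
        current = queue.popleft()
        for previous in incoming[current]:
            if previous not in distance:
                distance[previous] = distance[current] + 1
                queue.append(previous)
    # Vertices that cannot reach the landmark are absent from the result.
    return distance


# Hypothetical toy graph of directed links.
toy_links = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "A")]
hops = hops_to_landmark(toy_links, "D")
print(hops)
```

In GraphFrames, `g.shortestPaths(landmarks=["D"])` returns the vertices with a `distances` map column holding the same kind of hop counts for each landmark.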

Triangle Counting

Counts the triangles (sets of three mutually connected vertices) that each vertex participates in.

  • Social Network: Measure social cohesion
  • Citation Network: Cross-citation patterns
  • Transportation Network: Route redundancy
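A minimal plain-Python sketch of per-vertex triangle counting follows. It treats edges as undirected and checks, for every pair of a vertex's neighbors, whether the two are also connected. The function name and toy friendships are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def triangle_counts(edges):
    """Return {vertex: number of triangles it belongs to}, ignoring direction."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # self-loops cannot form triangles
            neighbors[a].add(b)
            neighbors[b].add(a)

    counts = {vertex: 0 for vertex in neighbors}
    for vertex, adjacent in neighbors.items():
        # A triangle exists when two neighbors of the vertex are linked too.
        for first, second in combinations(sorted(adjacent), 2):
            if second in neighbors[first]:
                counts[vertex] += 1
    return counts


# Hypothetical toy graph: a, b, c form one triangle; d hangs off c.
toy_friendships = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
counts = triangle_counts(toy_friendships)
print(counts)
```

In GraphFrames, `g.triangleCount()` returns the vertices with a `count` column holding this per-vertex number.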

Datasets Description

Social Network

Simulates a social media platform with:

  • Users with names, ages, cities, occupations
  • Friendships with relationship types and interaction scores
  • Community structure and preferential attachment
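Preferential attachment means that new users are more likely to befriend users who already have many friends. The sketch below shows one simple way to generate such edges reproducibly. It is not the code in src/data_generator.py; the function name, parameters, and sampling scheme are assumptions.

```python
import random


def preferential_attachment_edges(n_users, links_per_user=2, seed=42):
    """Generate (new_user, existing_user) edges favoring well-connected users."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    edges = [(0, 1)]
    # Each user appears here once per edge they belong to, so a uniform
    # draw from this list is proportional to a user's current degree.
    endpoint_pool = [0, 1]
    for new_user in range(2, n_users):
        wanted = min(links_per_user, new_user)
        targets = set()
        while len(targets) < wanted:
            targets.add(rng.choice(endpoint_pool))
        for target in sorted(targets):
            edges.append((new_user, target))
            endpoint_pool.extend([new_user, target])
    return edges


friendships = preferential_attachment_edges(n_users=50)
print(len(friendships), friendships[:5])
```

Calling the function again with the same seed returns an identical edge list, which is the property that makes a fixed seed useful for tutorials.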

Citation Network

Models academic paper citations with:

  • Papers with titles, authors, years, venues, and fields
  • Citations with context (background, methodology, comparison)
  • Temporal ordering (papers only cite older papers)
  • Influential "hub" papers
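The temporal-ordering rule can be sketched as follows: a citation is only generated when the cited paper has a strictly earlier year than the citing paper. This is an illustrative sketch, not the project's generator; the function name, the input shape, and the toy years are assumptions.

```python
import random


def generate_citations(paper_years, n_citations, seed=42):
    """Return sorted (citing, cited) pairs where cited is strictly older."""
    rng = random.Random(seed)
    papers = sorted(paper_years)
    # For each paper, list the papers it is allowed to cite.
    older = {
        paper: [other for other in papers if paper_years[other] < paper_years[paper]]
        for paper in papers
    }
    citing_candidates = [paper for paper in papers if older[paper]]
    # Never ask for more distinct citations than the rule allows.
    target = min(n_citations, sum(len(options) for options in older.values()))

    citations = set()
    while len(citations) < target:
        citing = rng.choice(citing_candidates)
        citations.add((citing, rng.choice(older[citing])))
    return sorted(citations)


# Hypothetical toy corpus: 30 papers spread over the years 2000 to 2009.
toy_years = {f"p{index:02d}": 2000 + index % 10 for index in range(30)}
citations = generate_citations(toy_years, n_citations=40)
print(len(citations), citations[:3])
```

One consequence of this rule is that the citation graph contains no cycles, since every edge points from a later year to an earlier one.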

Transportation Network

Represents a multi-modal transit system with:

  • Stations with coordinates and passenger volumes
  • Routes with distances, travel times, and frequencies
  • Hub-and-spoke topology
  • Multiple transportation modes (metro, bus, train)
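Because routes carry travel times, route optimization on this dataset usually means a weighted search rather than a hop count. The sketch below uses Dijkstra's algorithm in plain Python; it is a swapped-in technique for illustration, not a GraphFrames built-in and not necessarily what the notebook does. The function name, station names, and the assumption that routes run in both directions are hypothetical.

```python
import heapq
from collections import defaultdict


def fastest_travel_times(routes, origin):
    """Dijkstra over (src, dst, minutes) routes; returns minutes from origin."""
    adjacency = defaultdict(list)
    for src, dst, minutes in routes:
        adjacency[src].append((dst, minutes))
        adjacency[dst].append((src, minutes))  # assumption: routes run both ways

    best = {origin: 0}
    heap = [(0, origin)]
    while heap:
        elapsed, station = heapq.heappop(heap)
        if elapsed > best.get(station, float("inf")):
            continue  # stale queue entry, a faster path was already found
        for neighbor, minutes in adjacency[station]:
            candidate = elapsed + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return best


# Hypothetical toy network with travel times in minutes.
toy_routes = [
    ("Central", "North", 10),
    ("Central", "East", 4),
    ("East", "North", 3),
    ("North", "Airport", 12),
]
times = fastest_travel_times(toy_routes, "Central")
print(times)
```

Here the fastest way from Central to North goes through East (4 + 3 minutes) rather than along the direct 10-minute route, which a pure hop count would not reveal.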

Customization

Modify dataset parameters in src/data_generator.py:

# Adjust network sizes
generate_social_network(n_users=500, n_friendships=1500)
generate_citation_network(n_papers=800, n_citations=2500)
generate_transportation_network(n_stations=200, n_routes=400)

All generators use seed=42 for reproducibility. Change the seed for different networks.

Troubleshooting

Java Runtime Not Found

If you see "Unable to locate a Java Runtime" or "JAVA_GATEWAY_EXITED":

macOS:

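# Install Java
brew install openjdk@11

# Homebrew's openjdk@11 is keg-only; link it so /usr/libexec/java_home can find it
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk

# Set environment variables
# (/opt/homebrew is the Apple Silicon prefix; on Intel Macs use /usr/local)
export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"

# Add to shell profile for persistence
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zshrc

# Verify installation
java -version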

Linux:

# Install Java
sudo apt install openjdk-11-jdk

# Set JAVA_HOME (path shown is for amd64; on ARM machines the directory ends in -arm64)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc

Important: After setting Java environment variables, restart your terminal or Jupyter server for changes to take effect.
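To check what a notebook kernel actually sees, a small diagnostic like the following can help. It is a sketch using only the Python standard library; the function name and the report keys are hypothetical.

```python
import os
import shutil


def java_environment_report(environ=None):
    """Summarize how Java is visible to the current Python process."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    # Look for a java executable on the PATH from the given environment.
    java_on_path = shutil.which("java", path=environ.get("PATH"))
    home_binary = os.path.join(java_home, "bin", "java") if java_home else None
    return {
        "JAVA_HOME": java_home,
        "java_on_path": java_on_path,
        "java_home_has_binary": bool(home_binary and os.path.isfile(home_binary)),
    }


report = java_environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Run it in a notebook cell. If `JAVA_HOME` is `None` there but set in your terminal, the Jupyter server was most likely started before the variables were exported.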

Checkpoint Directory Error

If you see "Checkpoint directory is not set" when using connectedComponents():

# Add this after creating SparkSession
import tempfile
import os
checkpoint_dir = os.path.join(tempfile.gettempdir(), "spark-checkpoint")
spark.sparkContext.setCheckpointDir(checkpoint_dir)

GraphFrames JAR Not Found

If you see "java.lang.ClassNotFoundException: org.graphframes":

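# In your notebook, ensure this configuration is applied when the session is first created.
# spark.jars.packages is ignored if a Spark session is already running, so restart the kernel first.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.3-spark3.5-s_2.12") \
    .getOrCreate()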

Out of Memory Errors

Increase Spark memory allocation:

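from pyspark.sql import SparkSession

# In local mode the executors run inside the driver JVM, so spark.driver.memory is the setting that matters.
# Choose a value that fits in your machine's RAM and restart the kernel so it applies to a fresh JVM.
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()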

Python Version Issues

Ensure PySpark uses the correct Python:

export PYSPARK_PYTHON=$(which python3)
export PYSPARK_DRIVER_PYTHON=$(which python3)

Learning Resources

License

This project is for educational purposes. Feel free to use and modify for learning.

Contributing

Suggestions and improvements are welcome! This is a learning resource designed to help others understand graph analytics with PySpark.
