A comprehensive research toolkit for analyzing, profiling, and pruning Mixture-of-Experts (MoE) models to improve their efficiency and understand their behavior. This project focuses on the Qwen1.5-MoE-A2.7B model and provides tools for expert activation analysis, router behavior profiling, correlation analysis, and model pruning with performance evaluation.
Contents:
- Features
- Installation
- Quick Start
- Project Structure
- Core Components
- Usage Guide
- Analysis Notebooks
- Configuration
- Evaluation Tasks

Features:
- Expert Activation Analysis: Monitor and analyze expert activation patterns across layers
- Router Behavior Profiling: Collect and analyze router logits to understand routing decisions
- Correlation Analysis: Compute correlations between router activations and expert usage across different task categories
- Statistical Analysis: Compute comprehensive statistics including mean, variance, frequency, and probability distributions
- Multiple Pruning Strategies: Support for masking and zeroing-based expert pruning
- Flexible Expert Selection: Prune least-used or most-used experts based on various criteria
- Performance Evaluation: Comprehensive evaluation framework using LM-Eval with support for multiple benchmarks
- Advanced Plotting: Generate correlation plots, usage matrices, and statistical visualizations
- Data Processing: Efficient text packing and dataset handling for large-scale analysis
- MMLU Category Analysis: Specialized tools for analyzing model behavior across MMLU categories

Prerequisites:
- Python: 3.8 or higher
- CUDA: Compatible GPU with CUDA support (recommended)
- PyTorch: With CUDA support
Install the required packages:
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install core dependencies
pip install transformers datasets lm-eval matplotlib seaborn tqdm numpy

Evaluate the model on standard benchmarks:
python scripts/evaluation.py --tasks mmlu --batch_size 8 --limit 100 --device cuda --model_name Qwen/Qwen1.5-MoE-A2.7B

Profile the model on MMLU prompts and generate expert pruning metadata:
python scripts/profile_and_prune.py \
    --model_name Qwen/Qwen1.5-MoE-A2.7B \
    --mmlu_topic stem \
    --sample_size 5 \
    --output_file outputs/statistics/expert_stats.json \
    --device cuda

Evaluate a pruned model using pre-computed expert rankings:
python scripts/evaluation.py \
    --model_name Qwen/Qwen1.5-MoE-A2.7B \
    --tasks mmlu \
    --batch_size 8 \
    --limit 100 \
    --use_pruned_model \
    --pruned_metadata outputs/statistics/expert_stats.json \
    --pruning_method zero \
    --k 20 \
    --device cuda \
    --output_file outputs/evaluation_results/pruned_results.json

Run correlation analysis across MMLU categories:
python run_mmlu_categories_correlation.py

For convenience, use the provided shell script:
bash scripts/run_evaluation.sh

efficient_moe/
├── README.md                            # This file
├── run_mmlu_categories_correlation.py   # MMLU category correlation analysis
├── analyze_expert_dynamics.ipynb        # Expert activation dynamics
├── analyze_routing_statistics.ipynb     # Router behavior analysis
│
├── scripts/                             # Main executable scripts
│   ├── evaluation.py                    # Model evaluation script
│   ├── profile_and_prune.py             # Model profiling and pruning
│   └── run_evaluation.sh                # Evaluation runner script
│
├── utils/                               # Utility modules
│   ├── __init__.py
│   ├── analysis_utils.py                # Statistical analysis functions
│   ├── common_utils.py                  # Common helper functions
│   ├── data_utils.py                    # Dataset processing utilities
│   ├── hook_utils.py                    # Expert activation hook management
│   ├── model_utils.py                   # Model pruning utilities
│   ├── router_utils.py                  # Router logits collection
│   └── visualization_utils.py           # Plotting and visualization
│
└── outputs/                             # Generated outputs
    ├── evaluation_results/              # Evaluation results JSON files
    ├── plots/                           # Generated plots and visualizations
    ├── prune_experts/                   # Pre-computed expert rankings
    └── statistics/                      # Statistical analysis outputs

scripts/evaluation.py is the main script for evaluating models with or without pruning.
Usage:
python scripts/evaluation.py [OPTIONS]

Key Options:
- `--model_name`: Name or path of the model to evaluate (default: `Qwen/Qwen1.5-MoE-A2.7B`)
- `--tasks`: List of evaluation tasks (e.g., `mmlu`, `gsm8k`, `wikitext`)
- `--batch_size`: Batch size for evaluation (default: 8)
- `--limit`: Limit number of examples for quick testing
- `--use_pruned_model`: Enable pruned model evaluation
- `--pruned_metadata`: Path to pruned expert metadata JSON file
- `--k`: Maximum number of experts to prune per layer (default: 20)
- `--pruning_method`: Pruning method, `mask` or `zero` (default: `zero`)
- `--device`: Device for model, `cuda` or `cpu` (default: `cuda`)
- `--output_file`: File path to save evaluation results JSON

Example:
python scripts/evaluation.py \
    --model_name Qwen/Qwen1.5-MoE-A2.7B \
    --tasks mmlu gsm8k \
    --batch_size 16 \
    --use_pruned_model \
    --pruned_metadata outputs/statistics/experts_to_prune.json \
    --pruning_method zero \
    --k 15 \
    --output_file results.json

scripts/profile_and_prune.py profiles the model on various datasets and determines which experts to prune based on activation statistics.
Usage:
python scripts/profile_and_prune.py [OPTIONS]

Key Options:
- `--model_name`: Name or path of the model to profile & prune (default: `Qwen/Qwen1.5-MoE-A2.7B`)
- `--prompts_file`: Path to JSON file with prompt strings
- `--mmlu_topic`: MMLU topic category (`humanities`, `stem`, `social_sciences`, `other`)
- `--gsm8k`: Use GSM8K dataset for prompts
- `--sample_size`: Maximum samples per MMLU subject (default: 5)
- `--output_file`: Output file path for expert statistics
- `--device`: Device for model (default: `cuda`)

Example:
python scripts/profile_and_prune.py \
    --model_name Qwen/Qwen1.5-MoE-A2.7B \
    --mmlu_topic stem \
    --sample_size 10 \
    --output_file outputs/statistics/stem_experts.json \
    --device cuda

run_mmlu_categories_correlation.py analyzes correlations between router activations and expert usage across MMLU categories.
Usage:
python run_mmlu_categories_correlation.py

Output:
- Generates correlation plots in `outputs/plots/`:
  - `mmlu_router_activation_pearson_all_categories.png`
  - `mmlu_router_activation_spearman_all_categories.png`
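
To make the two coefficients concrete, here is a minimal pure-Python sketch of Pearson and Spearman correlation for a single layer. It is not the code in run_mmlu_categories_correlation.py, whose internals this README does not show. The helper names (`pearson`, `average_ranks`, `spearman`) and the per-expert numbers are hypothetical placeholders chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for five experts in one layer (not measured data).
mean_router_prob = [0.05, 0.10, 0.20, 0.40, 0.25]
usage_count = [1, 3, 9, 50, 12]

r_pearson = pearson(mean_router_prob, usage_count)
r_spearman = spearman(mean_router_prob, usage_count)
print(f"pearson={r_pearson:.3f} spearman={r_spearman:.3f}")
```

In this toy input the two lists have the same ordering but are not linearly related, so the Spearman coefficient is higher than the Pearson coefficient. That difference is the reason to look at both plots.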

`utils/analysis_utils.py`:
- Statistical computation functions
- Correlation analysis (Pearson, Spearman)
- Expert ranking and selection utilities

`utils/model_utils.py`:
- Model pruning functions (`apply_pruning`)
- Expert masking and zeroing implementations

`utils/router_utils.py`:
- Router logit collection
- Routing pattern analysis

`utils/hook_utils.py`:
- Expert activation hook management
- Forward hook registration and data collection

`utils/data_utils.py`:
- Dataset preparation utilities
- MMLU and GSM8K prompt preparation
- Text packing and dataloader creation

`utils/visualization_utils.py`:
- Plotting functions for matrices, bar charts, correlations
- Visualization utilities for analysis results

`utils/common_utils.py`:
- Common helper functions
- Expert metadata loading and processing
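
Text packing, mentioned under the data utilities above, generally means concatenating tokenized documents into one stream and cutting it into fixed-length blocks so that no padding is wasted. The sketch below shows that idea on plain lists of token IDs. It is an assumption about the general technique, not the implementation in utils/data_utils.py. The function name `pack_token_sequences`, the EOS separator, and the choice to drop the trailing remainder are all hypothetical.

```python
def pack_token_sequences(sequences, max_length, eos_id):
    """Concatenate token-id lists (each followed by EOS) and cut fixed-size blocks.

    Tokens left over after the last full block are dropped.
    """
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # mark the document boundary
    n_blocks = len(stream) // max_length
    return [stream[i * max_length:(i + 1) * max_length] for i in range(n_blocks)]


# Three toy "documents" already converted to token IDs.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_token_sequences(docs, max_length=4, eos_id=0)
print(blocks)
```

Every block has exactly `max_length` tokens, which keeps batch shapes uniform when profiling expert activations over a large corpus.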

Masking (`--pruning_method mask`):
- Sets router logits to -∞ for pruned experts
- Prevents tokens from being routed to pruned experts
- More aggressive pruning approach
- Completely removes pruned experts from routing decisions

Zeroing (`--pruning_method zero`):
- Zeros out outputs from pruned experts
- Tokens may still be routed to pruned experts, but their outputs are nullified
- Gentler pruning approach
- Preserves routing structure while nullifying expert contributions

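
The difference between the two methods can be seen on a toy top-k router for a single token. This is a simplified sketch in plain Python, not the `apply_pruning` implementation. The `route` function is hypothetical, it does no top-k renormalization, and it models a zeroed expert output as a zero contribution weight. It also assumes at least one expert remains unpruned.

```python
import math


def softmax(logits):
    """Numerically stable softmax; a logit of -inf yields probability 0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, top_k, pruned=(), mode=None):
    """Return {expert_id: contribution weight} for one token."""
    if mode == "mask":
        # Masking: pruned experts get -inf logits, so they can never be selected.
        logits = [-math.inf if i in pruned else v for i, v in enumerate(logits)]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = {i: probs[i] for i in chosen}
    if mode == "zero":
        # Zeroing: routing is unchanged, but a pruned expert contributes nothing.
        weights = {i: (0.0 if i in pruned else w) for i, w in weights.items()}
    return weights


logits = [2.0, 1.0, 0.5, 0.1]  # toy router logits for four experts
base = route(logits, top_k=2)
masked = route(logits, top_k=2, pruned={0}, mode="mask")
zeroed = route(logits, top_k=2, pruned={0}, mode="zero")

print("baseline experts:", sorted(base))
print("masked experts:  ", sorted(masked))
print("zeroed experts:  ", sorted(zeroed), "weight of expert 0:", zeroed[0])
```

With masking, the token is re-routed to the next-best expert and the remaining probabilities are redistributed. With zeroing, the token keeps its original routing slot and simply loses the pruned expert's contribution.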
# Step 1: Profile the model
python scripts/profile_and_prune.py \
    --mmlu_topic stem \
    --sample_size 10 \
    --output_file outputs/statistics/stem_profile.json
# Step 2: Evaluate pruned model
python scripts/evaluation.py \
    --tasks mmlu \
    --use_pruned_model \
    --pruned_metadata outputs/statistics/stem_profile.json \
    --pruning_method zero \
    --k 20 \
    --output_file outputs/evaluation_results/pruned_mmlu.json

from utils.model_utils import apply_pruning
from utils.common_utils import get_experts_to_prune_from_json
from transformers import AutoModelForCausalLM
# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")
# Load expert pruning metadata
experts_to_prune = get_experts_to_prune_from_json(
    path="outputs/statistics/experts_to_prune.json",
    k=20
)
# Apply pruning
apply_pruning(model, experts_to_prune, mode="zero")

from transformers import AutoTokenizer
from utils.data_utils import prepare_mmlu_prompts, create_packed_dataloader
# Load the tokenizer required by create_packed_dataloader
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")
# Prepare MMLU prompts
prompts = prepare_mmlu_prompts(
    topic="stem",
    max_samples_per_subject=5
)
# Create packed dataloader
loader = create_packed_dataloader(
    tokenizer=tokenizer,
    dataset_name="brando/small-c4-dataset",
    split="train",
    sample_size=512,
    max_length=512
)

from utils.visualization_utils import plot_matrix, plot_bar
# expert_usage_matrix and expert_frequencies are assumed to have been
# computed beforehand, for example from collected routing statistics
# Plot expert usage matrix
plot_matrix(
    expert_usage_matrix,
    title="Expert Usage Patterns",
    xlabel="Expert ID",
    ylabel="Layer"
)
# Plot expert frequencies
plot_bar(
    expert_frequencies,
    title="Expert Usage Frequency",
    xlabel="Expert ID",
    ylabel="Frequency"
)

The project includes Jupyter notebooks for interactive analysis:
- `analyze_expert_dynamics.ipynb`:
  - Expert activation dynamics and patterns
  - Temporal analysis of expert usage
  - Activation monitoring across layers
- `analyze_routing_statistics.ipynb`:
  - Router behavior analysis and statistics
  - Routing pattern visualization
  - Expert ranking by various criteria

These notebooks provide:
- Interactive exploration of expert behavior
- Custom analysis workflows
- Visualization of routing patterns
- Performance impact assessment
The project is configured for the Qwen1.5-MoE-A2.7B model by default. To use a different model:
- Update the `model_name` variable in scripts or notebooks
- Ensure the model has MoE layers with the expected structure
- Adjust model loading parameters as needed
Expert pruning can be configured through:
- Pre-computed rankings: Use existing JSON files in `outputs/statistics/` or `outputs/prune_experts/`
- Custom rankings: Generate your own expert rankings using `profile_and_prune.py`
- Pruning parameters:
  - `k`: Number of experts to prune per layer
  - `pruning_method`: `mask` or `zero`
  - Selection criteria: Based on mean, variance, frequency, or probability
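
As an illustration of frequency-based selection with the `k` parameter, the sketch below picks the `k` least-used (or most-used) experts in each layer from per-expert usage counts. It is a hedged example only. The function `select_experts_to_prune`, the dictionary layout, and the counts are hypothetical, and the actual metadata format produced by profile_and_prune.py is not documented here.

```python
def select_experts_to_prune(usage_by_layer, k, least_used=True):
    """Map each layer index to the sorted IDs of the k experts chosen for pruning.

    usage_by_layer: {layer_idx: [usage count of expert 0, expert 1, ...]}
    """
    selection = {}
    for layer, counts in usage_by_layer.items():
        # Rank expert IDs by usage: ascending for least-used, descending otherwise.
        ranked = sorted(range(len(counts)), key=lambda e: counts[e],
                        reverse=not least_used)
        selection[layer] = sorted(ranked[:k])
    return selection


# Placeholder usage counts for two layers with four experts each.
usage = {0: [120, 3, 45, 0], 1: [10, 80, 7, 60]}

least = select_experts_to_prune(usage, k=2)
most = select_experts_to_prune(usage, k=2, least_used=False)
print("least used:", least)
print("most used: ", most)
```

The same pattern applies to the other criteria listed above: replace the usage counts with per-expert mean, variance, or routing probability and keep the ranking step unchanged.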
The project uses the following output structure:
- `outputs/evaluation_results/`: Evaluation result JSON files
- `outputs/plots/`: Generated plots and visualizations
- `outputs/statistics/`: Statistical analysis outputs and expert rankings
- `outputs/prune_experts/`: Pre-computed expert pruning metadata

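
Once baseline and pruned evaluations have been saved, a natural next step is to compare them task by task. The sketch below assumes each run has already been reduced to a flat `{task: accuracy}` dictionary. That shape is a hypothetical simplification, because the exact JSON layout written by evaluation.py is not described in this README. The numbers are placeholders, not measured results.

```python
def accuracy_deltas(baseline, pruned):
    """Per-task change (pruned minus baseline) for tasks present in both runs."""
    return {task: pruned[task] - baseline[task]
            for task in baseline if task in pruned}


# Placeholder accuracies, chosen only to make the arithmetic easy to follow.
baseline_acc = {"mmlu": 0.50, "gsm8k": 0.75}
pruned_acc = {"mmlu": 0.25}

deltas = accuracy_deltas(baseline_acc, pruned_acc)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f}")
```

Tasks that appear in only one of the two runs are skipped, so a partial pruned evaluation can still be compared against a full baseline.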
The evaluation script supports various tasks from the LM-Eval framework:
- `mmlu`: Massive Multitask Language Understanding
- `gsm8k`: Grade school math problems
- `wikitext`: Wikipedia text perplexity
- `hellaswag`: Commonsense reasoning
- `arc`: AI2 reasoning challenge
- `winogrande`: Common sense reasoning
- `truthfulqa`: Truthful question answering

# Single task
python scripts/evaluation.py --tasks mmlu
# Multiple tasks
python scripts/evaluation.py --tasks mmlu gsm8k hellaswag arc

Note: This project is designed for research purposes. Ensure you have appropriate computational resources (GPU) for running evaluations and analyses on MoE models.