Llama 4 Inference Guide

A comprehensive guide to setting up and running Llama 4 inference on high-performance hardware.

Prerequisites

  • Hardware Requirements:
    • 8x NVIDIA H100 GPUs (minimum 5 GPUs for small context lengths; see the quick check below)
    • Alternative: NVIDIA A100 GPUs
    • Note: With INT4 quantization, fewer GPUs may be sufficient
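
A quick way to confirm the node exposes the expected GPUs before installing anything:

# List each GPU with its total memory
nvidia-smi --query-gpu=index,name,memory.total --format=csv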

Installation Steps

1. Install Miniconda

If you haven't already, update the system packages:

apt update && apt upgrade -y

# Download the Miniconda installer
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Run the installer
bash Miniconda3-latest-Linux-x86_64.sh
# Follow the prompts:
# - Accept license: yes
# - Installation location: [press ENTER for default]
# - Initialize Miniconda: yes

# Restart the terminal afterwards to apply the changes

2. Create Project Directory and Clone Repository

# Create and navigate to project directory
mkdir llama4
cd llama4

# Clone the repository
git clone https://github.com/AmarUCLA/Llama4Inference.git
cd Llama4Inference/

3. Set Up Python Environment

# Create a new conda environment with Python 3.12
conda create -n llamainference python=3.12

# Activate the environment
conda activate llamainference

# Install required packages
pip install -r requirements.txt

4. Authenticate with Hugging Face

# Log in to Hugging Face to access model files
huggingface-cli login
# Enter your Hugging Face token when prompted
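
If you prefer a non-interactive login (for example, inside a provisioning script), the CLI also accepts the token as a flag; HF_TOKEN here stands for whatever variable holds your token:

# Non-interactive alternative: pass the token directly
huggingface-cli login --token "$HF_TOKEN"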

Running Inference

Option 1: Serving with Web Interface

# Step 1: Start the model server in a tmux session
tmux new-session -s llama4-server
conda activate llamainference
bash serve.sh
# Detach from tmux session with Ctrl+b then d

# Step 2: Start the Streamlit interface in another tmux session
tmux new-session -s llama4-ui
conda activate llamainference  # Make sure to activate environment again
streamlit run streamlit_chat.py --server.port 8000 --server.address 0.0.0.0
# Detach from tmux session with Ctrl+b then d

Access the web interface at http://[NODE_IP]:8000, replacing [NODE_IP] with the node's IP address.
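
If serve.sh exposes an OpenAI-compatible endpoint (as servers like vLLM do), you can also query the model directly, bypassing the UI. The port and model name below are assumptions; match them to whatever serve.sh actually launches:

# Assumed OpenAI-compatible endpoint; adjust port and model to match serve.sh
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Hello, Llama 4!"}],
        "max_tokens": 128
      }'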

Option 2: Batch Inference

For processing multiple inputs without the web interface:

# Activate environment if not already active
conda activate llamainference

# Run batch inference script
python batch_inference.py
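
The script's inputs and outputs are defined in batch_inference.py itself. As an illustrative alternative for ad-hoc batch runs against an already-running server, a loop like the one below works; prompts.txt, the port, and the model name are placeholders, not part of the repository (jq is used for safe JSON quoting):

# Illustrative sketch: send one request per line of prompts.txt to the server
while IFS= read -r prompt; do
  curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg p "$prompt" \
        '{model: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8", prompt: $p, max_tokens: 128}')"
  echo
done < prompts.txt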

Additional Information

Model Configuration

The model server (serve.sh) uses the following default configuration (an illustrative launch command is sketched below):

  • Model: Llama 4 Maverick (17B active / 400B total parameters, 128 experts)
  • Context length: 430K tokens
  • Precision: FP8
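
For reference, a vLLM-style launch matching these defaults might look like the following sketch; this is an illustration, not the actual contents of serve.sh, and the port is an assumption:

# Hypothetical launch matching the defaults above (not the real serve.sh)
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000 \
  --port 8001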

Monitoring and Management

To view or reattach to running tmux sessions:

# List sessions
tmux ls

# Reattach to a session
tmux attach-session -t llama4-server
# or
tmux attach-session -t llama4-ui
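
To stop the server or UI, kill its tmux session:

# Terminate a running session
tmux kill-session -t llama4-server
# or
tmux kill-session -t llama4-ui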

Troubleshooting

  • Out of memory errors: reduce the batch size or use more aggressive quantization (e.g., INT4)
  • Slow inference: check GPU utilization (see below) and network bandwidth
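
To watch GPU utilization live while the server handles requests:

# Refresh nvidia-smi output every second
watch -n 1 nvidia-smi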
