
Lab 3: Running & Quantizing Llama Models on Android

This lab provides a hands-on walkthrough of running and optimizing compact large language models (LLMs) directly on Android devices using the llama.cpp framework. You'll work through the full pipeline: downloading, converting, quantizing, deploying, and benchmarking Llama-style models. By the end, you'll have an on-device Android application running a quantized LLM entirely offline, with no server dependencies.

Learning Goals

  • Understand the toolchain for working with llama.cpp
  • Learn how model quantization reduces memory and compute requirements
  • Modify an Android app to load and run local quantized models on device
  • Benchmark and compare the runtime performance of different quantized formats on an Android device

⚙️ Why it matters: Running LLMs natively on mobile offers benefits like reduced latency, improved privacy, and offline capability—critical for edge AI applications.


Prerequisites

| Component | Min version | Why you need it | Install hint |
| --- | --- | --- | --- |
| Python | 3.9 | for conversion and quantization scripts | https://python.org |
| Git | any | to clone repositories | sudo apt install git |
| CMake | 3.16 | to build llama.cpp tools | brew install cmake / apt |
| huggingface-cli | latest | to download models (auth required) | pip install --upgrade huggingface_hub |
| Android Studio | Hedgehog or newer | includes NDK & ADB for building and deploying | https://developer.android.com/studio |
| Android device | Android 10+, ≥ 6 GB RAM | target runtime | any modern phone |

Windows users: use WSL 2 with Ubuntu 22.04 for compatibility with build tools.


Step 1: Install Android Studio and Set Up the SDK

Android Studio includes all the tools needed to build and debug Android apps, including ADB (installed automatically as part of the Platform-Tools).

  1. Download and install Android Studio on your computer.
  2. During setup, ensure the following components are selected:
    • Android SDK
    • SDK Platform Tools (includes ADB — no separate install required)
    • NDK
  3. In Settings → Appearance & Behavior → System Settings → Android SDK, verify:
    • SDK path: ~/Android/Sdk/
    • NDK version ≥ 25
    • Platform-Tools version ≥ 34
  4. (Optional) Add the tools directory to your shell PATH so you can run ADB from any terminal:
    export PATH=$PATH:$HOME/Android/Sdk/platform-tools

Step 2: Authenticate with Hugging Face

We use Hugging Face to download a pretrained Llama-style model.

On your computer, run:

pip install --upgrade huggingface_hub
huggingface-cli login
  1. Log in at https://huggingface.co.
  2. Go to Settings → Access Tokens.
  3. Create a token with Read access.
  4. Paste the token in your terminal.

Step 3: Clone & Build llama.cpp

This builds the tools used to convert and quantize models:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON
make

🧪 Why: This creates binaries like llama-quantize and llama-run, which are used to transform and test models.


Step 4: Set Up a Workspace for Models

On your computer, run:

cd ..
mkdir llama-models && cd llama-models

Organizing models in one place keeps your workflow tidy.


Step 5: Download a Pretrained 3B Llama model

On your computer, run:

huggingface-cli download unsloth/Llama-3.2-3B-Instruct --repo-type model --local-dir ./Llama-3.2-3B-Instruct

Tip: Use a 1B model variant for devices with less RAM and storage.


Step 6: Convert Huggingface model to GGUF Format

On your computer, run:

cd ../
python convert_hf_to_gguf.py llama-models/Llama-3.2-3B-Instruct --outfile llama-models/Llama-3.2-3B-Instruct-gguf

Why GGUF? It's the native llama.cpp format that enables fast, memory-mapped loading and is portable across platforms.
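If you want to sanity-check the conversion output, GGUF files begin with the 4-byte magic string `GGUF`. A tiny helper (hypothetical, not part of llama.cpp) can confirm a file is in GGUF format:

```python
# GGUF files begin with the 4-byte magic "GGUF". This helper is a quick
# sanity check that a conversion actually produced a GGUF file.
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```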


Step 7: Quantize the Model

Quantization is a technique that reduces the precision of the model's weights from 32-bit floating point numbers to lower bit representations (like 8-bit, 4-bit or even 2-bit integers). This significantly reduces model size and speeds up inference, with minimal impact on accuracy when done carefully.
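The core idea can be sketched in a few lines: scale the weights so the largest magnitude maps onto the int8 range, round to integers, and keep the scale so values can be reconstructed. This is a single-scale toy illustration, not llama.cpp's actual Q8_0 scheme (which uses per-block scales):

```python
import numpy as np

# Toy symmetric 8-bit quantization of a weight tensor. One scale for the
# whole tensor; real GGUF formats use one scale per small block of weights.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0      # map the largest weight to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale  # reconstruct approximate weights

w = np.random.randn(16).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, small rounding error
```

Rounding introduces at most half a quantization step of error per weight, which is why 8-bit quantization is usually nearly lossless.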

For example, a 3B parameter model normally requires ~12GB in FP32. After quantization:

  • Q8_0 (8-bit) reduces it to ~3GB
  • Q4_K_M (4-bit) reduces it to ~1.5GB
  • TQ2_0 (2-bit) reduces it to ~750MB
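These figures follow from simple arithmetic — parameters times bits per weight — ignoring the small per-block scale overhead that real GGUF files add:

```python
# Back-of-the-envelope model size at a given bit width. Real GGUF files
# are slightly larger because of per-block quantization scales.
def approx_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

n = 3e9  # 3B-parameter model
for name, bits in [("FP32", 32), ("Q8_0", 8), ("Q4_K_M", 4), ("TQ2_0", 2)]:
    print(f"{name}: ~{approx_size_gb(n, bits):.2f} GB")
```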

To quantize, we use the llama.cpp tools built earlier. On your computer, run:

./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-Q8_0 Q8_0
./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-Q4_K_M Q4_K_M
./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-TQ2_0 TQ2_0

| Format | Bitwidth | Trade-off |
| --- | --- | --- |
| Q8_0 | 8-bit | High accuracy, large size |
| Q4_K_M | 4-bit | Balance of size and speed |
| TQ2_0 | ~2-bit | Tiny, fast, less accurate |

Why Quantize? Reduces memory and compute cost, enabling real-time use on mobile.


Step 8: Test Inference Locally

On your computer, run:

./build/bin/llama-run llama-models/Llama-3.2-3B-Instruct-gguf-Q8_0 "What is quantization in the context of machine learning?"

Sanity check the model before deploying to Android. The command should output a response from the model to your terminal.


Step 9: Open the Android App Project

In Android Studio on your computer:

  • Open llama.cpp/examples/llama.android
  • Wait for Gradle sync (this may take 10-15 minutes)
  • Switch to "Project" view for easier navigation: click the 'Android' dropdown in the top left of the Project panel and select 'Project'.

Step 10: Enable Debugging Mode and Modify the App to Use Local Models

First, enable USB debugging on your phone:

  1. Navigate to Settings > About phone
  2. Tap Build number 7 times to enable Developer options
  3. Return to Settings > Developer options
  4. Toggle on USB debugging

Then, in Android Studio on your computer, edit the following files:

Edit llama.android/app/src/java/com.example.llama/MainActivity.kt and replace the models = listOf(...) section with:

val models = listOf(
    Downloadable("Llama-3.2-3B-q8", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-Q8_0")),
    Downloadable("Llama-3.2-3B-q4", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-Q4_K_M")),
    Downloadable("Llama-3.2-3B-q2", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-TQ2_0")),
)

Edit Downloadable.kt and replace the entire @Composable fun Button(...) implementation with:

@JvmStatic
@Composable
fun Button(viewModel: MainViewModel, dm: DownloadManager, item: Downloadable) {
    // The models are pushed over adb in the next step, so this button only
    // loads from local storage instead of downloading.
    val fileExists = item.destination.exists()

    fun onClick() {
        if (fileExists) {
            viewModel.load(item.destination.path)
        }
    }

    Button(onClick = { onClick() }, enabled = fileExists) {
        Text(if (fileExists) "Load ${item.name}" else "${item.name} - Not found")
    }
}

Press the Play button in Android Studio to push the application to the device; this also creates the app's external files directory. We haven't pushed the models yet, so you won't be able to load them — we'll fix that in the next step.


Step 11: Push Quantized Models to Android Device

  1. Connect your device via USB cable and authorize your computer when prompted

  2. Push the model files to device:

adb push llama-models/*gguf* /sdcard/Android/data/com.example.llama/files/

Verify the files have been transferred with:

adb shell ls /sdcard/Android/data/com.example.llama/files/

It should output something like this:

Llama-3.2-3B-Instruct-gguf-Q4_K_M
Llama-3.2-3B-Instruct-gguf-Q8_0
Llama-3.2-3B-Instruct-gguf-TQ2_0

Step 12: Build and Run the App

In Android Studio on your computer, press the Run button. The app will install on your device and open automatically.


Step 13: Use the App 🌟

With the app installed and your models loaded onto the device, it's time to interact with them on your Android phone.

User Interface Overview

When the app opens on your phone, you'll see a simple interface with buttons to select your model, a text input field, and action buttons:

[Screenshot: Llama Android App UI]

  • Load: Select one of the quantized models you pushed earlier.
  • Text Field: Type your prompt or question.
  • Send: Run inference on-device and receive a generated response.

This setup lets you evaluate both the usability and responsiveness of different quantized versions of the same model.

Note: Only one model can be loaded per app session. To switch models, fully close the app and rerun it via Android Studio.


Benchmarking Model Performance

To evaluate how each quantized model performs, tap the Bench button after loading a model.
This runs a built-in benchmarking routine that measures how efficiently the model processes and generates tokens on your device.
The benchmark reports two key metrics:

  • Prompt processing speed (pp) — measures how quickly the model encodes the input prompt before generation begins.
    This reflects forward-pass performance when the full prompt is fed to the model (i.e., how many input tokens it can process per second). It is usually a compute-bound stage.

  • Token generation speed (tg) — measures how fast the model generates output tokens during the autoregressive decoding phase.
    Each new token depends on the previously generated context, making this step more memory-bound.
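
Both metrics reduce to a simple ratio of tokens to elapsed time, measured over different phases. The token counts and timings below are made-up numbers for illustration:

```python
# pp and tg are both tokens-per-second figures, just measured over
# different phases. All numbers here are invented for illustration.
def tokens_per_second(n_tokens, seconds):
    return n_tokens / seconds

pp = tokens_per_second(512, 46.5)   # 512-token prompt encoded in 46.5 s
tg = tokens_per_second(128, 18.8)   # 128 tokens generated in 18.8 s
print(f"pp = {pp:.1f} tokens/s, tg = {tg:.1f} tokens/s")
```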

Repeat this process for each model variant (e.g., Q8, Q4, Q2) to compare how quantization impacts runtime efficiency.

Your output should resemble:

[Screenshot: Benchmark results for the Q8 model]

Then summarize your results in a table:

| Quant | Prompt Processing (pp) | Token Generation (tg) |
| --- | --- | --- |
| Q8 | 11.0 tokens/s | 6.8 tokens/s |
| Q4 | 12.0 tokens/s | 7.1 tokens/s |
| Q2 | 9.7 tokens/s | 10.0 tokens/s |

💡 Insight:
Quantization can reduce precision slightly but improves memory usage and inference speed.
The gain is most visible during token generation, which is typically the most time- and memory-intensive phase of autoregressive decoding.
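
A rough way to see why smaller models generate faster: if each generated token must stream roughly the entire weight file through memory, then memory bandwidth caps generation speed. The 25 GB/s bandwidth below is an assumed figure for a mid-range phone, not a measurement:

```python
# Rough upper bound on token generation speed for a memory-bound decoder:
# each token reads (approximately) all weights once, so
#     tg_max ~= memory_bandwidth / model_size
# 25 GB/s is an assumed mid-range phone bandwidth, for illustration only.
def tg_upper_bound(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

for name, size_gb in [("Q8_0", 3.0), ("Q4_K_M", 1.5), ("TQ2_0", 0.75)]:
    print(f"{name}: <= {tg_upper_bound(25, size_gb):.1f} tokens/s")
```

Halving the model size doubles this bound, which matches the trend that the smallest quantizations generate tokens fastest even when their prompt-processing speed is similar.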

This step provides both a quantitative measure of on-device efficiency and a qualitative sense of responsiveness across different quantization levels.


🎓 Course Summary: What We've Learned

Across the three labs in this course, you've gained hands-on experience with the complete pipeline for deploying efficient AI models on mobile devices:

Lab 1: Extreme Quantization Fundamentals

  • Trained a baseline FP32 language model and progressively quantized to 8-bit → 4-bit → 2-bit → 1-bit
  • Observed accuracy degradation with extreme compression (1-bit = 32× size reduction)
  • Implemented extreme ternary/binary quantization techniques
  • Recovered accuracy using Quantization-Aware Training (QAT) to achieve near-FP32 performance

Lab 2: Hardware-Software Co-Design

  • Implemented QLinear layers with integer-only GEMM operations
  • Applied per-layer mixed-precision quantization (8/4/2-bit) using post-training techniques
  • Profiled model size vs. cross-entropy to measure layer-specific sensitivity
  • Performed automated bit-width search to optimize hardware-aware objective functions

Lab 3: Real-World Mobile Deployment on ARM

  • Deployed quantized LLMs on ARM-based Android devices using llama.cpp
  • Benchmarked performance: aggressive quantization (Q2) can double generation speed while reducing model size by ~4x
  • Experienced the practical benefits: reduced latency, offline capability, enhanced privacy

Key Insight: Quantization enables the impossible—running billion-parameter models on ARM phones. The 3B Llama model went from ~12GB (unusable on mobile) to ~750MB-3GB (runs smoothly on a typical phone).

Further Exploration

  • Experiment with different model sizes (1B vs 3B vs 7B) on your device
  • Try custom prompts to test model quality across quantization levels
  • Explore llama.cpp's quantization formats: Q5_K_M, Q6_K for different accuracy/speed balances
  • Read the llama.cpp documentation for advanced optimization flags
  • Check out GGUF model benchmarks for community performance comparisons