
Lab 3: Running & Quantizing Llama Models on Android

This lab provides a hands-on walkthrough of running and optimizing compact large language models (LLMs) directly on Android devices using the llama.cpp framework. You'll work through the full pipeline: downloading, converting, quantizing, deploying, and benchmarking Llama-style models. By the end, you'll have an on-device Android application running a quantized LLM entirely offline, with no server dependencies.

Learning Goals

  • Understand the toolchain for working with llama.cpp
  • Learn how model quantization reduces memory and compute requirements
  • Modify an Android app to load and run local quantized models on device
  • Benchmark and compare the runtime performance of different quantized formats on an Android device

⚙️ Why it matters: Running LLMs natively on mobile offers benefits like reduced latency, improved privacy, and offline capability—critical for edge AI applications.


Prerequisites

| Component | Min version | Why you need it | Install hint |
| --- | --- | --- | --- |
| Python | 3.9 | for conversion and quantization scripts | https://python.org |
| Git | any | to clone repositories | sudo apt install git |
| CMake | 3.16 | to build llama.cpp tools | brew install cmake / apt |
| huggingface-cli | latest | to download models (auth required) | pip install --upgrade huggingface_hub |
| Android Studio | Hedgehog or newer | includes NDK & ADB for building and deploying | https://developer.android.com/studio |
| Android device | Android 10+, ≥ 6 GB RAM | target runtime | any modern phone |

Windows users: use WSL 2 with Ubuntu 22.04 for compatibility with build tools.


Step 1: Install Android Studio and Set Up the SDK

Android Studio includes all the tools needed to build and debug Android apps, including ADB (installed automatically as part of the Platform-Tools).

  1. Download and install Android Studio on your computer.
  2. During setup, ensure the following components are selected:
    • Android SDK
    • SDK Platform Tools (includes ADB — no separate install required)
    • NDK
  3. In Settings → Appearance & Behavior → System Settings → Android SDK, verify:
    • SDK path: ~/Android/Sdk/
    • NDK version ≥ 25
    • Platform-Tools version ≥ 34
  4. (Optional) Add the tools directory to your shell PATH so you can run ADB from any terminal:
    export PATH=$PATH:$HOME/Android/Sdk/platform-tools

Step 2: Authenticate with Hugging Face

We use Hugging Face to download a pretrained Llama-style model.

On your computer, run:

pip install --upgrade huggingface_hub
huggingface-cli login
  1. Log in at https://huggingface.co.
  2. Go to Settings → Access Tokens.
  3. Create a token with Read access.
  4. Paste the token in your terminal.

Step 3: Clone & Build llama.cpp

This builds the tools used to convert and quantize models:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON
make

🧪 Why: This creates binaries like llama-quantize and llama-run, which are used to transform and test models.


Step 4: Set Up a Workspace for Models

On your computer, run:

cd ..
mkdir llama-models && cd llama-models

Organizing models in one place keeps your workflow tidy.


Step 5: Download a Pretrained 3B Llama model

On your computer, run:

huggingface-cli download unsloth/Llama-3.2-3B-Instruct --repo-type model --local-dir ./Llama-3.2-3B-Instruct

Tip: Use a 1B model variant for devices with less RAM and storage.


Step 6: Convert Huggingface model to GGUF Format

On your computer, run:

cd ../
python convert_hf_to_gguf.py llama-models/Llama-3.2-3B-Instruct --outfile llama-models/Llama-3.2-3B-Instruct-gguf

Why GGUF? It's the native llama.cpp format that enables fast, memory-mapped loading and is portable across platforms.
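If you want to sanity-check the conversion output, GGUF files begin with the 4-byte magic string `GGUF`. A tiny helper (hypothetical, not part of llama.cpp) can confirm a file is in GGUF format:

```python
# GGUF files begin with the 4-byte magic "GGUF". This helper is a quick
# sanity check that a conversion actually produced a GGUF file.
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```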


Step 7: Quantize the Model

Quantization is a technique that reduces the precision of the model's weights from 32-bit floating point numbers to lower bit representations (like 8-bit, 4-bit or even 2-bit integers). This significantly reduces model size and speeds up inference, with minimal impact on accuracy when done carefully.
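The core idea can be sketched in a few lines: scale the weights so the largest magnitude maps onto the int8 range, round to integers, and keep the scale so values can be reconstructed. This is a single-scale toy illustration, not llama.cpp's actual Q8_0 scheme (which uses per-block scales):

```python
import numpy as np

# Toy symmetric 8-bit quantization of a weight tensor. One scale for the
# whole tensor; real GGUF formats use one scale per small block of weights.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0      # map the largest weight to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale  # reconstruct approximate weights

w = np.random.randn(16).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, small rounding error
```

Rounding introduces at most half a quantization step of error per weight, which is why 8-bit quantization is usually nearly lossless.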

For example, a 3B parameter model normally requires ~12GB in FP32. After quantization:

  • Q8_0 (8-bit) reduces it to ~3GB
  • Q4_K_M (4-bit) reduces it to ~1.5GB
  • TQ2_0 (2-bit) reduces it to ~750MB
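These figures follow from simple arithmetic — parameters times bits per weight — ignoring the small per-block scale overhead that real GGUF files add:

```python
# Back-of-the-envelope model size at a given bit width. Real GGUF files
# are slightly larger because of per-block quantization scales.
def approx_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

n = 3e9  # 3B-parameter model
for name, bits in [("FP32", 32), ("Q8_0", 8), ("Q4_K_M", 4), ("TQ2_0", 2)]:
    print(f"{name}: ~{approx_size_gb(n, bits):.2f} GB")
```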

To quantize, we use the llama.cpp tools built earlier. On your computer, run:

./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-Q8_0 Q8_0
./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-Q4_K_M Q4_K_M
./build/bin/llama-quantize llama-models/Llama-3.2-3B-Instruct-gguf llama-models/Llama-3.2-3B-Instruct-gguf-TQ2_0 TQ2_0

| Format | Bitwidth | Trade-off |
| --- | --- | --- |
| Q8_0 | 8-bit | High accuracy, large size |
| Q4_K_M | 4-bit | Balance of size and speed |
| TQ2_0 | ~2-bit | Tiny, fast, less accurate |

Why Quantize? Reduces memory and compute cost, enabling real-time use on mobile.


Step 8: Test Inference Locally

On your computer, run:

./build/bin/llama-run llama-models/Llama-3.2-3B-Instruct-gguf-Q8_0 "What is quantization in the context of machine learning?"

Sanity check the model before deploying to Android. The command should output a response from the model to your terminal.


Step 9: Open the Android App Project

In Android Studio on your computer:

  • Open llama.cpp/examples/llama.android
  • Wait for Gradle sync (this may take 10-15 minutes)
  • Switch to "Project" view for easier navigation: click the 'Android' dropdown in the top left of the Project panel and select 'Project'.

Step 10: Enable Debugging Mode and Modify the App to Use Local Models

First, enable USB debugging on your phone:

  1. Navigate to Settings > About phone
  2. Tap Build number 7 times to enable Developer options
  3. Return to Settings > Developer options
  4. Toggle on USB debugging

Then, in Android Studio on your computer, edit the following files:

Edit llama.android/app/src/java/com.example.llama/MainActivity.kt and replace the models = listOf(...) section with:

val models = listOf(
    Downloadable("Llama-3.2-3B-q8", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-Q8_0")),
    Downloadable("Llama-3.2-3B-q4", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-Q4_K_M")),
    Downloadable("Llama-3.2-3B-q2", Uri.EMPTY, File(extFilesDir, "Llama-3.2-3B-Instruct-gguf-TQ2_0")),
)

Edit Downloadable.kt and replace the entire @Composable fun Button(...) implementation with:

@JvmStatic
@Composable
fun Button(viewModel: MainViewModel, dm: DownloadManager, item: Downloadable) {
    // The models are pushed over adb in the next step, so this button only
    // loads from local storage instead of downloading.
    val fileExists = item.destination.exists()

    fun onClick() {
        if (fileExists) {
            viewModel.load(item.destination.path)
        }
    }

    Button(onClick = { onClick() }, enabled = fileExists) {
        Text(if (fileExists) "Load ${item.name}" else "${item.name} - Not found")
    }
}

Press the Play button in Android Studio to push the application to the device; this also creates the app's external files directory. We haven't pushed the models yet, so you won't be able to load them — we'll fix that in the next step.


Step 11: Push Quantized Models to Android Device

  1. Connect your device via USB cable and authorize your computer when prompted

  2. Push the model files to device:

adb push llama-models/*gguf* /sdcard/Android/data/com.example.llama/files/

Verify the files have been transferred with:

adb shell ls /sdcard/Android/data/com.example.llama/files/

It should output something like this:

Llama-3.2-3B-Instruct-gguf-Q4_K_M
Llama-3.2-3B-Instruct-gguf-Q8_0
Llama-3.2-3B-Instruct-gguf-TQ2_0

Step 12: Build and Run the App

In Android Studio on your computer, press the Run button. The app will install on your device and open automatically.


Step 13: Use the App 🌟

With the app installed and your models loaded onto the device, it's time to interact with them on your Android phone.

User Interface Overview

When the app opens on your phone, you'll see a simple interface with buttons to select your model, a text input field, and action buttons:

[Screenshot: Llama Android App UI]

  • Load: Select one of the quantized models you pushed earlier.
  • Text Field: Type your prompt or question.
  • Send: Run inference on-device and receive a generated response.

This setup lets you evaluate both the usability and responsiveness of different quantized versions of the same model.

Note: Only one model can be loaded per app session. To switch models, fully close the app and rerun it via Android Studio.


Benchmarking Model Performance

To evaluate how each quantized model performs, tap the Bench button after loading a model.
This runs a built-in benchmarking routine that measures how efficiently the model processes and generates tokens on your device.
The benchmark reports two key metrics:

  • Prompt processing speed (pp) — measures how quickly the model encodes the input prompt before generation begins.
    This reflects forward-pass performance when the full prompt is fed to the model (i.e., how many input tokens it can process per second). It is usually a compute-bound stage.

  • Token generation speed (tg) — measures how fast the model generates output tokens during the autoregressive decoding phase.
    Each new token depends on the previously generated context, making this step more memory-bound.
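
Both metrics reduce to a simple ratio of tokens to elapsed time, measured over different phases. The token counts and timings below are made-up numbers for illustration:

```python
# pp and tg are both tokens-per-second figures, just measured over
# different phases. All numbers here are invented for illustration.
def tokens_per_second(n_tokens, seconds):
    return n_tokens / seconds

pp = tokens_per_second(512, 46.5)   # 512-token prompt encoded in 46.5 s
tg = tokens_per_second(128, 18.8)   # 128 tokens generated in 18.8 s
print(f"pp = {pp:.1f} tokens/s, tg = {tg:.1f} tokens/s")
```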

Repeat this process for each model variant (e.g., Q8, Q4, Q2) to compare how quantization impacts runtime efficiency.

Your output should resemble:

[Screenshot: Benchmark results for the Q8 model]

Then summarize your results in a table:

| Quant | Prompt Processing (pp) | Token Generation (tg) |
| --- | --- | --- |
| Q8 | 11.0 tokens/s | 6.8 tokens/s |
| Q4 | 12.0 tokens/s | 7.1 tokens/s |
| Q2 | 9.7 tokens/s | 10.0 tokens/s |

💡 Insight:
Quantization can reduce precision slightly but improves memory usage and inference speed.
The gain is most visible during token generation, which is typically the most time- and memory-intensive phase of autoregressive decoding.
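
A rough way to see why smaller models generate faster: if each generated token must stream roughly the entire weight file through memory, then memory bandwidth caps generation speed. The 25 GB/s bandwidth below is an assumed figure for a mid-range phone, not a measurement:

```python
# Rough upper bound on token generation speed for a memory-bound decoder:
# each token reads (approximately) all weights once, so
#     tg_max ~= memory_bandwidth / model_size
# 25 GB/s is an assumed mid-range phone bandwidth, for illustration only.
def tg_upper_bound(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

for name, size_gb in [("Q8_0", 3.0), ("Q4_K_M", 1.5), ("TQ2_0", 0.75)]:
    print(f"{name}: <= {tg_upper_bound(25, size_gb):.1f} tokens/s")
```

Halving the model size doubles this bound, which matches the trend that the smallest quantizations generate tokens fastest even when their prompt-processing speed is similar.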

This step provides both a quantitative measure of on-device efficiency and a qualitative sense of responsiveness across different quantization levels.


🎓 Course Summary: What We've Learned

Across the three labs in this course, you've gained hands-on experience with the complete pipeline for deploying efficient AI models on mobile devices:

Lab 1: Extreme Quantization Fundamentals

  • Trained a baseline FP32 language model and progressively quantized to 8-bit → 4-bit → 2-bit → 1-bit
  • Observed accuracy degradation with extreme compression (1-bit = 32× size reduction)
  • Implemented extreme ternary/binary quantization techniques
  • Recovered accuracy using Quantization-Aware Training (QAT) to achieve near-FP32 performance

Lab 2: Hardware-Software Co-Design

  • Implemented QLinear layers with integer-only GEMM operations
  • Applied per-layer mixed-precision quantization (8/4/2-bit) using post-training techniques
  • Profiled model size vs. cross-entropy to measure layer-specific sensitivity
  • Performed automated bit-width search to optimize hardware-aware objective functions

Lab 3: Real-World Mobile Deployment on ARM

  • Deployed quantized LLMs on ARM-based Android devices using llama.cpp
  • Benchmarked performance: aggressive quantization (Q2) can double generation speed while reducing model size by ~4x
  • Experienced the practical benefits: reduced latency, offline capability, enhanced privacy

Key Insight: Quantization enables the impossible—running billion-parameter models on ARM phones. The 3B Llama model went from ~12GB (unusable on mobile) to ~750MB-3GB (runs smoothly on a typical phone).

Further Exploration

  • Experiment with different model sizes (1B vs 3B vs 7B) on your device
  • Try custom prompts to test model quality across quantization levels
  • Explore llama.cpp's quantization formats: Q5_K_M, Q6_K for different accuracy/speed balances
  • Read the llama.cpp documentation for advanced optimization flags
  • Check out GGUF model benchmarks for community performance comparisons