diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index e6a6dd9a..f438d294 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -14,6 +14,12 @@ "skills": "./", "description": "Train or fine-tune language models using TRL on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes hardware selection, cost estimation, Trackio monitoring, and Hub persistence." }, + { + "name": "huggingface-local-models", + "source": "./skills/huggingface-local-models", + "skills": "./", + "description": "Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving." + }, { "name": "huggingface-paper-publisher", "source": "./skills/huggingface-paper-publisher", diff --git a/README.md b/README.md index a50f86ac..ef0f23dc 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,7 @@ This repository contains a few skills to get you started. You can also contribut | `huggingface-datasets` | Explore, query, and extract data from any Hugging Face dataset using the Dataset Viewer REST API and npx tooling. Zero Python dependencies — covers split/config discovery, row pagination, text search, filtering, SQL via parquetlens, and dataset upload via CLI. | [SKILL.md](skills/huggingface-datasets/SKILL.md) | | `huggingface-gradio` | Build Gradio web UIs and demos in Python. Use when creating or editing Gradio apps, components, event listeners, layouts, or chatbots. | [SKILL.md](skills/huggingface-gradio/SKILL.md) | | `huggingface-llm-trainer` | Train or fine-tune language models using TRL on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes hardware selection, cost estimation, Trackio monitoring, and Hub persistence. | [SKILL.md](skills/huggingface-llm-trainer/SKILL.md) | +| `huggingface-local-models` | Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving. | [SKILL.md](skills/huggingface-local-models/SKILL.md) | | `huggingface-paper-publisher` | Publish and manage research papers on Hugging Face Hub. Supports creating paper pages, linking papers to models/datasets, claiming authorship, and generating professional markdown-based research articles. | [SKILL.md](skills/huggingface-paper-publisher/SKILL.md) | | `huggingface-papers` | Look up and read Hugging Face paper pages in markdown, and use the papers API for structured metadata like authors, linked models, datasets, Spaces, and media URLs when needed. | [SKILL.md](skills/huggingface-papers/SKILL.md) | | `huggingface-tool-builder` | Build reusable scripts for Hugging Face Hub and API workflows. Useful for chaining API calls, enriching Hub metadata, or automating repeated tasks. 
| [SKILL.md](skills/huggingface-tool-builder/SKILL.md) | diff --git a/agents/AGENTS.md b/agents/AGENTS.md index d591b085..f9389b8c 100644 --- a/agents/AGENTS.md +++ b/agents/AGENTS.md @@ -8,6 +8,7 @@ These skills are: - huggingface-datasets -> "skills/huggingface-datasets/SKILL.md" - huggingface-gradio -> "skills/huggingface-gradio/SKILL.md" - huggingface-llm-trainer -> "skills/huggingface-llm-trainer/SKILL.md" + - huggingface-local-models -> "skills/huggingface-local-models/SKILL.md" - huggingface-paper-publisher -> "skills/huggingface-paper-publisher/SKILL.md" - huggingface-papers -> "skills/huggingface-papers/SKILL.md" - huggingface-tool-builder -> "skills/huggingface-tool-builder/SKILL.md" @@ -24,6 +25,7 @@ huggingface-community-evals: `Run evaluations for Hugging Face Hub models using huggingface-datasets: `Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, download parquet URLs, and read size or statistics.` huggingface-gradio: `Build Gradio web UIs and demos in Python. Use when creating or editing Gradio apps, components, event listeners, layouts, or chatbots.` huggingface-llm-trainer: `Train or fine-tune language and vision models using TRL (Transformer Reinforcement Learning) or Unsloth with Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, model selection/leaderboards and model persistence. Use for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.` +huggingface-local-models: `"Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving."` huggingface-paper-publisher: `Publish and manage research papers on Hugging Face Hub. Supports creating paper pages, linking papers to models/datasets, claiming authorship, and generating professional markdown-based research articles.` huggingface-papers: `Look up and read Hugging Face paper pages in markdown, and use the papers API for structured metadata such as authors, linked models/datasets/spaces, Github repo and project page. Use when the user shares a Hugging Face paper page URL, an arXiv URL or ID, or asks to summarize, explain, or analyze an AI research paper.` huggingface-tool-builder: `Use this skill when the user wants to build tool/scripts or achieve a task where using data from the Hugging Face API would help. This is especially useful when chaining or combining API calls or the task will be repeated/automated. 
 This Skill creates a reusable script to fetch, enrich or process data.`
diff --git a/scripts/publish.sh b/scripts/publish.sh
index 9afe06ea..d1cceb01 100755
--- a/scripts/publish.sh
+++ b/scripts/publish.sh
@@ -26,17 +26,18 @@ run_generate() {
 }
 
 run_check() {
-  declare -A before
+  local before=()
   local changed=()
 
   for path in "${GENERATED_FILES[@]}"; do
-    before["$path"]="$(file_sig "$path")"
+    before+=("$(file_sig "$path")")
   done
 
   run_generate
 
-  for path in "${GENERATED_FILES[@]}"; do
-    if [[ "${before[$path]}" != "$(file_sig "$path")" ]]; then
+  for i in "${!GENERATED_FILES[@]}"; do
+    local path="${GENERATED_FILES[$i]}"
+    if [[ "${before[$i]}" != "$(file_sig "$path")" ]]; then
       changed+=("$path")
     fi
   done
diff --git a/skills/huggingface-local-models/SKILL.md b/skills/huggingface-local-models/SKILL.md
new file mode 100644
index 00000000..ece84955
--- /dev/null
+++ b/skills/huggingface-local-models/SKILL.md
@@ -0,0 +1,113 @@
+---
+name: huggingface-local-models
+description: "Use to select models to run locally with llama.cpp and GGUF on CPU, Mac Metal, CUDA, or ROCm. Covers finding GGUFs, quant selection, running servers, exact GGUF file lookup, conversion, and OpenAI-compatible local serving."
+---
+
+# Hugging Face Local Models
+
+Search the Hugging Face Hub for llama.cpp-compatible GGUF repos, choose the right quant, and launch the model with `llama-cli` or `llama-server`.
+
+## Default Workflow
+
+1. Search the Hub with `apps=llama.cpp`.
+2. Open `https://huggingface.co/<owner>/<repo>?local-app=llama.cpp`.
+3. Prefer the exact HF local-app snippet and quant recommendation when it is visible.
+4. Confirm exact `.gguf` filenames with `https://huggingface.co/api/models/<owner>/<repo>/tree/main?recursive=true`.
+5. Launch with `llama-cli -hf <owner>/<repo>:<quant>` or `llama-server -hf <owner>/<repo>:<quant>`.
+6. Fall back to `--hf-repo` plus `--hf-file` when the repo uses custom file naming.
+7. Convert from Transformers weights only if the repo does not already expose GGUF files.
+
+## Quick Start
+
+### Install llama.cpp
+
+```bash
+brew install llama.cpp     # macOS (Homebrew)
+winget install llama.cpp   # Windows (winget)
+```
+
+```bash
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+make
+```
+
+### Authenticate for gated repos
+
+```bash
+hf auth login
+```
+
+### Search the Hub
+
+```text
+https://huggingface.co/models?apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=Qwen3.6&apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
+```
+
+### Run directly from the Hub
+
+```bash
+llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
+llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
+```
+
+### Run an exact GGUF file
+
+```bash
+llama-server \
+  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
+  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
+  -c 4096
+```
+
+### Convert only when no GGUF is available
+
+```bash
+hf download <owner>/<model> --local-dir ./model-src
+python convert_hf_to_gguf.py ./model-src \
+  --outfile model-f16.gguf \
+  --outtype f16
+llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
+```
+
+### Smoke test a local server
+
+```bash
+llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
+```
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer no-key" \
+  -d '{
+    "messages": [
+      {"role": "user", "content": "Write a limerick about exception handling"}
+    ]
+  }'
+```
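+
+If `jq` is installed, the same OpenAI-compatible endpoints can be spot-checked from the shell. This is a minimal sketch, assuming the server is still running on the default port 8080:
+
+```bash
+# List the model ids the server exposes (OpenAI-compatible /v1/models)
+curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
+
+# Send a chat request and print only the reply text
+curl -s http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Say hello in five words"}]}' \
+  | jq -r '.choices[0].message.content'
+```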
+
+## Quant Choice
+
+- Prefer the exact quant that HF marks as compatible on the `?local-app=llama.cpp` page.
+- Keep repo-native labels such as `UD-Q4_K_M` instead of normalizing them.
+- Default to `Q4_K_M` unless the repo page or hardware profile suggests otherwise.
+- Prefer `Q5_K_M` or `Q6_K` for code or technical workloads when memory allows.
+- Consider `Q3_K_M`, `Q4_K_S`, or repo-specific `IQ` / `UD-*` variants for tighter RAM or VRAM budgets.
+- Treat `mmproj-*.gguf` files as projector weights, not the main checkpoint.
+
+## Load References
+
+- Read [hub-discovery.md](references/hub-discovery.md) for URL-first workflows, model search, tree API extraction, and command reconstruction.
+- Read [quantization.md](references/quantization.md) for format tables, model scaling, quality tradeoffs, and `imatrix`.
+- Read [hardware.md](references/hardware.md) for Metal, CUDA, ROCm, or CPU build and acceleration details.
+
+## Resources
+
+- llama.cpp: `https://github.com/ggml-org/llama.cpp`
+- Hugging Face GGUF + llama.cpp docs: `https://huggingface.co/docs/hub/gguf-llamacpp`
+- Hugging Face Local Apps docs: `https://huggingface.co/docs/hub/main/local-apps`
+- Hugging Face Local Agents docs: `https://huggingface.co/docs/hub/agents-local`
+- GGUF converter Space: `https://huggingface.co/spaces/ggml-org/gguf-my-repo`
diff --git a/skills/huggingface-local-models/references/hardware.md b/skills/huggingface-local-models/references/hardware.md
new file mode 100644
index 00000000..33c15b41
--- /dev/null
+++ b/skills/huggingface-local-models/references/hardware.md
@@ -0,0 +1,38 @@
+## Hardware Acceleration
+
+### Apple Silicon (Metal)
+
+```bash
+make clean && make GGML_METAL=1
+llama-cli -m model.gguf -ngl 99 -p "Hello"
+```
+
+### NVIDIA (CUDA)
+
+```bash
+make clean && make GGML_CUDA=1
+llama-cli -m model.gguf -ngl 35 -p "Hello"
+
+# Hybrid for large models
+llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20
+
+# Multi-GPU split
+llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
+```
+
+### AMD (ROCm)
+
+```bash
+make LLAMA_HIP=1
+llama-cli -m model.gguf -ngl 999
+```
+
+### CPU
+
+```bash
+# Match physical cores, not logical threads
+llama-cli -m model.gguf -t 8 -p "Hello"
+
+# BLAS acceleration
+make LLAMA_OPENBLAS=1
+```
\ No newline at end of file
diff --git a/skills/huggingface-local-models/references/hub-discovery.md b/skills/huggingface-local-models/references/hub-discovery.md
new file mode 100644
index 00000000..4f286ef4
--- /dev/null
+++ b/skills/huggingface-local-models/references/hub-discovery.md
@@ -0,0 +1,178 @@
+# Hugging Face URL Workflows for llama.cpp
+
+Use URL-only workflows first. Do not require `hf` or API clients just to find GGUF files, choose a quant, or build a `llama-server` command.
+
+## Contents
+
+- Core URLs
+- Search for llama.cpp-compatible models
+- Use the local-app page for the recommended quant
+- Confirm exact files from the tree API
+- Build the command
+- Example: `unsloth/Qwen3.6-35B-A3B-GGUF`
+- Notes
+
+## Core URLs
+
+```text
+Search:
+https://huggingface.co/models?apps=llama.cpp&sort=trending
+
+Search with text:
+https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
+
+Search with size bounds:
+https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
+
+Repo local-app view:
+https://huggingface.co/<owner>/<repo>?local-app=llama.cpp
+
+Repo tree API:
+https://huggingface.co/api/models/<owner>/<repo>/tree/main?recursive=true
+
+Repo file tree:
+https://huggingface.co/<owner>/<repo>/tree/main
+```
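+
+The tree API above returns plain JSON, so a short shell sketch can list every GGUF in a repo together with its size. This assumes `curl` and `jq` are available; the repo name is only an example and mirrors the worked example later in this file:
+
+```bash
+# List .gguf files with their size in bytes, smallest first
+curl -s "https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true" \
+  | jq -r '.[] | select(.type == "file" and (.path | endswith(".gguf"))) | [.size, .path] | @tsv' \
+  | sort -n
+```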
+
+## 1. Search for llama.cpp-compatible models
+
+Start from the models page with `apps=llama.cpp`.
+
+Use:
+
+- `search=<term>` for model family names such as `Qwen`, `Gemma`, `Phi`, or `Mistral`
+- `num_parameters=min:0,max:24B` or similar if the user has hardware limits
+- `sort=trending` when the user wants popular repos right now
+
+Do not start with random GGUF repos if the user has not chosen a model family yet. Search first, shortlist second.
+
+Example: https://huggingface.co/models?search=Qwen&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
+
+## 2. Use the local-app page for the recommended quant
+
+Open:
+
+```text
+https://huggingface.co/<owner>/<repo>?local-app=llama.cpp
+```
+
+Extract, in order:
+
+1. The exact `Use this model` snippet, if it is visible as text
+2. The `Hardware compatibility` section from the fetched page text or HTML:
+   - quant label
+   - file size
+   - bit-depth grouping
+3. Any extra launch flags shown in the snippet, such as `--jinja`
+
+Treat the HF local-app snippet as the source of truth when it is visible.
+
+Extract this from the fetched page source itself; do not assume the page rendered as it would in a browser. If the fetched page source does not expose `Hardware compatibility`, say that the section was not text-visible and fall back to the tree API plus generic guidance from `quantization.md`.
+
+## 3. Confirm exact files from the tree API
+
+Open:
+
+```text
+https://huggingface.co/api/models/<owner>/<repo>/tree/main?recursive=true
+```
+
+Treat the JSON response as the source of truth for repo inventory.
+
+Keep entries where:
+
+- `type` is `file`
+- `path` ends with `.gguf`
+
+Use these fields:
+
+- `path` for the filename and subdirectory
+- `size` for the byte size
+- optionally `lfs.size` to confirm the LFS payload size
+
+Separate files into:
+
+- quantized single-file checkpoints, for example `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
+- projector weights, usually `mmproj-*.gguf`
+- BF16 shard files, usually under `BF16/`
+- everything else
+
+Ignore unless the user asks:
+
+- `README.md`
+- imatrix or calibration blobs
+
+Use `https://huggingface.co/<owner>/<repo>/tree/main` only as a human fallback if the API endpoint fails or the user wants the web view.
+
+## 4. Build the command
+
+Preferred order:
+
+1. Copy the exact HF snippet from the local-app page
+2. If the page gives a clean quant label, use shorthand selection:
+
+```bash
+llama-server -hf <owner>/<repo>:<quant>
+```
+
+3. If you need an exact file from the tree API, use the file-specific form:
+
+```bash
+llama-server --hf-repo <owner>/<repo> --hf-file <file>.gguf
+```
+
+4. For CLI usage instead of a server, use:
+
+```bash
+llama-cli -hf <owner>/<repo>:<quant>
+```
+
+Use the exact-file form when the repo uses custom labels or nonstandard naming that could make `<owner>/<repo>:<quant>` ambiguous.
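+
+Sections 3 and 4 can be chained in one small sketch: resolve a chosen quant label to the exact `.gguf` filename, then print the matching `llama-server` command. `curl` and `jq` are assumed, and the repo and label mirror the example in the next section:
+
+```bash
+REPO="unsloth/Qwen3.6-35B-A3B-GGUF"
+QUANT="UD-Q4_K_M"
+
+# First .gguf file whose name contains the chosen quant label
+FILE=$(curl -s "https://huggingface.co/api/models/${REPO}/tree/main?recursive=true" \
+  | jq -r --arg q "$QUANT" \
+      '[.[] | select(.type == "file" and (.path | endswith(".gguf")) and (.path | contains($q)))][0].path')
+
+echo "llama-server --hf-repo ${REPO} --hf-file ${FILE}"
+```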
+
+## 5. Example: `unsloth/Qwen3.6-35B-A3B-GGUF`
+
+Use these URLs:
+
+```text
+https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
+https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
+https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
+```
+
+On the local-app page, the hardware compatibility section can expose entries such as:
+
+- `UD-IQ4_XS` - 17.7 GB
+- `UD-Q4_K_S` - 20.9 GB
+- `UD-Q4_K_M` - 22.1 GB
+- `UD-Q5_K_M` - 26.5 GB
+- `UD-Q6_K` - 29.3 GB
+- `Q8_0` - 36.9 GB
+
+On the tree API, you can confirm exact filenames such as:
+
+- `Qwen3.6-35B-A3B-UD-Q4_K_M.gguf`
+- `Qwen3.6-35B-A3B-UD-Q5_K_M.gguf`
+- `Qwen3.6-35B-A3B-UD-Q6_K.gguf`
+- `Qwen3.6-35B-A3B-Q8_0.gguf`
+- `mmproj-F16.gguf`
+
+Good final output for this repo:
+
+```text
+Repo: unsloth/Qwen3.6-35B-A3B-GGUF
+Recommended quant from HF: UD-Q4_K_M (22.1 GB)
+llama-server: llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
+Other GGUFs:
+- Qwen3.6-35B-A3B-UD-Q5_K_M.gguf - 26.5 GB
+- Qwen3.6-35B-A3B-UD-Q6_K.gguf - 29.3 GB
+- Qwen3.6-35B-A3B-Q8_0.gguf - 36.9 GB
+Projector:
+- mmproj-F16.gguf - 899 MB
+```
+
+## Notes
+
+- Repo-specific quant labels matter. Do not rewrite `UD-Q4_K_M` to `Q4_K_M` unless the page itself does.
+- `mmproj` files are projector weights for multimodal models, not the main language model checkpoint.
+- If the HF hardware compatibility panel is missing because the user has no hardware profile configured, or because the fetched page source did not expose it, still use the tree API plus generic quant guidance from `quantization.md`.
+- If the repo already has GGUFs, do not jump straight to conversion workflows.
diff --git a/skills/huggingface-local-models/references/quantization.md b/skills/huggingface-local-models/references/quantization.md
new file mode 100644
index 00000000..2099976e
--- /dev/null
+++ b/skills/huggingface-local-models/references/quantization.md
@@ -0,0 +1,256 @@
+# GGUF Quantization Guide
+
+Complete guide to GGUF quantization formats and model conversion.
+
+## Contents
+
+- Hub-first quant selection
+- Quantization Formats
+- Converting Models
+- K-Quantization Methods
+- Quality Testing
+- Use Case Guide
+- Model Size Scaling
+- Finding Pre-Quantized Models
+- Importance Matrices (`imatrix`)
+- Troubleshooting
+
+## Hub-first quant selection
+
+Before using generic tables, open the model repo with:
+
+```text
+https://huggingface.co/<owner>/<repo>?local-app=llama.cpp
+```
+
+Prefer the exact quant labels and sizes shown in the `Hardware compatibility` section of the fetched `?local-app=llama.cpp` page text or HTML. Then confirm the matching filenames in:
+
+```text
+https://huggingface.co/api/models/<owner>/<repo>/tree/main?recursive=true
+```
+
+Use the Hub page first, and only fall back to the generic heuristics below when the repo page does not expose a clear recommendation.
+
+## Quantization Formats
+
+**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.
+
+### Format Comparison
+
+| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
+|--------|------------|-----------|------------|-------|
+| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
+| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
+| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
+| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
+| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
+| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
+| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
+| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
+
+**Recommendation**: Use **Q4_K_M** for best balance of quality and speed.
+
+## Converting Models
+
+### Hugging Face to GGUF
+
+```bash
+# 1. Download Hugging Face model
+hf download meta-llama/Llama-2-7b-chat-hf \
+  --local-dir models/llama-2-7b-chat/
+
+# 2. Convert to FP16 GGUF
+python convert_hf_to_gguf.py \
+  models/llama-2-7b-chat/ \
+  --outtype f16 \
+  --outfile models/llama-2-7b-chat-f16.gguf
+
+# 3. Quantize to Q4_K_M
+./llama-quantize \
+  models/llama-2-7b-chat-f16.gguf \
+  models/llama-2-7b-chat-Q4_K_M.gguf \
+  Q4_K_M
+```
+
+### Batch quantization
+
+```bash
+# Quantize to multiple formats
+for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
+  ./llama-quantize \
+    model-f16.gguf \
+    model-${quant}.gguf \
+    $quant
+done
+```
+
+## K-Quantization Methods
+
+**K-quants** use mixed precision for better quality:
+- Attention weights: Higher precision
+- Feed-forward weights: Lower precision
+
+**Variants**:
+- `_S` (Small): Faster, lower quality
+- `_M` (Medium): Balanced (recommended)
+- `_L` (Large): Better quality, larger size
+
+**Example**: `Q4_K_M`
+- `Q4`: 4-bit quantization
+- `K`: Mixed precision method
+- `M`: Medium quality
+
+## Quality Testing
+
+```bash
+# Calculate perplexity (quality metric)
+./llama-perplexity \
+  -m model.gguf \
+  -f wikitext-2-raw/wiki.test.raw \
+  -c 512
+
+# Lower perplexity = better quality
+# Baseline (FP16): ~5.96
+# Q4_K_M: ~6.06 (+1.7%)
+# Q2_K: ~6.87 (+15.3% - too much degradation)
+```
+
+## Use Case Guide
+
+### General purpose (chatbots, assistants)
+```
+Q4_K_M - Best balance
+Q5_K_M - If you have extra RAM
+```
+
+### Code generation
+```
+Q5_K_M or Q6_K - Higher precision helps with code
+```
+
+### Creative writing
+```
+Q4_K_M - Sufficient quality
+Q3_K_M - Acceptable for draft generation
+```
+
+### Technical/medical
+```
+Q6_K or Q8_0 - Maximum accuracy
+```
+
+### Edge devices (Raspberry Pi)
+```
+Q2_K or Q3_K_S - Fit in limited RAM
+```
+
+## Model Size Scaling
+
+### 7B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 2.7 GB | 5 GB |
+| Q3_K_M | 3.3 GB | 6 GB |
+| Q4_K_M | 4.1 GB | 7 GB |
+| Q5_K_M | 4.8 GB | 8 GB |
+| Q6_K | 5.5 GB | 9 GB |
+| Q8_0 | 7.0 GB | 11 GB |
+
+### 13B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 5.1 GB | 8 GB |
+| Q3_K_M | 6.2 GB | 10 GB |
+| Q4_K_M | 7.9 GB | 12 GB |
+| Q5_K_M | 9.2 GB | 14 GB |
+| Q6_K | 10.7 GB | 16 GB |
+
+### 70B parameter models
+
+| Format | Size | RAM needed |
+|--------|------|------------|
+| Q2_K | 26 GB | 32 GB |
+| Q3_K_M | 32 GB | 40 GB |
+| Q4_K_M | 41 GB | 48 GB |
+| Q4_K_S | 39 GB | 46 GB |
+| Q5_K_M | 48 GB | 56 GB |
+
+**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
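+
+As a rough cross-check of the tables above, file size in GB is approximately parameters (in billions) × bits per weight ÷ 8, plus a small overhead. A quick sketch, treating ~4.85 bits per weight as an approximate figure for Q4_K_M:
+
+```bash
+# 70B parameters at roughly 4.85 bits per weight
+echo "scale=1; 70 * 4.85 / 8" | bc
+# prints 42.4, in line with the 41 GB Q4_K_M row above
+```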
+
+## Finding Pre-Quantized Models
+
+Use the Hub search with the llama.cpp app filter:
+
+```text
+https://huggingface.co/models?apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
+https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
+```
+
+For a specific repo, open:
+
+```text
+https://huggingface.co/<owner>/<repo>?local-app=llama.cpp
+https://huggingface.co/api/models/<owner>/<repo>/tree/main?recursive=true
+```
+
+Then launch directly from the Hub without extra Hub tooling:
+
+```bash
+llama-cli -hf <owner>/<repo>:Q4_K_M
+llama-server -hf <owner>/<repo>:Q4_K_M
+```
+
+If you need the exact file name from the tree API:
+
+```bash
+llama-server --hf-repo <owner>/<repo> --hf-file <file>.gguf
+```
+
+## Importance Matrices (imatrix)
+
+**What**: Calibration data to improve quantization quality.
+
+**Benefits**:
+- 10-20% perplexity improvement with Q4
+- Essential for Q3 and below
+
+**Usage**:
+```bash
+# 1. Generate importance matrix
+./llama-imatrix \
+  -m model-f16.gguf \
+  -f calibration-data.txt \
+  -o model.imatrix
+
+# 2. Quantize with imatrix
+./llama-quantize \
+  --imatrix model.imatrix \
+  model-f16.gguf \
+  model-Q4_K_M.gguf \
+  Q4_K_M
+```
+
+**Calibration data**:
+- Use domain-specific text (e.g., code for code models)
+- ~100MB of representative text
+- Higher quality data = better quantization
+
+## Troubleshooting
+
+**Model outputs gibberish**:
+- Quantization too aggressive (Q2_K)
+- Try Q4_K_M or Q5_K_M
+- Verify the model converted correctly
+
+**Out of memory**:
+- Use a lower quantization (Q4_K_S instead of Q5_K_M)
+- Offload fewer layers to the GPU (`-ngl`)
+- Use a smaller context (`-c 2048`)
+
+**Slow inference**:
+- Higher-bit quants are larger and run slower
+- Q8_0 is much slower than Q4_K_M
+- Consider the speed vs quality trade-off
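+
+When the memory or speed issues above appear, a conservative relaunch that combines the mitigations listed here is a reasonable first step. The repo and file mirror the earlier example, and the flag values are starting points rather than tuned settings:
+
+```bash
+# Smaller context (-c), partial GPU offload (-ngl), explicit thread count (-t)
+llama-server \
+  --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF \
+  --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
+  -c 2048 -ngl 20 -t 8
+```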