Update mixtral.md #1940
@@ -285,8 +285,32 @@ output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, as in the example above, so that some layers are offloaded to the CPU.
If you have the [exllama kernels installed](https://github.com/turboderp/exllama), you can leverage them to run the GPTQ model. To do so, load the model with a custom GPTQ configuration where you set the desired parameters:
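If you want finer control over how much of the model stays on the GPU versus CPU RAM, you can also pass a `max_memory` budget alongside `device_map="auto"`. The snippet below is only a minimal sketch: it assumes the GPTQ dependencies are installed, and the memory budgets are illustrative values you would tune for your own hardware.

```python
from transformers import AutoModelForCausalLM

# Sketch: cap GPU 0 at ~22 GiB so that accelerate offloads the remaining
# layers to CPU RAM. The budgets below are illustrative, not recommendations.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mixtral-8x7B-v0.1-GPTQ",
    device_map="auto",
    max_memory={0: "22GiB", "cpu": "64GiB"},
)
```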
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, use_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(0)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
If left unset, the `use_exllama` parameter defaults to `True`, enabling the exllama backend, which is specifically designed to work with `bits=4`.
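In other words, a configuration that only sets `bits=4` behaves the same as the example above on a machine where the exllama kernels are installed. The following is just a minimal sketch of that equivalence:

```python
from transformers import GPTQConfig

# use_exllama defaults to True, so this is equivalent to
# GPTQConfig(bits=4, use_exllama=True) when the exllama kernels are installed.
gptq_config = GPTQConfig(bits=4)
```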
Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, as in the example above, so that some layers are offloaded to the CPU.
Is this also true when exllama is enabled?

Using the exllama kernels only speeds up inference of the fitted model, since they operate on the 4-bit GPTQ weights for faster computation.
## Disclaimers and ongoing work
I don't fully follow, sorry. If the backend is designed for 4 bits and `use_exllama` is `True` by default, then it means [...] Is that correct? If it is, then I'd simply mention in a paragraph that exllama will be used when installed, and wouldn't provide a code example that might confuse readers.
The exllama kernels are passed through the GPTQConfig object. Simply passing the GPTQConfig would do the trick for Llama-based LLMs, but the GPTQConfig object still needs to be passed.
I created the GPTQConfig with the other parameters defined to help educate readers about some basic parameters of the GPTQConfig object when using exllama kernels.
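For reference, a config that spells out a few of those basic parameters might look like the sketch below; the values are illustrative assumptions, not the settings used in this PR, and must match how the checkpoint was quantized.

```python
from transformers import GPTQConfig

# Sketch: a GPTQConfig with a few common parameters made explicit.
# Values are illustrative and must match the checkpoint's quantization setup.
gptq_config = GPTQConfig(
    bits=4,            # 4-bit weights, required by the exllama backend
    group_size=128,    # quantization group size of the checkpoint
    desc_act=False,    # activation-order quantization flag of the checkpoint
    use_exllama=True,  # use the exllama kernels for faster 4-bit inference
)
```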