Commit 196cc92

committed
Update documentation from main repository
1 parent cd0a76b commit 196cc92

File tree

1 file changed

+48
-1
lines changed


docs/api/cli.md

Lines changed: 48 additions & 1 deletion
@@ -154,6 +154,52 @@ This file can be incomplete, and missing sections will be filled in by the defau
 }
 ```
 
+##### Example Quantization Configuration (`config.json`)
+
+`quantization_config` can be obtained from any quantization configuration used in `transformers` via its `.to_json_file(filename)` method:
+
+```python
+from transformers import BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+quantization_config.to_json_file("quantization_config.json")
+```
+
+Then copy it into `config.json`:
+
+```json
+{
+    "model": "",
+    "backend": "transformers",
+    "num_gpus": 1,
+    "auto_scaling_config": {
+        "metric": "concurrency",
+        "target": 1,
+        "min_instances": 0,
+        "max_instances": 10,
+        "keep_alive": 0
+    },
+    "backend_config": {
+        "pretrained_model_name_or_path": "",
+        "device_map": "auto",
+        "torch_dtype": "float16",
+        "hf_model_class": "AutoModelForCausalLM",
+        "quantization_config": {
+            "_load_in_4bit": false,
+            "_load_in_8bit": true,
+            "bnb_4bit_compute_dtype": "float32",
+            "bnb_4bit_quant_storage": "uint8",
+            "bnb_4bit_quant_type": "fp4",
+            "bnb_4bit_use_double_quant": false,
+            "llm_int8_enable_fp32_cpu_offload": false,
+            "llm_int8_has_fp16_weight": false,
+            "llm_int8_skip_modules": null,
+            "llm_int8_threshold": 6.0,
+            "load_in_4bit": false,
+            "load_in_8bit": true,
+            "quant_method": "bitsandbytes"
+        }
+    }
+}
+```
+
 Below is a description of all the fields in config.json.
 
 | Field | Description |
@@ -174,6 +220,7 @@ Below is a description of all the fields in config.json.
 | backend_config.hf_model_class | HuggingFace model class. |
 | backend_config.enable_lora | Set to true to enable loading LoRA adapters during inference. |
 | backend_config.lora_adapters | A dictionary of LoRA adapters in the format `{name: path}`, where each path is a local or Hugging Face-hosted LoRA adapter directory. |
+| backend_config.quantization_config | A dictionary specifying the desired `BitsAndBytesConfig`. Can be obtained by saving a `BitsAndBytesConfig` to JSON via `BitsAndBytesConfig.to_json_file(filename)`. Defaults to None. |
 
 ### sllm-cli delete
 Delete deployed models by name, or delete specific LoRA adapters associated with a base model.
@@ -406,4 +453,4 @@ sllm-cli status
 #### Example
 ```bash
 sllm-cli status
-```
+```
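The workflow in the added documentation (save a `BitsAndBytesConfig` to JSON, then copy it under `backend_config` in `config.json`) can also be scripted. A minimal sketch, assuming plain `dict`s loaded from those files; `merge_quantization_config` is a hypothetical helper, not part of `sllm-cli` or `transformers`:

```python
import json


def merge_quantization_config(config: dict, quantization_config: dict) -> dict:
    # Hypothetical helper: nest a saved quantization config under
    # backend_config, matching the layout of the documented config.json.
    config.setdefault("backend_config", {})["quantization_config"] = quantization_config
    return config


# Pared-down stand-ins for config.json and quantization_config.json;
# in practice these would come from json.load() on the real files.
config = {
    "model": "",
    "backend": "transformers",
    "backend_config": {"device_map": "auto"},
}
quant = {"load_in_8bit": True, "load_in_4bit": False, "quant_method": "bitsandbytes"}

merged = merge_quantization_config(config, quant)
print(json.dumps(merged, indent=4))
```

Because `setdefault` is used, the helper works whether or not `backend_config` already exists in the deployment config.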
