🚀 The feature, motivation and pitch
torchchat provides quantization functionality but the interaction isn't ideal.
Currently
Currently you can run generate with a --quantize flag and it quantizes the model before running inference. This means you have to quantize (which is expensive) every time you want to run generate. It also means you can't produce a quantized model for use in another project.
python3 torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --prompt "Once upon a time," --max-new-tokens 256 --num-samples 3 --seed 42
Expectation
Ideally you should be able to save/cache the quantized model during generate, and run a dedicated command that just quantizes the model and saves the output.
Add a quantize command
python3 torchchat.py quantize llama2 '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --output-file llama2-4b.pt
- the quantize.md file should enumerate all options available in a table along with a suggested option
- the quantize.md file should provide examples
- the readme.md file should include one example for 4-bit quantization (often used for benchmarking)
- optional --output-file (if the flag isn't present, just store it in our default model location with a well-defined naming scheme, e.g. {modelname}-{quant}.pt); see the sketch after this list
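
As a rough illustration, here is a minimal sketch of what the subcommand and default naming could look like. The argument names, cache directory, and naming logic below are assumptions for discussion, not existing torchchat code.

```python
import argparse
import json
from pathlib import Path

# Assumed cache location; torchchat's real default model directory may differ.
DEFAULT_MODEL_DIR = Path.home() / ".torchchat" / "model-cache"


def default_output_path(model_name: str, quant_spec: dict) -> Path:
    # Build a well-defined name from the primary quantization key,
    # e.g. {"linear:int4": ...} -> llama2-linear-int4.pt
    quant_tag = next(iter(quant_spec)).replace(":", "-")
    return DEFAULT_MODEL_DIR / f"{model_name}-{quant_tag}.pt"


def main() -> None:
    parser = argparse.ArgumentParser(prog="torchchat.py quantize")
    parser.add_argument("model", help="model name, e.g. llama2")
    parser.add_argument("quant_spec", help="inline JSON quantization spec")
    parser.add_argument("--output-file", type=Path, default=None)
    args = parser.parse_args()

    quant_spec = json.loads(args.quant_spec)
    output = args.output_file or default_output_path(args.model, quant_spec)
    output.parent.mkdir(parents=True, exist_ok=True)
    # Actual quantization would happen here; this sketch only resolves the output path.
    print(f"quantizing {args.model} with {quant_spec} -> {output}")


if __name__ == "__main__":
    main()
```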
Cache the quantized versions when you run using --quantize
- the list command should show the quantized versions
- the remove command should delete the quantized version (see the sketch below)
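
A minimal sketch of how list and remove could account for cached quantized variants, assuming the {modelname}-{quant}.pt scheme above; the cache directory and helper names are hypothetical.

```python
from pathlib import Path

# Assumed cache location and {modelname}-{quant}.pt naming; both are illustrative.
MODEL_DIR = Path.home() / ".torchchat" / "model-cache"


def list_models() -> None:
    # Print base checkpoints and any cached quantized variants stored next to them.
    for path in sorted(MODEL_DIR.glob("*.pt")):
        print(path.stem)  # e.g. "llama2" or "llama2-linear-int4"


def remove_model(model_name: str) -> None:
    # Delete the base checkpoint and every quantized variant derived from it.
    for path in MODEL_DIR.glob(f"{model_name}*.pt"):
        path.unlink()
        print(f"removed {path.name}")
```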
Additionally
- common quantization recipes are stored in a JSON file and can be passed in, e.g. torchchat/config/quantization/llama-4bit.json
- with the contents {"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}
- allowing for the command python3 torchchat.py quantize llama2 --config config/quantization/llama-4bit.json --output-file llama2-4bit.pt (a sketch of resolving the recipe follows this list)
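
A small sketch of how a recipe file could be resolved alongside the inline spec; the helper name and the precedence (config file over inline JSON) are assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def load_quant_recipe(inline_json: Optional[str], config_path: Optional[str]) -> dict:
    # Prefer an explicit --config recipe file; fall back to the inline JSON spec.
    if config_path is not None:
        return json.loads(Path(config_path).read_text())
    if inline_json is not None:
        return json.loads(inline_json)
    raise ValueError("expected an inline quantization spec or a --config file")


# Example:
# load_quant_recipe(None, "config/quantization/llama-4bit.json")
# -> {"linear:int4": {"groupsize": 128}, "precision": {"dtype": "float16"},
#     "executor": {"accelerator": "mps"}}
```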
Alternatives
No response
Additional context
No response
RFC (Optional)
No response