
RFC: Make quantization a first class feature #1032

Open
@byjlw

Description

🚀 The feature, motivation and pitch

torchchat provides quantization functionality but the interaction isn't ideal.

Currently

Currently you can run generate with a --quantize flag, which quantizes the model before running inference. This means you pay the (expensive) quantization cost every time you run generate. It also means you can't produce a quantized model for use in another project.

python3 torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --prompt "Once upon a time," --max-new-tokens 256 --num-samples 3 --seed 42

Expectation

Ideally you should be able to save/cache the quantized model during generate, and also run a dedicated command that just quantizes the model and writes out the result.

Add a quantize command

python3 torchchat.py quantize llama2 '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --output-file llama2-4bit.pt
  • the quantize.md file should enumerate all available options in a table, along with a suggested option
  • the quantize.md file should provide examples
  • the readme.md file should include one 4-bit example (often used for benchmarking)
  • optional --output-file (if the flag isn't present, store the model in the default model location using a well-defined naming scheme: {modelname}-{quant}.pt; see the sketch after this list)
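
When --output-file is omitted, the default path could be derived roughly as follows. This is a minimal sketch; the cache directory and the exact tag format are assumptions for illustration, not existing torchchat code:

```python
from pathlib import Path

# Assumed default cache location -- not an actual torchchat path.
DEFAULT_MODEL_DIR = Path.home() / ".torchchat" / "model-cache"

def default_output_path(model_name: str, quant_options: dict) -> Path:
    """Derive the default {modelname}-{quant}.pt path when --output-file is omitted."""
    parts = []
    for scheme, params in quant_options.items():
        if scheme in ("precision", "executor"):
            continue  # dtype/accelerator settings don't belong in the file name
        tag = scheme.replace(":", "-")  # e.g. "linear:int4" -> "linear-int4"
        if isinstance(params, dict) and "groupsize" in params:
            tag += f"-g{params['groupsize']}"
        parts.append(tag)
    quant_tag = "_".join(parts) or "quantized"
    return DEFAULT_MODEL_DIR / f"{model_name}-{quant_tag}.pt"

# Example:
# default_output_path("llama2", {"linear:int4": {"groupsize": 128}})
# -> ~/.torchchat/model-cache/llama2-linear-int4-g128.pt
```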

Cache the quantized versions when running with --quantize

  • the list command should show the quantized versions
  • the remove command should delete the quantized version (see the example below)
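
With caching in place, list and remove could surface the cached artifacts, e.g. (the output format here is purely illustrative, not current torchchat behavior):

```
$ python3 torchchat.py list
llama2
llama2-linear-int4-g128    (quantized, cached from llama2)

$ python3 torchchat.py remove llama2-linear-int4-g128
Removed cached quantized model llama2-linear-int4-g128
```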

Additionally

  • common quantization recipes are stored in JSON files and can be passed in, e.g. torchchat/config/quantization/llama-4bit.json with the contents {"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}
  • allowing for the command: python3 torchchat.py quantize llama2 --config config/quantization/llama-4bit.json --output-file llama2-4bit.pt
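
A minimal sketch of how --config could resolve to the same options dict as an inline JSON string (load_quant_config is a hypothetical helper for illustration, not existing torchchat code):

```python
import json
from pathlib import Path

def load_quant_config(inline_json=None, config_path=None):
    """Resolve quantization options from an inline JSON string or a --config file."""
    if inline_json and config_path:
        raise ValueError("pass either an inline JSON string or --config, not both")
    if config_path:
        return json.loads(Path(config_path).read_text())
    return json.loads(inline_json or "{}")

# Example:
# load_quant_config(config_path="torchchat/config/quantization/llama-4bit.json")
# == {"linear:int4": {"groupsize": 128},
#     "precision": {"dtype": "float16"},
#     "executor": {"accelerator": "mps"}}
```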

Alternatives

No response

Additional context

No response

RFC (Optional)

No response
