🚀 The feature, motivation and pitch
torchchat provides quantization functionality but the interaction isn't ideal.
Currently
Currently you can run generate with a --quantize flag and it quantizes the model before running inference. This means you have to quantize (which is expensive) every time you want to run generate. It also means you can't produce a quantized model for use in another project.
python3 torchchat.py generate llama2 --quantize '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --prompt "Once upon a time," --max-new-tokens 256 --num-samples 3 --seed 42
Expectation
Ideally you should be able to save/cache the quantized model during generate, and run a dedicated command that just quantizes the model and saves the output.
Add a quantize command
python3 torchchat.py quantize llama2 '{"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}' --output-file llama2-4b.pt
- the quantize.md file should enumerate all options available in a table along with a suggested option
- the quantize.md file should provide examples
- the readme.md file should include one example for 4-bit quantization (often used for benchmarking)
- optional --output-file (if the flag isn't present, just store it in our default model location with a well-defined naming scheme, e.g. {modelname}-{quant}.pt); see the sketch after this list
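
As a rough illustration, here is a minimal sketch of what the subcommand and default naming could look like. The argument names, cache directory, and naming logic below are assumptions for discussion, not existing torchchat code.

```python
import argparse
import json
from pathlib import Path

# Assumed cache location; torchchat's real default model directory may differ.
DEFAULT_MODEL_DIR = Path.home() / ".torchchat" / "model-cache"


def default_output_path(model_name: str, quant_spec: dict) -> Path:
    # Build a well-defined name from the primary quantization key,
    # e.g. {"linear:int4": ...} -> llama2-linear-int4.pt
    quant_tag = next(iter(quant_spec)).replace(":", "-")
    return DEFAULT_MODEL_DIR / f"{model_name}-{quant_tag}.pt"


def main() -> None:
    parser = argparse.ArgumentParser(prog="torchchat.py quantize")
    parser.add_argument("model", help="model name, e.g. llama2")
    parser.add_argument("quant_spec", help="inline JSON quantization spec")
    parser.add_argument("--output-file", type=Path, default=None)
    args = parser.parse_args()

    quant_spec = json.loads(args.quant_spec)
    output = args.output_file or default_output_path(args.model, quant_spec)
    output.parent.mkdir(parents=True, exist_ok=True)
    # Actual quantization would happen here; this sketch only resolves the output path.
    print(f"quantizing {args.model} with {quant_spec} -> {output}")


if __name__ == "__main__":
    main()
```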
Cache the quantized versions when you run using --quantize
- the list command should show the quantized versions
- the remove command should delete the quantized version (see the sketch below)
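
A minimal sketch of how list and remove could account for cached quantized variants, assuming the {modelname}-{quant}.pt scheme above; the cache directory and helper names are hypothetical.

```python
from pathlib import Path

# Assumed cache location and {modelname}-{quant}.pt naming; both are illustrative.
MODEL_DIR = Path.home() / ".torchchat" / "model-cache"


def list_models() -> None:
    # Print base checkpoints and any cached quantized variants stored next to them.
    for path in sorted(MODEL_DIR.glob("*.pt")):
        print(path.stem)  # e.g. "llama2" or "llama2-linear-int4"


def remove_model(model_name: str) -> None:
    # Delete the base checkpoint and every quantized variant derived from it.
    for path in MODEL_DIR.glob(f"{model_name}*.pt"):
        path.unlink()
        print(f"removed {path.name}")
```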
Additionally
- common quantization recipes are stored in a JSON file and can be passed in, e.g. torchchat/config/quantization/llama-4bit.json
- with the contents {"linear:int4": {"groupsize": 128}, "precision": {"dtype":"float16"}, "executor":{"accelerator":"mps"}}
- allowing for the command python3 torchchat.py quantize llama2 --config config/quantization/llama-4bit.json --output-file llama2-4bit.pt (a sketch of resolving the recipe follows this list)
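
A small sketch of how a recipe file could be resolved alongside the inline spec; the helper name and the precedence (config file over inline JSON) are assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def load_quant_recipe(inline_json: Optional[str], config_path: Optional[str]) -> dict:
    # Prefer an explicit --config recipe file; fall back to the inline JSON spec.
    if config_path is not None:
        return json.loads(Path(config_path).read_text())
    if inline_json is not None:
        return json.loads(inline_json)
    raise ValueError("expected an inline quantization spec or a --config file")


# Example:
# load_quant_recipe(None, "config/quantization/llama-4bit.json")
# -> {"linear:int4": {"groupsize": 128}, "precision": {"dtype": "float16"},
#     "executor": {"accelerator": "mps"}}
```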
Alternatives
No response
Additional context
No response
RFC (Optional)
No response