## 🐞 Describe the Bug
When converting models, configuration options that are not part of the architecture config are not imported from the Hugging Face model's `config.json`.
This creates an unexpected and undocumented requirement for manual configuration, which can lead to costly mistakes.
The following critical options are affected:

- `window_size` and `max_window_layers` for models trained with windowed attention (e.g. Qwen 2), see Qwen2 converter #163.
- `router_aux_loss_coef` for MoEs such as Mixtral.
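
For reference, a minimal sketch of where these settings live on the HF side; the model IDs are only examples, and the attributes shown (`sliding_window`, `max_window_layers`, `router_aux_loss_coef`) are the HF config names that would need to be mapped onto the Fast-LLM options above:

```python
# Sketch: inspect the HF config fields that, per this issue, are not carried
# over by the converter. Model IDs below are examples, not requirements.
from transformers import AutoConfig

qwen2 = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")
# Windowed-attention settings (would map onto Fast-LLM's window_size /
# max_window_layers options).
print(qwen2.sliding_window, qwen2.max_window_layers)

mixtral = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# MoE auxiliary loss coefficient.
print(mixtral.router_aux_loss_coef)
```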
Currently, the load-from-HF-model feature suggests seamless integration, but this bug prevents complete and accurate model conversion. Users are likely to assume the conversion will "just work" and may unknowingly train models with incorrect configurations.
## 🔄 Steps to Reproduce
- **Load a HF model using Fast-LLM:** use a HF model that requires non-architecture-specific parameters (e.g., `window_size` for sliding window attention).
- **Observe missing configurations:** check the output model configuration (see the sketch after this list). Notice that parameters not included in the architecture config are missing or set to default values, potentially breaking the model.
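
One rough way to check the second step, assuming the HF checkpoint sits in `qwen2-hf/` and the converted config was exported somewhere readable as YAML; the paths, the flat key layout, and the Fast-LLM option names below are assumptions for illustration, not Fast-LLM's actual output format:

```python
# Sketch of the check in step 2: compare the source config.json with the
# converted config. Paths and the converted-config layout are assumptions.
import json

import yaml  # pip install pyyaml

with open("qwen2-hf/config.json") as f:
    hf_config = json.load(f)

with open("qwen2-fast-llm/config.yaml") as f:
    converted = yaml.safe_load(f)

# HF field name -> (assumed) Fast-LLM option name.
expected = {
    "sliding_window": "window_size",
    "max_window_layers": "max_window_layers",
}
for hf_key, fl_key in expected.items():
    print(f"{hf_key}={hf_config.get(hf_key)} -> {fl_key}={converted.get(fl_key, 'MISSING')}")
```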
## 🎯 Expected Behavior
Fast-LLM should correctly import all relevant configuration options from the Hugging Face `config.json`, not just those in the architecture configuration. This ensures that models are fully converted and behave as expected, without requiring manual intervention or hidden knowledge about configuration quirks.
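
As an illustration of the expected behavior, here is a minimal sketch of the kind of merge the importer should perform after building the architecture config; the field mapping, function name, and dict-based configs are hypothetical, not Fast-LLM's actual API:

```python
# Minimal sketch of the expected import behavior: copy non-architecture
# options from the HF config into the converted config instead of leaving
# them at defaults. Names below are illustrative, not Fast-LLM's real API.
NON_ARCHITECTURE_FIELDS = {
    # HF field name -> (assumed) Fast-LLM option name.
    "sliding_window": "window_size",
    "max_window_layers": "max_window_layers",
    "router_aux_loss_coef": "router_aux_loss_coef",
}

def import_extra_options(hf_config: dict, converted_config: dict) -> dict:
    """Merge HF fields that fall outside the architecture config."""
    for hf_key, target_key in NON_ARCHITECTURE_FIELDS.items():
        if hf_config.get(hf_key) is not None:
            converted_config[target_key] = hf_config[hf_key]
    return converted_config
```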