Releases: huggingface/accelerate
Patch: v1.5.2
Bug Fixes:
- Fixed an issue with `torch.get_default_device()` requiring a higher version than what we support
- Fixed a broken `pytest` import in prod
Full Changelog: v1.5.0...v1.5.2
v1.5.0: HPU support
HPU Support
- Adds HPU accelerator support to 🤗 Accelerate
What's Changed
- [bug] fix device index bug for model training loaded with bitsandbytes by @faaany in #3408
- [docs] add the missing `import torch` by @faaany in #3396
- minor doc fixes by @nbroad1881 in #3365
- fix: ensure CLI args take precedence over config file. by @cyr0930 in #3409
- fix: Add `device=torch.get_default_device()` in `torch.Generator`s by @saforem2 in #3420
- Add Tecorigin SDAA accelerator support by @siqi654321 in #3330
- fix typo : thier -> their by @hackty in #3423
- Fix quality by @muellerzr in #3424
- Distributed inference example for llava_next by @VladOS95-cyber in #3417
- HPU support by @IlyasMoutawwakil in #3378
New Contributors
- @cyr0930 made their first contribution in #3409
- @saforem2 made their first contribution in #3420
- @siqi654321 made their first contribution in #3330
- @hackty made their first contribution in #3423
- @VladOS95-cyber made their first contribution in #3417
- @IlyasMoutawwakil made their first contribution in #3378
Full Changelog: v1.4.0...v1.5.0
v1.4.0: `torchao` FP8, TP & DataLoader support, memory leak fix
`torchao` FP8, initial Tensor Parallel support, and memory leak fixes
`torchao` FP8
This release introduces a new FP8 API and brings in a new backend: `torchao`. To use it, pass `AORecipeKwargs` to the `Accelerator` while setting `mixed_precision="fp8"`. This is initial support; as it matures, we will incorporate more into it (such as `accelerate config`/yaml support) in future releases. See our benchmark examples here.
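A minimal sketch of that usage (handler defaults are assumed, and the training objects are defined elsewhere; see the benchmarks and #3348 for authoritative usage):

```python
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs

# Opt in to the torchao FP8 backend by pairing mixed_precision="fp8"
# with an AORecipeKwargs kwargs handler.
accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[AORecipeKwargs()],
)

# Training objects are then prepared as usual:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```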
TensorParallel
We have initial support for an in-house solution to TP when working with accelerate dataloaders. Check out the PR here.
Bug fixes
- fix triton version check by @faaany in #3345
- fix torch_dtype in estimate memory by @SunMarc in #3383
- works for fp8 with deepspeed by @XiaobingSuper in #3361
- [memory leak] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in #3391
What's Changed
- fix triton version check by @faaany in #3345
- [tests] enable BNB test cases in `tests/test_quantization.py` on XPU by @faaany in #3349
- [Dev] Update release directions by @muellerzr in #3352
- [tests] make cuda-only test work on other hardware accelerators by @faaany in #3302
- [tests] remove `require_non_xpu` test markers by @faaany in #3301
- Support more functionalities for MUSA backend by @fmo-mt in #3359
- [tests] enable more bnb tests on XPU by @faaany in #3350
- feat: support tensor parallel & Data loader by @kmehant in #3173
- DeepSpeed github repo move sync by @stas00 in #3376
- [tests] Fix bnb cpu error by @faaany in #3351
- fix torch_dtype in estimate memory by @SunMarc in #3383
- works for fp8 with deepspeed by @XiaobingSuper in #3361
- fix: typos in documentation files by @maximevtush in #3388
- [examples] upgrade code for seed setting by @faaany in #3387
- [memory leak] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in #3391
- add xpu check in `get_quantized_model_device_map` by @faaany in #3397
- Torchao float8 training by @muellerzr in #3348
New Contributors
- @kmehant made their first contribution in #3173
- @XiaobingSuper made their first contribution in #3361
- @maximevtush made their first contribution in #3388
Full Changelog: v1.3.0...v1.4.0
v1.3.0 Bug fixes + Require torch 2.0
Torch 2.0
As it's been ~2 years since torch 2.0 was first released, we are now requiring it as the minimum version for Accelerate, as was similarly done in `transformers` as of its last release.
Core
- [docs] no hard-coding cuda by @faaany in #3270
- fix load_state_dict for npu by @ji-huazhong in #3211
- Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in #3282
- [tests] make cuda-only test case device-agnostic by @faaany in #3340
- latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
- add torchdata version check to avoid "in_order" error by @faaany in #3344
- [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
- dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
- feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324
Big Modeling
- Fix test_nested_hook by @SunMarc in #3289
- correct the return statement of _init_infer_auto_device_map by @Nech-C in #3279
- Use torch.xpu.mem_get_info for XPU by @dvrogozh in #3275
- Ensure that tied parameter is children of module by @pablomlago in #3327
- Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in #3332
- Fix offload generate tests by @SunMarc in #3334
Examples
- Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193
Full Changelog
What's Changed
- [docs] no hard-coding cuda by @faaany in #3270
- fix load_state_dict for npu by @ji-huazhong in #3211
- Fix test_nested_hook by @SunMarc in #3289
- correct the return statement of _init_infer_auto_device_map by @Nech-C in #3279
- Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193
- Use torch.xpu.mem_get_info for XPU by @dvrogozh in #3275
- Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in #3282
- Ensure that tied parameter is children of module by @pablomlago in #3327
- Bye bye torch <2 by @muellerzr in #3331
- Fixup docker build err by @muellerzr in #3333
- feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324
- Fix offload generate tests by @SunMarc in #3334
- [tests] make cuda-only test case device-agnostic by @faaany in #3340
- latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
- Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in #3332
- add torchdata version check to avoid "in_order" error by @faaany in #3344
- [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
- dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
New Contributors
- @ylacombe made their first contribution in #3193
- @ggoggam made their first contribution in #3282
- @pablomlago made their first contribution in #3327
- @tengomucho made their first contribution in #3324
- @suchot made their first contribution in #3296
Full Changelog: v1.2.1...v1.3.0
v1.2.1: Patchfix
- fix: add max_memory to _init_infer_auto_device_map's return statement in #3279 by @Nech-C
- fix load_state_dict for npu in #3211 by @statelesshz
Full Changelog: v1.2.0...v1.2.1
v1.2.0: Bug Squashing & Fixes across the board
Core
- enable `find_executable_batch_size` on XPU by @faaany in #3236
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
Big Modeling
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- Remove hook for bnb 4-bit by @SunMarc in #3223
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- Update deferring_execution.md by @max-yue in #3262
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
DeepSpeed
- Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
Documentation
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
New Contributors
- @winglian made their first contribution in #3266
- @max-yue made their first contribution in #3262
- @as12138 made their first contribution in #3261
- @relh made their first contribution in #3259
- @wejoncy made their first contribution in #3248
- @henryhmko made their first contribution in #3274
Full Changelog
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- remove hook for bnb 4-bit by @SunMarc in #3223
- enable `find_executable_batch_size` on XPU by @faaany in #3236
- take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- [docs] update code in tracking documentation by @faaany in #3235
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
- Update deferring_execution.md by @max-yue in #3262
- Fix: Resolve #3257 by @as12138 in #3261
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
- [docs] update set-seed by @faaany in #3228
- [docs] fix typo by @faaany in #3221
- [docs] use real path for `checkpoint` by @faaany in #3220
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
Code Diff
Release diff: v1.1.1...v1.2.0
v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes
Internals:
- Allow for a `data_seed` argument in #3150
- Trigger `weights_only=True` by default for all compatible objects when checkpointing and saving with `torch.save` in #3036
- Handle negative values for `dim` input in `pad_across_processes` in #3114 (see the sketch after this list)
- Enable cpu bnb distributed lora finetune in #3159
DeepSpeed
- Support torch dynamo for deepspeed>=0.14.4 in #3069
Megatron
- update Megatron-LM plugin code to version 0.8.0 or higher in #3174
Big Model Inference
- New `has_offloaded_params` utility added in #3188 (a quick sketch follows below)
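A quick sketch of the new utility, assuming the semantics described in #3188 (it reports whether big model inference has attached an offloading hook to a module); `cpu_offload` is used here only to produce an offloaded module:

```python
import torch
from accelerate.big_modeling import cpu_offload
from accelerate.utils import has_offloaded_params

module = torch.nn.Linear(8, 8)
print(has_offloaded_params(module))  # False: no offloading hook attached

# Offload the module's weights via big model inference, then check again:
cpu_offload(module)
print(has_offloaded_params(module))  # True: an offloading hook is now attached
```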
Examples
- Florence2 distributed inference example in #3123
Full Changelog
- Handle negative values for `dim` input in `pad_across_processes` by @mariusarvinte in #3114
- Fixup DS issue with weakref by @muellerzr in #3143
- Refactor scaler to util by @muellerzr in #3142
- DS fix, continued by @muellerzr in #3145
- Florence2 distributed inference example by @hlky in #3123
- POC: Allow for a `data_seed` by @muellerzr in #3150
- Adding multi gpu speech generation by @dame-cell in #3149
- support torch dynamo for deepspeed>=0.14.4 by @oraluben in #3069
- Fixup Zero3 + `save_model` by @muellerzr in #3146
- Trigger `weights_only=True` by default for all compatible objects by @muellerzr in #3036
- Remove broken dynamo test by @oraluben in #3155
- fix version check bug in `get_xpu_available_memory` by @faaany in #3165
- enable cpu bnb distributed lora finetune by @jiqing-feng in #3159
- [Utils] `has_offloaded_params` by @kylesayrs in #3188
- fix bnb by @eljandoubi in #3186
- [docs] update neptune API by @faaany in #3181
- docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in #3183
- [docs] use nn.module instead of tensor as model by @faaany in #3157
- Fix typo by @kylesayrs in #3191
- MLU devices: Checks if mlu is available via a cndev-based check which won't trigger the drivers and leave mlu uninitialized by @huismiling in #3187
- update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in #3174
- 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 by @muellerzr in #3194
- Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in #3196
- eliminate dead code by @statelesshz in #3198
- take `torch.nn.Module` model into account when moving to device by @faaany in #3167
- [docs] add xpu part and fix bug in `torchrun` by @faaany in #3166
- Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in #3154
- add the missing xpu for local sgd by @faaany in #3163
- typo fix in big_modeling.py by @a-r-r-o-w in #3207
- [Utils] `align_module_device` by @kylesayrs in #3204
New Contributors
- @mariusarvinte made their first contribution in #3114
- @hlky made their first contribution in #3123
- @dame-cell made their first contribution in #3149
- @kylesayrs made their first contribution in #3188
- @eljandoubi made their first contribution in #3186
- @Rebornix-zero made their first contribution in #3183
- @loadams made their first contribution in #3196
Full Changelog: v1.0.1...v1.1.0
v1.0.1: Bugfix
Bugfixes
- Fixes an issue where the `auto` values were no longer being parsed when using deepspeed
- Fixes a broken test in the deepspeed tests related to the `auto` values
Full Changelog: v1.0.0...v1.0.1
Accelerate 1.0.0 is here!
🚀 Accelerate 1.0 🚀
With `accelerate` 1.0, we are officially stating that the core parts of the API are now "stable" and ready for the future of what the world of distributed training and PyTorch has to handle. With these release notes, we will focus first on the major breaking changes to get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.
To read more, check out our official blog here
Migration assistance
- Passing in `dispatch_batches`, `split_batches`, `even_batches`, and `use_seedable_sampler` to the `Accelerator()` should now be handled by creating an `accelerate.utils.DataLoaderConfiguration()` and passing this to the `Accelerator()` instead (`Accelerator(dataloader_config=DataLoaderConfiguration(...))`); see the sketch after this list
- `Accelerator().use_fp16` and `AcceleratorState().use_fp16` have been removed; this should be replaced by checking `accelerator.mixed_precision == "fp16"`
- `Accelerator().autocast()` no longer accepts a `cache_enabled` argument. Instead, an `AutocastKwargs()` instance should be used which handles this flag (among others), passing it to the `Accelerator` (`Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)])`)
- `accelerate.utils.is_tpu_available` should be replaced with `accelerate.utils.is_torch_xla_available`
- `accelerate.utils.modeling.shard_checkpoint` should be replaced with `split_torch_state_dict_into_shards` from the `huggingface_hub` library
- `accelerate.tqdm.tqdm()` no longer accepts `True`/`False` as the first argument; instead, `main_process_only` should be passed in as a named argument
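Putting the first three items together, a minimal before/after sketch of the migration (the flag values shown are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DataLoaderConfiguration

# Before: Accelerator(split_batches=True, even_batches=False)
# After: dataloader-related flags move onto DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(split_batches=True, even_batches=False)

accelerator = Accelerator(
    dataloader_config=dataloader_config,
    # Before: accelerator.autocast(cache_enabled=True)
    kwargs_handlers=[AutocastKwargs(cache_enabled=True)],
)

# Before: accelerator.use_fp16
is_fp16 = accelerator.mixed_precision == "fp16"
```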
Multiple Model DeepSpeed Support
After many requests, we finally have multiple model DeepSpeed support in Accelerate (though it is quite early still)! Read the full tutorial here; essentially:
When using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). A few examples are below:
Knowledge distillation
(Where we train only one model (the student, under ZeRO-2) and use another for inference (the teacher, under ZeRO-3))
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}
accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)
```
To then select which plugin should be used at a given time (i.e. when calling `prepare`), we call `accelerator.state.select_deepspeed_plugin("name")`, where the first plugin is active by default:
accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)
accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)
Multiple disjoint models
For disjoint models, separate accelerators should be used for each model, and each one's own `.backward()` should be called later:
```python
for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()

    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()
```
FP8
We've enabled MS-AMP support up to FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between the two libraries that prevent them from interoperating easily.
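For reference, a sketch of selecting the MS-AMP engine via the FP8 kwargs handler available in this release (the `opt_level` shown is illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# MS-AMP backend for FP8 mixed precision (non-FSDP setups only, per the note above).
accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(backend="msamp", opt_level="O2")],
)
```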
FSDP
- Fixed FSDP auto_wrap using characters instead of full str for layers
- Re-enable setting state dict type manually
Big Modeling
- Removed cpu restriction for bnb training
What's Changed
- Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in #3075
- Allow DataLoaderAdapter subclasses to be pickled by implementing `__reduce__` by @byi8220 in #3074
- Fix three typos in src/accelerate/data_loader.py by @xiabingquan in #3082
- Re-enable setting state dict type by @muellerzr in #3084
- Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in #3085
- fix bug in `_get_named_modules` by @faaany in #3052
- use the correct available memory API for XPU by @faaany in #3076
- fix `skip_keys` usage in forward hooks by @152334H in #3088
- Update README.md to include distributed image generation gist by @sayakpaul in #3077
- MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in #3095
- Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in #3096
- MS-AMP support (w/o FSDP) by @muellerzr in #3093
- [docs] DataLoaderConfiguration docstring by @stevhliu in #3103
- MAINT: Permission for GH token in stale.yml by @BenjaminBossan in #3102
- [docs] Doc sprint by @stevhliu in #3099
- Update image ref for docs by @muellerzr in #3105
- No more t5 by @muellerzr in #3107
- [docs] More docstrings by @stevhliu in #3108
- 🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in #3098
- POC: multiple model/configuration DeepSpeed support by @muellerzr in #3097
- Fixup test_sync w/ deprecated stuff by @muellerzr in #3109
- Switch to XLA instead of TPU by @SunMarc in #3118
- [tests] skip pippy tests for XPU by @faaany in #3119
- Fixup multiple model DS tests by @muellerzr in #3131
- remove cpu restriction for bnb training by @jiqing-feng in #3062
- fix deprecated `torch.cuda.amp.GradScaler` FutureWarning for pytorch 2.4+ by @Mon-ius in #3132
- 🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in #3136
- only move model to device when model is in cpu and target device is xpu by @faaany in #3133
- fix tip brackets typo by @davanstrien in #3129
- typo of "scalar" instead of "scaler" by @tonyzhaozh in #3116
- MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in #3112
New Contributors
- @xiabingquan made their first contribution in #3082
- @a-r-r-o-w made their first contribution in #3085
- @152334H made their first contribution in #3088
- @sayakpaul made their first contribution in #3077
- @Mon-ius made their first contribution in #3132
- @davanstrien made their first contribution in #3129
- @tonyzhaozh made their first contribution in #3116
Full Changelog: v0.34.2...v1.0.0
v0.34.1 Patchfix
Bug fixes
- Fixes an issue where processed `DataLoaders` could no longer be pickled in #3074 thanks to @byi8220
- Fixes an issue when using FSDP where `default_transformers_cls_names_to_wrap` would separate `_no_split_modules` by characters instead of keeping it as a list of layer names in #3075
Full Changelog: v0.34.0...v0.34.1