Releases: huggingface/accelerate
Patch: v1.5.2
Bug Fixes:
- Fixed an issue with `torch.get_default_device()` requiring a higher version than what we support
- Fixed a broken `pytest` import in prod
Full Changelog: v1.5.0...v1.5.2
v1.5.0: HPU support
HPU Support
- Adds HPU accelerator support to 🤗 Accelerate
What's Changed
- [bug] fix device index bug for model training loaded with bitsandbytes by @faaany in #3408
- [docs] add the missing `import torch` by @faaany in #3396
- minor doc fixes by @nbroad1881 in #3365
- fix: ensure CLI args take precedence over config file. by @cyr0930 in #3409
- fix: Add `device=torch.get_default_device()` in `torch.Generator`s by @saforem2 in #3420
- Add Tecorigin SDAA accelerator support by @siqi654321 in #3330
- fix typo : thier -> their by @hackty in #3423
- Fix quality by @muellerzr in #3424
- Distributed inference example for llava_next by @VladOS95-cyber in #3417
- HPU support by @IlyasMoutawwakil in #3378
New Contributors
- @cyr0930 made their first contribution in #3409
- @saforem2 made their first contribution in #3420
- @siqi654321 made their first contribution in #3330
- @hackty made their first contribution in #3423
- @VladOS95-cyber made their first contribution in #3417
- @IlyasMoutawwakil made their first contribution in #3378
Full Changelog: v1.4.0...v1.5.0
v1.4.0: `torchao` FP8, TP & DataLoader support, memory leak fix
`torchao` FP8, initial Tensor Parallel support, and memory leak fixes
`torchao` FP8
This release introduces a new FP8 API and brings in a new backend: `torchao`. To use it, pass `AORecipeKwargs` to the `Accelerator` while setting `mixed_precision="fp8"`. This is initial support; as it matures, we will incorporate more into it (such as `accelerate config`/yaml support) in future releases. See our benchmark examples here.
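A minimal sketch of that usage (handler defaults are assumed, and the training objects are defined elsewhere; see the benchmarks and #3348 for authoritative usage):

```python
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs

# Opt in to the torchao FP8 backend by pairing mixed_precision="fp8"
# with an AORecipeKwargs kwargs handler.
accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[AORecipeKwargs()],
)

# Training objects are then prepared as usual:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```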
TensorParallel
We have initial support for an in-house solution to TP when working with accelerate dataloaders. Check out the PR here.
Bug fixes
- fix triton version check by @faaany in #3345
- fix torch_dtype in estimate memory by @SunMarc in #3383
- works for fp8 with deepspeed by @XiaobingSuper in #3361
- [memory leak] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in #3391
What's Changed
- fix triton version check by @faaany in #3345
- [tests] enable BNB test cases in `tests/test_quantization.py` on XPU by @faaany in #3349
- [Dev] Update release directions by @muellerzr in #3352
- [tests] make cuda-only test work on other hardware accelerators by @faaany in #3302
- [tests] remove `require_non_xpu` test markers by @faaany in #3301
- Support more functionalities for MUSA backend by @fmo-mt in #3359
- [tests] enable more bnb tests on XPU by @faaany in #3350
- feat: support tensor parallel & Data loader by @kmehant in #3173
- DeepSpeed github repo move sync by @stas00 in #3376
- [tests] Fix bnb cpu error by @faaany in #3351
- fix torch_dtype in estimate memory by @SunMarc in #3383
- works for fp8 with deepspeed by @XiaobingSuper in #3361
- fix: typos in documentation files by @maximevtush in #3388
- [examples] upgrade code for seed setting by @faaany in #3387
- [memory leak] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in #3391
- add xpu check in `get_quantized_model_device_map` by @faaany in #3397
- Torchao float8 training by @muellerzr in #3348
New Contributors
- @kmehant made their first contribution in #3173
- @XiaobingSuper made their first contribution in #3361
- @maximevtush made their first contribution in #3388
Full Changelog: v1.3.0...v1.4.0
v1.3.0 Bug fixes + Require torch 2.0
Torch 2.0
As it's been ~2 years since torch 2.0 was first released, we are now requiring it as the minimum version for Accelerate, as was similarly done in `transformers` as of its last release.
Core
- [docs] no hard-coding cuda by @faaany in #3270
- fix load_state_dict for npu by @ji-huazhong in #3211
- Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in #3282
- [tests] make cuda-only test case device-agnostic by @faaany in #3340
- latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
- add torchdata version check to avoid "in_order" error by @faaany in #3344
- [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
- dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
- feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324
Big Modeling
- Fix test_nested_hook by @SunMarc in #3289
- correct the return statement of _init_infer_auto_device_map by @Nech-C in #3279
- Use torch.xpu.mem_get_info for XPU by @dvrogozh in #3275
- Ensure that tied parameter is children of module by @pablomlago in #3327
- Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in #3332
- Fix offload generate tests by @SunMarc in #3334
Examples
- Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193
Full Changelog
What's Changed
- [docs] no hard-coding cuda by @faaany in #3270
- fix load_state_dict for npu by @ji-huazhong in #3211
- Fix test_nested_hook by @SunMarc in #3289
- correct the return statement of _init_infer_auto_device_map by @Nech-C in #3279
- Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193
- Use torch.xpu.mem_get_info for XPU by @dvrogozh in #3275
- Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in #3282
- Ensure that tied parameter is children of module by @pablomlago in #3327
- Bye bye torch <2 by @muellerzr in #3331
- Fixup docker build err by @muellerzr in #3333
- feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324
- Fix offload generate tests by @SunMarc in #3334
- [tests] make cuda-only test case device-agnostic by @faaany in #3340
- latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
- Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in #3332
- add torchdata version check to avoid "in_order" error by @faaany in #3344
- [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
- dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
New Contributors
- @ylacombe made their first contribution in #3193
- @ggoggam made their first contribution in #3282
- @pablomlago made their first contribution in #3327
- @tengomucho made their first contribution in #3324
- @suchot made their first contribution in #3296
Full Changelog: v1.2.1...v1.3.0
v1.2.1: Patchfix
- fix: add max_memory to _init_infer_auto_device_map's return statement in #3279 by @Nech-C
- fix load_state_dict for npu in #3211 by @statelesshz
Full Changelog: v1.2.0...v1.2.1
v1.2.0: Bug Squashing & Fixes across the board
Core
- enable `find_executable_batch_size` on XPU by @faaany in #3236
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
Big Modeling
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- Remove hook for bnb 4-bit by @SunMarc in #3223
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- Update deferring_execution.md by @max-yue in #3262
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
DeepSpeed
- Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
Documentation
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
New Contributors
- @winglian made their first contribution in #3266
- @max-yue made their first contribution in #3262
- @as12138 made their first contribution in #3261
- @relh made their first contribution in #3259
- @wejoncy made their first contribution in #3248
- @henryhmko made their first contribution in #3274
Full Changelog
- Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in #3217
- remove hook for bnb 4-bit by @SunMarc in #3223
- enable `find_executable_batch_size` on XPU by @faaany in #3236
- take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
- [docs] update code in tracking documentation by @faaany in #3235
- Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
- [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
- [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
- Use `numpy._core` instead of `numpy.core` by @qgallouedec in #3247
- Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
- [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
- use XPU instead of GPU in the `accelerate config` prompt text by @faaany in #3268
- support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
- Update deferring_execution.md by @max-yue in #3262
- Fix: Resolve #3257 by @as12138 in #3261
- Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
- Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
- Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in #3253
- Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in #3248
- [docs] update set-seed by @faaany in #3228
- [docs] fix typo by @faaany in #3221
- [docs] use real path for `checkpoint` by @faaany in #3220
- Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274
Code Diff
Release diff: v1.1.1...v1.2.0
v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes
Internals:
- Allow for a `data_seed` argument in #3150
- Trigger `weights_only=True` by default for all compatible objects when checkpointing and saving with `torch.save` in #3036
- Handle negative values for `dim` input in `pad_across_processes` in #3114 (see the sketch after this list)
- Enable cpu bnb distributed lora finetune in #3159
DeepSpeed
- Support torch dynamo for deepspeed>=0.14.4 in #3069
Megatron
- update Megatron-LM plugin code to version 0.8.0 or higher in #3174
Big Model Inference
- New `has_offloaded_params` utility added in #3188 (a quick sketch follows below)
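A quick sketch of the new utility, assuming the semantics described in #3188 (it reports whether big model inference has attached an offloading hook to a module); `cpu_offload` is used here only to produce an offloaded module:

```python
import torch
from accelerate.big_modeling import cpu_offload
from accelerate.utils import has_offloaded_params

module = torch.nn.Linear(8, 8)
print(has_offloaded_params(module))  # False: no offloading hook attached

# Offload the module's weights via big model inference, then check again:
cpu_offload(module)
print(has_offloaded_params(module))  # True: an offloading hook is now attached
```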
Examples
- Florence2 distributed inference example in #3123
Full Changelog
- Handle negative values for `dim` input in `pad_across_processes` by @mariusarvinte in #3114
- Fixup DS issue with weakref by @muellerzr in #3143
- Refactor scaler to util by @muellerzr in #3142
- DS fix, continued by @muellerzr in #3145
- Florence2 distributed inference example by @hlky in #3123
- POC: Allow for a `data_seed` by @muellerzr in #3150
- Adding multi gpu speech generation by @dame-cell in #3149
- support torch dynamo for deepspeed>=0.14.4 by @oraluben in #3069
- Fixup Zero3 + `save_model` by @muellerzr in #3146
- Trigger `weights_only=True` by default for all compatible objects by @muellerzr in #3036
- Remove broken dynamo test by @oraluben in #3155
- fix version check bug in `get_xpu_available_memory` by @faaany in #3165
- enable cpu bnb distributed lora finetune by @jiqing-feng in #3159
- [Utils] `has_offloaded_params` by @kylesayrs in #3188
- fix bnb by @eljandoubi in #3186
- [docs] update neptune API by @faaany in #3181
- docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in #3183
- [docs] use nn.module instead of tensor as model by @faaany in #3157
- Fix typo by @kylesayrs in #3191
- MLU devices: Checks if mlu is available via a cndev-based check which won't trigger the drivers and leave mlu uninitialized by @huismiling in #3187
- update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in #3174
- 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 by @muellerzr in #3194
- Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in #3196
- eliminate dead code by @statelesshz in #3198
- take `torch.nn.Module` model into account when moving to device by @faaany in #3167
- [docs] add xpu part and fix bug in `torchrun` by @faaany in #3166
- Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in #3154
- add the missing xpu for local sgd by @faaany in #3163
- typo fix in big_modeling.py by @a-r-r-o-w in #3207
- [Utils] `align_module_device` by @kylesayrs in #3204
New Contributors
- @mariusarvinte made their first contribution in #3114
- @hlky made their first contribution in #3123
- @dame-cell made their first contribution in #3149
- @kylesayrs made their first contribution in #3188
- @eljandoubi made their first contribution in #3186
- @Rebornix-zero made their first contribution in #3183
- @loadams made their first contribution in #3196
Full Changelog: v1.0.1...v1.1.0
v1.0.1: Bugfix
Bugfixes
- Fixes an issue where the `auto` values were no longer being parsed when using deepspeed
- Fixes a broken test in the deepspeed tests related to the `auto` values
Full Changelog: v1.0.0...v1.0.1
Accelerate 1.0.0 is here!
🚀 Accelerate 1.0 🚀
With `accelerate` 1.0, we are officially stating that the core parts of the API are now "stable" and ready for the future of what the world of distributed training and PyTorch has to handle. With these release notes, we will focus first on the major breaking changes to get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.
To read more, check out our official blog here
Migration assistance
- Passing in `dispatch_batches`, `split_batches`, `even_batches`, and `use_seedable_sampler` to the `Accelerator()` should now be handled by creating an `accelerate.utils.DataLoaderConfiguration()` and passing this to the `Accelerator()` instead (`Accelerator(dataloader_config=DataLoaderConfiguration(...))`); see the sketch after this list
- `Accelerator().use_fp16` and `AcceleratorState().use_fp16` have been removed; this should be replaced by checking `accelerator.mixed_precision == "fp16"`
- `Accelerator().autocast()` no longer accepts a `cache_enabled` argument. Instead, an `AutocastKwargs()` instance should be used which handles this flag (among others), passing it to the `Accelerator` (`Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)])`)
- `accelerate.utils.is_tpu_available` should be replaced with `accelerate.utils.is_torch_xla_available`
- `accelerate.utils.modeling.shard_checkpoint` should be replaced with `split_torch_state_dict_into_shards` from the `huggingface_hub` library
- `accelerate.tqdm.tqdm()` no longer accepts `True`/`False` as the first argument; instead, `main_process_only` should be passed in as a named argument
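Putting the first three items together, a minimal before/after sketch of the migration (the flag values shown are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DataLoaderConfiguration

# Before: Accelerator(split_batches=True, even_batches=False)
# After: dataloader-related flags move onto DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(split_batches=True, even_batches=False)

accelerator = Accelerator(
    dataloader_config=dataloader_config,
    # Before: accelerator.autocast(cache_enabled=True)
    kwargs_handlers=[AutocastKwargs(cache_enabled=True)],
)

# Before: accelerator.use_fp16
is_fp16 = accelerator.mixed_precision == "fp16"
```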
Multiple Model DeepSpeed Support
After many requests, we finally have multiple model DeepSpeed support in Accelerate (though it is quite early still)! Read the full tutorial here; essentially:
When using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). A few examples are below:
Knowledge distillation
(Where we train only one model (the student, under ZeRO-2) and use another for inference (the teacher, under ZeRO-3))
```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}
accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)
```
To then select which plugin should be used at a given time (i.e. when calling `prepare`), we call `accelerator.state.select_deepspeed_plugin("name")`, where the first plugin is active by default:
accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)
accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)
Multiple disjoint models
For disjoint models, separate accelerators should be used for each model, and each one's own `.backward()` should be called later:
```python
for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()

    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()
```
FP8
We've enabled MS-AMP support up to FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between the two libraries that prevent them from interoperating easily.
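For reference, a sketch of selecting the MS-AMP engine via the FP8 kwargs handler available in this release (the `opt_level` shown is illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs

# MS-AMP backend for FP8 mixed precision (non-FSDP setups only, per the note above).
accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(backend="msamp", opt_level="O2")],
)
```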
FSDP
- Fixed FSDP auto_wrap using characters instead of full str for layers
- Re-enable setting state dict type manually
Big Modeling
- Removed cpu restriction for bnb training
What's Changed
- Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in #3075
- Allow DataLoaderAdapter subclasses to be pickled by implementing `__reduce__` by @byi8220 in #3074
- Fix three typos in src/accelerate/data_loader.py by @xiabingquan in #3082
- Re-enable setting state dict type by @muellerzr in #3084
- Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in #3085
- fix bug in `_get_named_modules` by @faaany in #3052
- use the correct available memory API for XPU by @faaany in #3076
- fix `skip_keys` usage in forward hooks by @152334H in #3088
- Update README.md to include distributed image generation gist by @sayakpaul in #3077
- MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in #3095
- Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in #3096
- MS-AMP support (w/o FSDP) by @muellerzr in #3093
- [docs] DataLoaderConfiguration docstring by @stevhliu in #3103
- MAINT: Permission for GH token in stale.yml by @BenjaminBossan in #3102
- [docs] Doc sprint by @stevhliu in #3099
- Update image ref for docs by @muellerzr in #3105
- No more t5 by @muellerzr in #3107
- [docs] More docstrings by @stevhliu in #3108
- 🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in #3098
- POC: multiple model/configuration DeepSpeed support by @muellerzr in #3097
- Fixup test_sync w/ deprecated stuff by @muellerzr in #3109
- Switch to XLA instead of TPU by @SunMarc in #3118
- [tests] skip pippy tests for XPU by @faaany in #3119
- Fixup multiple model DS tests by @muellerzr in #3131
- remove cpu restriction for bnb training by @jiqing-feng in #3062
- fix deprecated `torch.cuda.amp.GradScaler` FutureWarning for pytorch 2.4+ by @Mon-ius in #3132
- 🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in #3136
- only move model to device when model is in cpu and target device is xpu by @faaany in #3133
- fix tip brackets typo by @davanstrien in #3129
- typo of "scalar" instead of "scaler" by @tonyzhaozh in #3116
- MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in #3112
New Contributors
- @xiabingquan made their first contribution in #3082
- @a-r-r-o-w made their first contribution in #3085
- @152334H made their first contribution in #3088
- @sayakpaul made their first contribution in #3077
- @Mon-ius made their first contribution in #3132
- @davanstrien made their first contribution in #3129
- @tonyzhaozh made their first contribution in #3116
Full Changelog: v0.34.2...v1.0.0
v0.34.1 Patchfix
Bug fixes
- Fixes an issue where processed `DataLoaders` could no longer be pickled in #3074 thanks to @byi8220
- Fixes an issue when using FSDP where `default_transformers_cls_names_to_wrap` would separate `_no_split_modules` by characters instead of keeping it as a list of layer names in #3075
Full Changelog: v0.34.0...v0.34.1