Releases: huggingface/accelerate

Patch: v1.5.2

14 Mar 14:16

Bug Fixes:

  • Fixed an issue where torch.get_default_device() required a newer torch version than we support
  • Fixed a broken pytest import in prod

Full Changelog: v1.5.0...v1.5.2

v1.5.0: HPU support

12 Mar 14:18

HPU Support

  • Adds HPU accelerator support for 🤗 Accelerate
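
As a rough, illustrative sketch (assuming a Gaudi machine with the Habana PyTorch bridge installed), the existing device-agnostic API is expected to pick the HPU up automatically:

from accelerate import Accelerator

# Illustrative only: on a Gaudi/HPU machine with the Habana PyTorch bridge
# installed, the Accelerator should detect and use the HPU; elsewhere it
# falls back to whatever device is available.
accelerator = Accelerator()
print(accelerator.device)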

Full Changelog: v1.4.0...v1.5.0

v1.4.0: `torchao` FP8, TP & DataLoader support, memory leak fix

17 Feb 17:18

torchao FP8, initial Tensor Parallel support, and memory leak fixes

torchao FP8

This release introduces a new FP8 API and brings in a new backend: torchao. To use it, pass AORecipeKwargs to the Accelerator while setting mixed_precision="fp8" (see the sketch below). This is initial support; as it matures, we will incorporate more into it (such as accelerate config/yaml support) in future releases. See our benchmark examples here
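
A minimal sketch of enabling the new backend, assuming torchao is installed and FP8-capable hardware is available:

import torch
from accelerate import Accelerator
from accelerate.utils import AORecipeKwargs

# Combine mixed_precision="fp8" with an AORecipeKwargs handler to select
# the torchao backend (defaults are used here; the recipe can be customized).
accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=[AORecipeKwargs()])

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters())
model, optimizer = accelerator.prepare(model, optimizer)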

TensorParallel

We have initial support for an in-house solution for TP when working with accelerate dataloaders. Check out the PR here

Full Changelog: v1.3.0...v1.4.0

v1.3.0: Bug fixes + require torch 2.0

17 Jan 15:56

Torch 2.0

As it has been ~2 years since torch 2.0 was first released, we now require it as the minimum version for Accelerate, as was similarly done in transformers as of its latest release.

Core

  • [docs] no hard-coding cuda by @faaany in #3270
  • fix load_state_dict for npu by @ji-huazhong in #3211
  • Add keep_torch_compile param to unwrap_model and extract_model_from_parallel for distributed compiled model. by @ggoggam in #3282
  • [tests] make cuda-only test case device-agnostic by @faaany in #3340
  • latest bnb no longer has optim_args attribute on optimizer by @winglian in #3311
  • add torchdata version check to avoid "in_order" error by @faaany in #3344
  • [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in #3296
  • dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in #3346
  • feat(tpu): remove nprocs from xla.spawn by @tengomucho in #3324

Examples

  • Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in #3193

Full Changelog: v1.2.1...v1.3.0

v1.2.1: Patchfix

13 Dec 18:56
  • fix: add max_memory to _init_infer_auto_device_map's return statement in #3279 by @Nech-C
  • fix load_state_dict for npu in #3211 by @statelesshz

Full Changelog: v1.2.0...v1.2.1

v1.2.0: Bug Squashing & Fixes across the board

13 Dec 18:47

Core

  • enable find_executable_batch_size on XPU by @faaany in #3236
  • Use numpy._core instead of numpy.core by @qgallouedec in #3247
  • Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
  • Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
  • [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
  • [data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
  • use XPU instead of GPU in the accelerate config prompt text by @faaany in #3268

Big Modeling

  • Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in #3217
  • Remove hook for bnb 4-bit by @SunMarc in #3223
  • [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
  • Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
  • Update deferring_execution.md by @max-yue in #3262
  • Revert default behavior of get_state_dict_from_offload by @kylesayrs in #3253
  • Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in #3248

DeepSpeed

  • Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
  • support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266

Full Changelog

  • Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in #3217
  • remove hook for bnb 4-bit by @SunMarc in #3223
  • enable find_executable_batch_size on XPU by @faaany in #3236
  • take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in #3226
  • [docs] update code in tracking documentation by @faaany in #3235
  • Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in #3066
  • [data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in #3246
  • [docs] add instruction to install bnb on non-cuda devices by @faaany in #3227
  • Use numpy._core instead of numpy.core by @qgallouedec in #3247
  • Allow for full dynamo config passed to Accelerator by @muellerzr in #3251
  • [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in #3252
  • use XPU instead of GPU in the accelerate config prompt text by @faaany in #3268
  • support for wrapped schedulefree optimizer when using deepspeed by @winglian in #3266
  • Update deferring_execution.md by @max-yue in #3262
  • Fix: Resolve #3257 by @as12138 in #3261
  • Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in #3259
  • Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in #3255
  • Revert default behavior of get_state_dict_from_offload by @kylesayrs in #3253
  • Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in #3248
  • [docs] update set-seed by @faaany in #3228
  • [docs] fix typo by @faaany in #3221
  • [docs] use real path for checkpoint by @faaany in #3220
  • Fixed multiple typos for Tutorials and Guides docs by @henryhmko in #3274

Code Diff

Release diff: v1.1.1...v1.2.0

v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes

01 Nov 15:30

Internals

  • Allow for a data_seed argument in #3150
  • Trigger weights_only=True by default for all compatible objects when checkpointing and saving with torch.save in #3036
  • Handle negative values for dim input in pad_across_processes in #3114
  • Enable cpu bnb distributed lora finetune in #3159

DeepSpeed

  • Support torch dynamo for deepspeed>=0.14.4 in #3069

Megatron

  • update Megatron-LM plugin code to version 0.8.0 or higher in #3174

Big Model Inference

  • New has_offloaded_params utility added in #3188
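
A quick sketch of querying the new utility (assuming it takes a torch.nn.Module and returns a bool):

import torch
from accelerate.utils import has_offloaded_params

# has_offloaded_params reports whether a module's weights are managed by an
# offloading hook (e.g. after big-model dispatch with offloading enabled).
model = torch.nn.Linear(8, 8)
print(has_offloaded_params(model))  # False for a plain, non-dispatched module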

Examples

  • Florence2 distributed inference example in #3123

Full Changelog: v1.0.1...v1.1.0

v1.0.1: Bugfix

12 Oct 03:01

Bugfixes

  • Fixes an issue where the auto values were no longer being parsed when using deepspeed
  • Fixes a broken test in the deepspeed tests related to the auto values

Full Changelog: v1.0.0...v1.0.1

Accelerate 1.0.0 is here!

07 Oct 15:42

🚀 Accelerate 1.0 🚀

With accelerate 1.0, we are officially stating that the core parts of the API are now "stable" and ready for the future of distributed training and PyTorch. In these release notes, we first focus on the major breaking changes so you can fix your code, followed by what is new specifically between 0.34.0 and 1.0.

To read more, check out our official blog here

Migration assistance

  • Passing dispatch_batches, split_batches, even_batches, and use_seedable_sampler to the Accelerator() should now be handled by creating an accelerate.utils.DataLoaderConfiguration() and passing it to the Accelerator() instead (Accelerator(dataloader_config=DataLoaderConfiguration(...))); see the sketch after this list
  • Accelerator().use_fp16 and AcceleratorState().use_fp16 have been removed; this should be replaced by checking accelerator.mixed_precision == "fp16"
  • Accelerator().autocast() no longer accepts a cache_enabled argument. Instead, an AutocastKwargs() instance handling this flag (among others) should be passed to the Accelerator (Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)]))
  • accelerate.utils.is_tpu_available should be replaced with accelerate.utils.is_torch_xla_available
  • accelerate.utils.modeling.shard_checkpoint should be replaced with split_torch_state_dict_into_shards from the huggingface_hub library
  • accelerate.tqdm.tqdm() no longer accepts True/False as the first argument, and instead, main_process_only should be passed in as a named argument
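
A minimal sketch combining the first, second, and third points above (illustrative values only):

from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DataLoaderConfiguration

# Dataloader-related flags now live on DataLoaderConfiguration, and
# cache_enabled is passed through AutocastKwargs instead of autocast().
dataloader_config = DataLoaderConfiguration(split_batches=True, even_batches=True)
accelerator = Accelerator(
    dataloader_config=dataloader_config,
    kwargs_handlers=[AutocastKwargs(cache_enabled=True)],
)

# The old use_fp16 checks become:
is_fp16 = accelerator.mixed_precision == "fp16"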

Multiple Model DeepSpeed Support

After many requests, we finally have multiple-model DeepSpeed support in Accelerate (though it is still quite early). Read the full tutorial here; in essence:

When using multiple models, a DeepSpeed plugin should be created for each model (and, as a result, a separate config). A few examples are below:

Knowledge distillation

(Where only one model, the student under ZeRO-2, is trained, and the other, the teacher under ZeRO-3, is used for inference)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

# The dict keys ("student" / "teacher") are the names used later with
# accelerator.state.select_deepspeed_plugin(...)
deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}

accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

To then select which plugin is used at a given time (i.e. when calling prepare), we call `accelerator.state.select_deepspeed_plugin("name")`; the first plugin is active by default:

accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)

accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)

Multiple disjoint models

For disjoint models, a separate Accelerator should be used for each model, and .backward() should be called on each one separately:

for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()
    
    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()

FP8

We've enabled MS-AMP support up to FSDP. At this time, we are not going forward with implementing FSDP support with MS-AMP, due to design issues between the two libraries that make them difficult to interoperate.

FSDP

  • Fixed FSDP auto_wrap using characters instead of full str for layers
  • Re-enable setting state dict type manually

Big Modeling

  • Removed cpu restriction for bnb training

Full Changelog: v0.34.2...v1.0.0

v0.34.1: Patchfix

05 Sep 15:36

Bug fixes

  • Fixes an issue where processed DataLoaders could no longer be pickled in #3074 thanks to @byi8220
  • Fixes an issue when using FSDP where default_transformers_cls_names_to_wrap would separate _no_split_modules by characters instead of keeping it as a list of layer names in #3075

Full Changelog: v0.34.0...v0.34.1