01 Nov 06:07

dakinggg

b297981

v0.26.1 Latest

What's Changed

Private link error handling by @nancyhung in #3689

Full Changelog: v0.26.0...v0.26.1

Contributors

nancyhung

25 Oct 21:36

irenedea

v0.26.0

c0cb58f

What's New

1. Torch 2.5.0 Compatibility (#3663)

We've added support for torch 2.5.0, including necessary patches to Torch.

Deprecations and Breaking Changes

1. FSDP Configuration Changes (#3681)

We no longer support passing fsdp_config and fsdp_auto_wrap directly to Trainer.

To specify an FSDP config and configure FSDP auto wrapping, use parallelism_config instead:

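from composer import Trainer

trainer = Trainer(
    # ... model, dataloaders, and other Trainer arguments
    parallelism_config={
        'fsdp': {
            'auto_wrap': True,
            # ... other FSDP options
        },
    },
)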
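
For code that still builds the removed arguments, a small migration helper along the following lines may be useful. This is a sketch, not part of Composer: migrate_fsdp_kwargs is a hypothetical name, and it assumes that the old fsdp_config dict maps directly onto parallelism_config['fsdp'] and that fsdp_auto_wrap corresponds to the auto_wrap key shown above.

```python
from typing import Any


def migrate_fsdp_kwargs(trainer_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of trainer_kwargs that uses parallelism_config.

    Hypothetical helper: assumes fsdp_config maps one-to-one onto
    parallelism_config['fsdp'] and fsdp_auto_wrap onto its 'auto_wrap' key.
    """
    kwargs = dict(trainer_kwargs)
    fsdp_config = kwargs.pop('fsdp_config', None)
    fsdp_auto_wrap = kwargs.pop('fsdp_auto_wrap', None)
    if fsdp_config is None and fsdp_auto_wrap is None:
        return kwargs  # nothing to migrate

    # Merge into any parallelism_config the caller already provided.
    parallelism_config = dict(kwargs.get('parallelism_config') or {})
    fsdp = dict(parallelism_config.get('fsdp') or {})
    fsdp.update(fsdp_config or {})
    if fsdp_auto_wrap is not None:
        fsdp['auto_wrap'] = fsdp_auto_wrap
    parallelism_config['fsdp'] = fsdp
    kwargs['parallelism_config'] = parallelism_config
    return kwargs
```

The result can then be passed on as Trainer(**migrate_fsdp_kwargs(old_kwargs)). Check the migrated keys against the Composer documentation before relying on them.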

2. Removal of PyTorch Legacy Sharded Checkpoint Support (#3631)

PyTorch briefly used a sharded checkpoint format different from the current one and quickly deprecated it. We have now removed support for this format entirely. Support for saving in this format was removed in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.

What's Changed

Add backward compatibility checkpoint tests for v0.25.0 by @dakinggg in #3635
Don't use TP when tensor_parallel_degree is 1 by @eitanturok in #3636
Update huggingface-hub requirement from <0.25,>=0.21.2 to >=0.21.2,<0.26 by @dependabot in #3637
Update transformers requirement from !=4.34.0,<4.45,>=4.11 to >=4.11,!=4.34.0,<4.46 by @dependabot in #3638
Bump databricks-sdk from 0.32.0 to 0.33.0 by @dependabot in #3639
Remove Legacy Checkpointing by @mvpatel2000 in #3631
Surface UC permission error by @b-chu in #3642
Tensor Parallelism Tests by @eitanturok in #3620
Switch to log.info for deterministic mode by @mvpatel2000 in #3643
Update pre-commit requirement from <4,>=3.4.0 to >=3.4.0,<5 by @dependabot in #3645
Update peft requirement from <0.13,>=0.10.0 to >=0.10.0,<0.14 by @dependabot in #3646
Create callback to load checkpoint by @irenedea in #3641
Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #3595
Fix DB SDK Import by @mvpatel2000 in #3648
Bump coverage[toml] from 7.6.0 to 7.6.3 by @dependabot in #3651
Bump pypandoc from 1.13 to 1.14 by @dependabot in #3652
Replace list with Sequence by @KuuCi in #3654
Add better error handling for non-rank 0 during Monolithic Checkpoint Loading by @j316chuck in #3647
Raising a better warning if train or eval did not process any data. by @ethantang-db in #3656
Fix Logo by @XiaohanZhangCMU in #3659
Update huggingface-hub requirement from <0.26,>=0.21.2 to >=0.21.2,<0.27 by @dependabot in #3668
Bump cryptography from 42.0.8 to 43.0.3 by @dependabot in #3667
Bump pytorch to 2.5.0 by @b-chu in #3663
Don't overwrite sys.excepthook in mlflow logger by @dakinggg in #3675
Fix pull request target by @b-chu in #3676
Use a temp path to save local checkpoints for remote save path by @irenedea in #3673
Loss gen tokens by @dakinggg in #3677
Refactor maybe_create_object_store_from_uri by @irenedea in #3679
Don't error if some batch slice has no loss generating tokens by @dakinggg in #3682
Bump version to 0.27.0.dev0 by @irenedea in #3681

New Contributors

@ethantang-db made their first contribution in #3656

Full Changelog: v0.25.0...v0.26.0

Contributors

j316chuck, irenedea, and 8 other contributors

24 Sep 20:56

dakinggg

v0.25.0

0c4e110

What's New

1. Torch 2.4.1 Compatibility (#3609)

We've added support for torch 2.4.1, including necessary patches to Torch.

Deprecations and breaking changes

1. Microbatch device movement (#3567)

Instead of moving the entire batch to device at once, we now move each microbatch to device. This saves memory for large inputs, e.g. multimodal data, when training with many microbatches.

This change may affect certain callbacks which run operations on the batch which require it to be moved to an accelerator ahead of time, such as the two changed in this PR. There shouldn't be too many of these callbacks, so we anticipate this change will be relatively safe.
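
The difference can be sketched without any framework. The code below is an illustration only, not Composer's implementation: split_into_microbatches and run_microbatches are hypothetical names, and to_device stands in for whatever moves data to the accelerator.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar('T')


def split_into_microbatches(batch: Sequence[T], microbatch_size: int) -> list[Sequence[T]]:
    """Slice a batch (still on the host) into consecutive microbatches."""
    return [batch[i:i + microbatch_size] for i in range(0, len(batch), microbatch_size)]


def run_microbatches(
    batch: Sequence[T],
    microbatch_size: int,
    to_device: Callable[[Sequence[T]], Sequence[T]],
    step: Callable[[Sequence[T]], None],
) -> None:
    """Move one microbatch at a time, so only that slice needs device memory."""
    for microbatch in split_into_microbatches(batch, microbatch_size):
        step(to_device(microbatch))
```

In this arrangement, anything that inspects the full batch before the loop sees host-side data, which is why callbacks that expect the batch to already be on an accelerator may need adjusting.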

2. DeepSpeed deprecation version (#3634)

We have updated the Composer version in which DeepSpeed support will be removed to 0.27.0. Please reach out on GitHub if you have any concerns about this.

3. PyTorch legacy sharded checkpoint format

PyTorch briefly used a sharded checkpoint format different from the current one and quickly deprecated it. We have continued to support loading legacy-format checkpoints, but we will likely remove support for this format entirely in an upcoming release. Support for saving in this format was removed in #2262, and the original feature was added in #1902. Please reach out if you have concerns or need help converting your checkpoints to the new format.

What's Changed

Set dev version back to 0.25.0.dev0 by @snarayan21 in #3582
Microbatch Device Movement by @mvpatel2000 in #3567
Init Dist Default None by @mvpatel2000 in #3585
Explicit None Check in get_device by @mvpatel2000 in #3586
Update protobuf requirement from <5.28 to <5.29 by @dependabot in #3591
Bump databricks-sdk from 0.30.0 to 0.31.1 by @dependabot in #3592
Update ci-testing to 0.2.2 by @dakinggg in #3590
Bump Mellanox Tools by @mvpatel2000 in #3597
Roll back ci-testing for daillies by @mvpatel2000 in #3598
Revert driver changes by @mvpatel2000 in #3599
Remove step in log_image for MLFlow by @mvpatel2000 in #3601
Reduce system metrics logging frequency by @chenmoneygithub in #3604
Bump databricks-sdk from 0.31.1 to 0.32.0 by @dependabot in #3608
torch2.4.1 by @bigning in #3609
Test with torch2.4.1 image by @bigning in #3610
fix 2.4.1 test by @bigning in #3612
Remove tensor option for _global_exception_occured by @irenedea in #3611
Update error message for overwrite to be more user friendly by @mvpatel2000 in #3619
Update wandb requirement from <0.18,>=0.13.2 to >=0.13.2,<0.19 by @dependabot in #3615
Fix RNG key checking by @dakinggg in #3623
Update datasets requirement from <3,>=2.4 to >=2.4,<4 by @dependabot in #3626
Disable exceptions for MosaicML Logger by @mvpatel2000 in #3627
Fix CPU dailies by @mvpatel2000 in #3628
fix 2.4.1ckpt by @bigning in #3629
More checkpoint debug logs by @mvpatel2000 in #3632
Lower DeepSpeed deprecation version by @mvpatel2000 in #3634
Bump version 25 by @dakinggg in #3633

Full Changelog: v0.24.1...v0.25.0

Contributors

bigning, irenedea, and 5 other contributors

27 Aug 22:37

snarayan21

v0.24.1

3c7fefb

Bug Fixes

1. Disallow passing device_mesh to FSDPConfig (#3580)

Explicitly errors if device_mesh is passed to FSDPConfig. This completes the deprecation from v0.24.0 and also addresses cases where a user specified a device mesh but it was ignored, leading to training with the incorrect parallelism style (e.g., using FSDP instead of HSDP).

What's Changed

Bump main version to 0.25.0.dev0 by @snarayan21 in #3573
update daily by @KevDevSha in #3572
Bump pandoc from 2.3 to 2.4 by @dependabot in #3575
Update transformers requirement from !=4.34.0,<4.44,>=4.11 to >=4.11,!=4.34.0,<4.45 by @dependabot in #3574
Checkpoint backwards compatibility tests for v0.24.0 by @snarayan21 in #3579
Error if device mesh specified in fsdp config by @snarayan21 in #3580
Bump version to 0.24.1. by @snarayan21 in #3581

Full Changelog: v0.24.0...v0.24.1

Contributors

dependabot, snarayan21, and KevDevSha

26 Aug 14:48

snarayan21

v0.24.0

020b0ef

What's New

1. Torch 2.4 Compatibility (#3542, #3549, #3553, #3552, #3565)

Composer now supports Torch 2.4! We are tracking a few checkpointing-related issues in the latest PyTorch, which we have raised with the PyTorch team:

[PyTorch Issue] Distributed checkpointing using PyTorch DCP has issues with stateless optimizers, e.g. SGD. We recommend using composer.optim.DecoupledSGDW as a workaround.
[PyTorch Issue] Distributed checkpointing using PyTorch DCP broke backwards compatibility. We have patched this using the following planner, but this may break custom planner loading.

2. New checkpointing APIs (#3447, #3474, #3488, #3452)

We've added new checkpointing APIs to download, upload, and load / save, so that checkpointing is usable outside of a Trainer object. We will be fully migrating to these new APIs in the next minor release.

3. Improved Auto-microbatching (#3510, #3522)

We've fixed deadlocks with auto-microbatching with FSDP, bringing throughput in line with manually setting the microbatch size. This is achieved through enabling sync hooks wherever a training run might OOM to find the correct microbatch size, and disabling these hooks for the rest of training.

Bug Fixes

1. Fix checkpoint symlink uploads (#3376)

Ensures that checkpoint files are uploaded before the symlink file, fixing errors with missing or incomplete checkpoints.
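
The ordering constraint can be sketched as follows. This is a simplified illustration, not Composer's uploader: upload_checkpoint is a hypothetical function, and upload stands in for any callable that sends one file to remote storage.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable


def upload_checkpoint(files: list[str], symlink: str, upload: Callable[[str], None]) -> None:
    """Upload all checkpoint files, and only then the symlink that points at them."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(upload, f) for f in files]
        wait(futures)  # block until every checkpoint file has finished uploading
        for future in futures:
            future.result()  # re-raise any upload failure before publishing the symlink
    upload(symlink)
```

If any checkpoint upload fails, the exception propagates and the symlink is never uploaded, so a reader following the symlink should not land on a missing or partial checkpoint.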

2. Optimizer tracks same parameters after FSDP wrapping (#3502)

When only a subset of parameters should be tracked by the optimizer, FSDP wrapping no longer interferes.

What's Changed

Bump ipykernel from 6.29.2 to 6.29.5 by @dependabot in #3459
Update torchmetrics requirement from <1.3.3,>=0.10.0 to >=1.4.0.post0,<1.4.1 by @dependabot in #3460
[Checkpoint] Fix symlink issue where symlink file uploaded before checkpoint files upload by @bigning in #3376
Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #3456
Remove Log Exception by @jjanezhang in #3464
Corrected docs for MFU in SpeedMonitor by @JackZ-db in #3469
[checkpoint v2] Download api by @bigning in #3447
Upload api by @bigning in #3474
[Checkpoint V2] Upload API by @bigning in #3488
Load api by @eracah in #3452
Add helpful comment explaining HSDP initialization seeding by @mvpatel2000 in #3470
Add fit start to mosaicmllogger by @ethanma-db in #3467
Remove OOM-Driven FSDP Deadlocks and Increase Throughput of Automicrobatching by @JackZ-db in #3510
Move hooks and fsdp modules onto state rather than trainer by @JackZ-db in #3522
Bump coverage[toml] from 7.5.4 to 7.6.0 by @dependabot in #3471
revert a wip PR by @bigning in #3475
Change FP8 Eval to default to activation dtype by @j316chuck in #3454
Get a shared file system safe signal file name by @dakinggg in #3485
Bumping flash attention version to v2.6.2 by @ShashankMosaicML in #3489
Bump to Pytorch 2.4 by @mvpatel2000 in #3542
Add Torch 2.4 Tests by @mvpatel2000 in #3549
Fix torch 2.4 images for tests by @snarayan21 in #3553
Fix torch 2.4 tests by @mvpatel2000 in #3552
Fix bug when subset of model parameters is passed into optimizer with FSDP by @sashaDoubov in #3502
Correctly process parallelism_config['tp'] when it's a dict by @snarayan21 in #3434
[torch2.4] Fix sharded checkpointing backward compatibility issue by @bigning in #3565
[fix-daily] Use composer get_model_state_dict instead of torch's by @eracah in #3492
Load Microbatches instead of Entire Batches to GPU by @JackZ-db in #3487
Make Pytest log in color in Github Action by @eitanturok in #3505
Revert "Load Microbatches instead of Entire Batches to GPU " by @JackZ-db in #3508
Bump transformers version by @dakinggg in #3511
Fix FSDP Config Validation by @mvpatel2000 in #3530
Add FSDP input validation for use_orig_params and activation_cpu_offload flag by @j316chuck in #3515
Fix checkpoint events by @b-chu in #3468
Patch conf.py for readthedocs sphinx injection deprecation. by @mvpatel2000 in #3491
save load path in state and pass to mosaicmllogger by @ethanma-db in #3506
Disable gcs azure daily test by @bigning in #3514
Update huggingface-hub requirement from <0.24,>=0.21.2 to >=0.21.2,<0.25 by @dependabot in #3481
restore version on dev by @XiaohanZhangCMU in #3451
Deprecate deepspeed by @dakinggg in #3512
Update importlib-metadata requirement from <7,>=5.0.0 to >=5.0.0,<9 by @dependabot in #3519
Update peft requirement from <0.12,>=0.10.0 to >=0.10.0,<0.13 by @dependabot in #3518
Use gloo as part of DeviceGPU's process group backend by @snarayan21 in #3509
Add a monitor of mlflow logger so that it sets run status as failed if main thread exits unexpectedly by @chenmoneygithub in #3449
Revert "Use gloo as part of DeviceGPU's process group backend (#3509)" by @snarayan21 in #3523
Fix autoresume docstring (save_overwrite) by @eracah in #3526
Unpin pip by @dakinggg in #3524
hasattr check for Wandb 0.17.6 by @mvpatel2000 in #3531
Remove dev on github workflows by @mvpatel2000 in #3536
Remove dev branch in GPU workflows by @mvpatel2000 in #3539
restore google cloud object store test by @bigning in #3538
Update moto[s3] requirement from <5,>=4.0.1 to >=4.0.1,<6 by @dependabot in #3516
use s3 boto3 Adaptive retry as default retry mode by @bigning in #3543
Use python 3.11 in GAs by @eitanturok in #3529
Implement ruff rules enforcing pep 585 by @snarayan21 in #3551
Update numpy requirement from <2.1.0,>=1.21.5 to >=1.21.5,<2.2.0 by @dependabot in #3556
Bump databricks-sdk from 0.29.0 to 0.30.0 by @dependabot in #3559
Update Optim to DecoupledSGD in Notebooks by @mvpatel2000 in #3554
Remove lambda code eval testing by @mvpatel2000 in #3560
Restore Azure Tests by @mvpatel2000 in #3561
Remove tokens for to_next_epoch by @mvpatel2000 in #3562
Change iteration timestamp for old checkpoints by @b-chu in #3563
Fix typo in composer_collect_env by @dakinggg in #3566
Add default value to get_device() by @coryMosaicML in #3568
add ghcr and update build matrix generator by @KevDevSha in #3465
Bump aws_ofi_nccl to 1.11.0 by @willgleich in #3569
allow listed runners by @KevDevSha in #3486
fix runner linux-ubuntu > ubuntu-latest by @KevDevSha in #3571
Bump version to v0.24.0 + deprecations by @snarayan21 in https://github.co...

Contributors

bigning, sashaDoubov, and 17 other contributors

03 Jul 02:08

XiaohanZhangCMU

v0.23.5

56ccc2e

What's New

1. Variable length dataloaders (#3416)

Adds support for dataloaders with rank-dependent lengths. The solution terminates iteration for dataloaders on all ranks when the first dataloader finishes.
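
The stopping rule can be illustrated with a single-process simulation. This is only a sketch: lockstep_batches is a hypothetical name, and each "rank" here is just a Python iterable. In a real distributed job the ranks would have to communicate to agree on when to stop, which this example sidesteps.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar('T')


def lockstep_batches(rank_loaders: list[Iterable[T]]) -> Iterator[list[T]]:
    """Yield one batch per rank per step, stopping when any rank runs out."""
    iterators = [iter(loader) for loader in rank_loaders]
    while True:
        step = []
        for it in iterators:
            try:
                step.append(next(it))
            except StopIteration:
                return  # the shortest dataloader finished, so every rank stops
        yield step
```

For example, with one rank holding three batches and another holding two, iteration ends after two steps and the third batch on the longer rank is never used.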

Bug Fixes

1. Remove close flush for mosaicml logger (#3446)

Previously, the MosaicML Logger sporadically raised an error when the python interpreter was shutting down as it attempted to flush data on Event.CLOSE using futures, which cannot be scheduled at that time. Instead, we now only block on finishing existing data upload on Event.CLOSE, avoiding scheduling new futures.
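
A minimal sketch of this pattern, using only the standard library, is shown below. BufferedUploader is a hypothetical class written for illustration and does not mirror the MosaicML Logger's actual code. The point is that close waits on futures that already exist and does not submit new ones.

```python
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable


class BufferedUploader:
    """Hypothetical logger-like object that uploads data on a background thread."""

    def __init__(self, upload: Callable[[dict], None]) -> None:
        self._upload = upload
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: list[Future] = []
        self._buffer: dict = {}

    def log(self, data: dict) -> None:
        """Accumulate data locally without scheduling any work."""
        self._buffer.update(data)

    def flush(self) -> None:
        """Schedule an upload of the buffered data as a new future."""
        if self._buffer:
            payload, self._buffer = self._buffer, {}
            self._pending.append(self._pool.submit(self._upload, payload))

    def close(self) -> None:
        """Block on uploads already in flight; do not schedule new futures."""
        wait(self._pending)
        self._pool.shutdown(wait=True)
```

In this sketch, data logged after the last flush is not uploaded at close. That trade-off is a property of the example and may differ from how the real logger behaves.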

What's Changed

Update numpy requirement from <1.27.0,>=1.21.5 to >=1.21.5,<2.1.0 by @dependabot in #3406
Restore dev version by @karan6181 in #3417
Save checkpoint to disk for API with new save layout by @eracah in #3399
Patch PyTorch 2.3.1 by @mvpatel2000 in #3419
Fixes some typing issues by @dakinggg in #3418
Fix style by @b-chu in #3420
Bump coverage[toml] from 7.5.3 to 7.5.4 by @dependabot in #3422
Update psutil requirement from <6,>=5.8.0 to >=5.8.0,<7 by @dependabot in #3424
Add support for variable length dataloaders in DDP by @JAEarly in #3416
Hsdp + MoE CI tests by @KuuCi in #3378
Bumping MLflow version to 2.14.1 by @JackZ-db in #3425
Skip HSDP + TP pytests that require torch 2.3 or above by @KuuCi in #3426
Remove CodeQL workflow by @mvpatel2000 in #3429
Remove save overwrite by @mvpatel2000 in #3431
Fixes to TP Docs by @snarayan21 in #3430
Lower the system metrics logging frequency to reduce MLflow server's load by @chenmoneygithub in #3436
Update paramiko requirement from <3,>=2.11.0 to >=3.4.0,<4 by @dependabot in #3439
Bump CI testing version by @mvpatel2000 in #3433
Fix docstring for EVAL_AFTER_ALL/EVAL_BEFORE_ALL by @mvpatel2000 in #3445
Remove close flush for mosaicml logger by @mvpatel2000 in #3446
Remove MosaicMLLambdaEvalClient by @aspfohl in #3432
Relax hf hub pin by @dakinggg in #3435
Pytest skip 2 by @KuuCi in #3448
bump version v0.23.5 by @XiaohanZhangCMU in #3450

Full Changelog: v0.23.4...v0.23.5

Contributors

eracah, JAEarly, and 11 other contributors

21 Jun 15:09

mvpatel2000

v0.23.4

ec8799d

Bug Fixes

1. Patch PyTorch 2.3.1 (#3419)

Fixes missing import when monkeypatching device mesh functions in PyTorch 2.3.1. This is necessary for MoE training.

Full Changelog: v0.23.3...v0.23.4

21 Jun 00:18

karan6181

v0.23.3

7c7f6de

New Features

1. Update MLflow logger to use the new time-dimension API for viewing images in MLflow (#3286)

We've enhanced the MLflow logger's log_image function to use the new API with time-dimension support, enabling images to be viewed in MLflow.

2. Add logging buffer time to MLflow logger (#3401)

We've added the logging_buffer_seconds argument to the MLflow logger, which specifies how many seconds to buffer before sending logs to the MLflow tracking server.

Bug Fixes

1. Only require `databricks-sdk` when on Databricks platform (#3389)

Previously, the MLflow logger always imported databricks-sdk. Now, the SDK is required only when running on the Databricks platform and using Databricks secrets to access managed MLflow.

2. Skip extra dataset state load during job resumption (#3393)

Previously, when loading a checkpoint with train_dataloader, the dataset_state would load first, and if train_dataloader was set again afterward, load_state_dict would be called with a None value. Now, we've added a check in the train_dataloader setter to skip this redundant load.

3. Fix auto-microbatching on CUDA 12.4 (#3400)

In CUDA 12.4, the out-of-memory error message has changed to CUDA error: out of memory. Previously, our logic hardcoded checks for CUDA out of memory when using device_train_microbatch_size="auto". Now, we check for both CUDA out of memory and CUDA error: out of memory.
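
The check described above amounts to matching either wording of the error message. The sketch below is an assumption about how such a check could look, not Composer's code: is_cuda_oom and OOM_MESSAGES are hypothetical names, and only the two message strings come from the release note.

```python
# The two out-of-memory wordings named in the release note.
OOM_MESSAGES = ('CUDA out of memory', 'CUDA error: out of memory')


def is_cuda_oom(exc: BaseException) -> bool:
    """Return True if exc is a RuntimeError carrying either OOM wording."""
    return isinstance(exc, RuntimeError) and any(m in str(exc) for m in OOM_MESSAGES)
```

A caller would typically wrap the training step in try/except RuntimeError and retry with a smaller microbatch only when is_cuda_oom returns True, re-raising every other error.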

4. Fix MLflow logging to Databricks workspace file paths that start with the `/Shared/` prefix (#3410)

Previously, on the Databricks platform, the MLflow logger prepended /Users/ to every user-provided logging path that did not already specify it. This included paths starting with /Shared/, which was incorrect because /Shared/ indicates a shared workspace. Now, the /Users/ prefix is skipped for paths starting with /Shared/.
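
The corrected rule can be sketched as a small path function. This is illustrative only: resolve_experiment_path is a hypothetical name, and placing the username directly after /Users/ is an assumption about the workspace layout rather than something stated in the release note.

```python
def resolve_experiment_path(path: str, username: str) -> str:
    """Prefix user-relative paths with /Users/<username>/.

    Paths that already start with /Users/ or /Shared/ are returned unchanged.
    """
    if path.startswith(('/Users/', '/Shared/')):
        return path
    return f"/Users/{username}/{path.strip('/')}"
```

With this rule, a path such as /Shared/team/exp is left untouched, while a bare name such as my-exp is placed under the user's own directory.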

What's Changed

Bump CI from 0.0.7 to 0.0.8 by @KuuCi in #3383
Fix backward compatibility caused by missing eval metrics class by @bigning in #3385
Bump version v0.23.2 by @bigning in #3386
Restore dev version by @bigning in #3388
Only requires databricks-sdk when inside the Databricks platform by @antoinebrl in #3389
Update packaging requirement from <24.1,>=21.3.0 to >=21.3.0,<24.2 by @dependabot in #3392
Bump cryptography from 42.0.6 to 42.0.8 by @dependabot in #3391
Skip extra dataset state load by @mvpatel2000 in #3393
Remove FSDP restriction from PyTorch 1.13 by @mvpatel2000 in #3395
Check for 'CUDA error: out of memory' when auto-microbatching by @JAEarly in #3400
Add tokens to iterations by @b-chu in #3374
Busy wait utils in dist by @dakinggg in #3396
Add buffering time to mlflow logger by @chenmoneygithub in #3401
Add missing import for PyTorch 2.3.1 device mesh slicing by @mvpatel2000 in #3402
Add pynvml to mlflow dep group by @dakinggg in #3404
min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged by @JackZ-db in #3373
Simplify launcher world size parsing by @mvpatel2000 in #3398
Optionally use flash-attn's CE loss for metrics by @snarayan21 in #3394
log image fix by @jessechancy in #3286
[ckpt-rewr] Save state dict API by @eracah in #3372
Revert "Optionally use flash-attn's CE loss for metrics (#3394)" by @snarayan21 in #3408
CPU tests image fix by @snarayan21 in #3409
Add setter for epoch in iteration by @b-chu in #3407
Move pillow dep as required by @mvpatel2000 in #3412
fixing mlflow logging to Databricks workspace file paths with /Shared/ prefix by @JackZ-db in #3410
Bump version v0.23.3 by @karan6181 in #3414

New Contributors

@JackZ-db made their first contribution in #3373

Full Changelog: v0.23.2...v0.23.3

Contributors

bigning, eracah, and 12 other contributors

08 Jun 03:11

bigning

v0.23.2

c7020f8

Bug Fixes

Fix backward compatibility issue caused by missing eval metrics class

What's Changed

Fix backward compatibility issue caused by missing eval metrics class by @bigning in #3385

Full Changelog: v0.23.1...release/v0.23.2

Contributors

bigning

07 Jun 15:03

mvpatel2000

v0.23.1

7a533cb

What's New

1. PyTorch 2.3.1 Upgrade

Composer now supports PyTorch 2.3.1.

What's Changed

Torch 2.3.1 Upgrade by @mvpatel2000 in #3367
Fix monkeypatch imports by @mvpatel2000 in #3375
Remove unnecessary state dict and load_state_dict functions by @eracah in #3361
Adding checkpoint backwards compatibility tests after 0.23.0 release by @bigning in #3377
prepare_fsdp_module documentation fix by @KuuCi in #3379
Composer version bump to v0.23.1 by @snarayan21 in #3380
Clear caplog and use as context manager in test_logging by @snarayan21 in #3382

Full Changelog: v0.23.0...v0.23.1

Contributors

bigning, eracah, and 3 other contributors

Releases: mosaicml/composer
