Add Autoconfig and Coordinated_Optimizer implementations for Tensor Parallel Autosharding #21703
Conversation
Summary of Changes (Gemini Code Assist)

Hello @buildwithsuhana, I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances Keras's distributed training capabilities by introducing automatic tensor parallelism. It provides the infrastructure needed to shard large model layers across multiple devices, enabling the training of models that would otherwise exceed the memory capacity of a single accelerator. The changes include a system for automatically determining sharding strategies for model layers and a specialized optimizer that coordinates state and gradient updates across the sharded components.
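To make the sharding idea concrete, here is a minimal, hypothetical sketch of what "determining a sharding strategy" could look like; the `naive_sharding_plan` helper, its dict format, and the column-split heuristic are illustrative assumptions, not the actual `autoconfig.py` API from this PR:

```python
# Hypothetical sketch only: walk a Keras model and record which Dense/EinsumDense
# kernels could be split along their output dimension across `world_size` devices.
import keras
from keras import layers

def naive_sharding_plan(model, world_size):
    """Map shardable kernels to the axis they could be split on (illustrative)."""
    plan = {}
    for layer in model.layers:
        if isinstance(layer, (layers.Dense, layers.EinsumDense)):
            for weight in layer.weights:
                # Only consider kernels whose output dim divides evenly.
                if "kernel" in weight.name and weight.shape[-1] % world_size == 0:
                    plan[f"{layer.name}/{weight.name}"] = {
                        "split_axis": len(weight.shape) - 1,
                        "num_shards": world_size,
                    }
    return plan

model = keras.Sequential([
    keras.Input(shape=(16,)),
    layers.Dense(64, name="up_proj"),
    layers.Dense(16, name="down_proj"),
])
print(naive_sharding_plan(model, world_size=4))
# {'up_proj/kernel': {'split_axis': 1, 'num_shards': 4},
#  'down_proj/kernel': {'split_axis': 1, 'num_shards': 4}}
```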
Code Review

This pull request introduces significant new functionality for tensor parallelism autosharding. The implementation is split across autoconfig.py for sharding plan generation and coordinated_optimizer.py for synchronized training. My review identified a few issues ranging from critical to medium severity. Notably, there is a critical bug in TensorParallelOptimizer that prevents the use of learning rate schedules. I also found some high-severity maintainability concerns in CoordinatedOptimizer due to its reliance on parsing internal variable names, which is fragile. Additionally, there are potential correctness issues in autoconfig.py and some violations of the Keras API design style guide. Addressing these points will improve the robustness and maintainability of this new feature.
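For reference on the learning-rate-schedule point, here is a small sketch of the standard Keras schedule mechanics that any wrapping optimizer needs to preserve; it does not show the actual TensorParallelOptimizer code:

```python
# Keras learning rate schedules are callable objects evaluated per optimizer
# step, so a wrapper must forward the schedule object itself rather than
# coercing it to a float at construction time.
import keras

schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9
)
print(float(schedule(0)))     # 0.001 at step 0
print(float(schedule(1000)))  # ~0.0009 after one decay period

# Passing the schedule object keeps per-step decay working:
optimizer = keras.optimizers.Adam(learning_rate=schedule)

# A wrapper that coerced the configured learning rate to a plain float at
# construction time would either raise or silently freeze the rate at its
# initial value, which is the class of bug flagged above.
```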
Review comments (resolved, now outdated) were left on:
keras/src/distribution/tensor_parallel/coordinated_optimizer.py (4 threads)
keras/src/distribution/tensor_parallel/coordinated_optimizer_test.py (1 thread)
Codecov Report

Coverage diff against master (#21703):

| | master | #21703 | +/- |
| --- | --- | --- | --- |
| Coverage | 82.59% | 82.22% | -0.38% |
| Files | 572 | 580 | +8 |
| Lines | 58327 | 58906 | +579 |
| Branches | 9131 | 9232 | +101 |
| Hits | 48177 | 48437 | +260 |
| Misses | 7818 | 8112 | +294 |
| Partials | 2332 | 2357 | +25 |
This PR introduces support for tensor parallelism autosharding in Keras, enabling users to shard large model layers across multiple devices. This is a crucial feature for training models that are too large to fit into the memory of a single accelerator.
The implementation is centered around two new components:
autoconfig.py: This module contains the logic to analyze a Keras model, identify sharding candidates (e.g., Dense, EinsumDense layers), and generate a sharding plan.
coordinated_optimizer.py: This is an optimizer wrapper that consumes the sharding plan. During training, it intercepts gradients for sharded variables and performs a collective AllReduce to ensure weight updates are correctly synchronized across all devices.
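As a rough illustration of that coordination step (names such as `shard_gradients` and the NumPy-based averaging are stand-ins for a real AllReduce, not the PR's CoordinatedOptimizer API):

```python
# Minimal sketch: gradients computed independently on each device for a
# replicated variable are averaged (an AllReduce-mean stand-in) before a single
# base optimizer applies the update, so every replica sees the same weights.
import numpy as np
import keras

var = keras.Variable(np.zeros((4,), dtype="float32"), name="replicated_weight")
base_optimizer = keras.optimizers.SGD(learning_rate=0.1)

# Pretend these gradients came back from two devices holding replicas of `var`.
shard_gradients = [
    np.array([1.0, 2.0, 3.0, 4.0], dtype="float32"),
    np.array([3.0, 2.0, 1.0, 0.0], dtype="float32"),
]

# Average the per-shard gradients (the AllReduce-mean step).
synced = keras.ops.convert_to_tensor(np.mean(np.stack(shard_gradients), axis=0))

# Apply the synchronized gradient once via the wrapped base optimizer.
base_optimizer.apply_gradients([(synced, var)])
print(keras.ops.convert_to_numpy(var))  # [-0.2 -0.2 -0.2 -0.2]
```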