feat: refactor main_ds.py #498

Open: wants to merge 1 commit into main from refactor

Conversation

@cdoern (Contributor) commented Apr 25, 2025

Introduce a new design for key components of main_ds.py -- namely, splitting model initialization, accelerator initialization, optimizer initialization, and checkpoint saving into classes:

  1. Model
  2. Accelerator
  3. Checkpointer

NOTE: a follow-up to this work will be to introduce classes/structure for the DataLoader, Sampler, etc. This was left out of this PR given the already large scope of the change.

The Model class wraps the various AutoModel classes we support -- it aims to be a lightweight wrapper that improves the usability of the library across different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.
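A minimal sketch of the shape described here -- constructor arguments, defaults, and the optimizer choice are illustrative assumptions, not the PR's exact signatures:

```python
# Illustrative sketch only; argument names and defaults are assumptions.
import torch
from transformers import AutoModelForCausalLM


class Model:
    """Lightweight wrapper around the AutoModel classes the library supports."""

    def __init__(self, model_path: str, lora_config=None, distributed_framework=None):
        self.lora_config = lora_config
        self.distributed_framework = distributed_framework
        self.model = AutoModelForCausalLM.from_pretrained(model_path)

    def setup_optimizer(self, learning_rate: float = 1e-5) -> torch.optim.Optimizer:
        # Returns one of the supported optimizer types; AdamW stands in here.
        return torch.optim.AdamW(self.model.parameters(), lr=learning_rate)
```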

The Accelerator class aims both to store commonly accessed variables associated with the accelerated model and to abstract model/optimizer mutation away from the user, who should only access our Model and Optimizer classes.

The Checkpointer class introduces a unified approach to our various checkpointing techniques. A user passes in their checkpointing style (full_state or hf_format), and the checkpointer, via checkpointer.checkpoint, saves the model using the selected method and any other applicable techniques (LoRA).
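Hypothetical end-to-end wiring of the three classes -- the import path, argument names, and file paths below are assumptions for illustration:

```python
# Assumed import path; the PR may keep these classes in a single module.
from instructlab.training import Accelerator, Checkpointer, Model

model = Model(model_path="./models/granite-7b-lab")  # path is illustrative
optimizer = model.setup_optimizer(learning_rate=1e-5)
accelerator = Accelerator(model=model, optimizer=optimizer)

checkpointer = Checkpointer(strategy="hf_format")  # or "full_state"
checkpointer.checkpoint(
    model=model,
    optimizer=optimizer,
    accelerator=accelerator,
    output_dir="checkpoints/",
)
```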

These classes are one of several steps needed to "SDK-ify" the training library.

Adding structure to code via classes can be either someone's favorite or least favorite thing, so I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code -- especially code meant to be a publicly consumable SDK -- and allow you to associate related objects and methods with one another.

Grouping functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing values like self.distributed_framework and self.lora_config so they are accessible across a class's methods drastically reduces the number of arguments per method, as well as the complexity of return values. Simpler methods with simpler arguments and return values make the code easier to test.

Additionally, this PR adds test_train and test_model unit tests and a new e2e test.

test_train.py mocks `train` and `train_epoch`. test_model.py tests instantiation of all the new classes: Model, Accelerator, and Checkpointer.

A new "Large" e2e PR job will run, testing the SDK and calculating loss. For now it uses ilab data generate to get skills and knowledge data, but it should be converted to use a static dataset from HF.

Loss on the main branch using the existing L40s x4 Large job:

https://github.com/instructlab/training/actions/runs/14767228280/attempts/1#summary-41467828439

Loss on this PR using L40s x4 Large Job:

https://github.com/instructlab/training/actions/runs/14812223441

Loss on this PR using the new L40s x4 Large SDK job (expect slightly different values):

https://github.com/instructlab/training/actions/runs/14812221612

@cdoern cdoern force-pushed the refactor branch 2 times, most recently from f4bf4b2 to c1564c5 on April 25, 2025 at 17:51
@cdoern cdoern mentioned this pull request Apr 25, 2025
@mergify mergify bot added the CI/CD (Affects CI/CD configuration), testing (Relates to testing), and ci-failure labels Apr 25, 2025
@cdoern cdoern force-pushed the refactor branch 6 times, most recently from 746f76a to 355aca0 on April 25, 2025 at 19:36
@cdoern cdoern force-pushed the refactor branch 2 times, most recently from 04c6543 to 3f57e60 on April 25, 2025 at 21:47
github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow failed on this PR: View run, please investigate.

@cdoern cdoern force-pushed the refactor branch 2 times, most recently from cc75fef to f24db49 on May 3, 2025 at 14:20

github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow succeeded on this PR: View run, congrats!

@RobotSail (Member) commented:
@cdoern I haven't gone through this in depth yet, but I'm wondering if it would make sense to build this into a v1 and do a major version bump to v1.0.0.

I'm not sure how much you're looking to break the APIs we have today, but if we went down this path, it would be a very good time to explore this option. This way we can work it into something nicer and more ergonomic than what we have today.

Since I think it would also be unfair for you to be the only person working on this, with the PR sitting idle while changes are actively being made to the repo, I propose this:

  1. We create a new branch training-v1 where we can merge this in as soon as you're ready. That way I and anyone else can start using it and helping to build on top of it
  2. We can begin cutting dev releases such as instructlab-training-v1.0.0a1, instructlab-training-v1.0.0.preview, etc., so people can consume and use this
  3. We should identify any other bottlenecks in the current library and address them in this branch (only big things that are difficult to change later; we shouldn't hold this up for too long)
  4. We can create compatibility APIs so that any consumers (primarily ilab) aren't affected, similar to what we've done for the data processing refactor and what @booxter was suggesting on the other PR
  5. Once we are ready, we can merge into main and cut a new release

@cdoern (Contributor, Author) commented May 4, 2025

@RobotSail I think that is a great idea! Thank you for looking at this. I really wanted to get the loss gathering working to prove that, at the very least, I wasn't degrading anything -- but a branch other than main makes complete sense to me so we can begin iterating on this! I can open a different PR against that branch instead of the main branch here. Again, thanks for thinking about this and taking a look at the PR :)

@RobotSail (Member) commented:
@cdoern It looks like the loss does work 😄

@fynnsu (Collaborator) left a comment:

Looks good. Added a few suggestions.

if local_rank == 0
else None
)
all_metrics: List[Metrics] = []
@fynnsu (Collaborator):

Do we need this Metrics collection? It seems like it's not currently being used here (values are collected but not returned or used in this fn). I also believe the new logging implementation #500 should handle metric collection, so that we don't have to pass a list of Metrics up the stack.

Member:

Yeah, in general it seems there is a lot of overlapping work you've done, @fynnsu, which would be better built on top of the new version @cdoern is building out here. So I wonder if it would be easier to merge this into a dev branch, merge the other PRs there, then release it as main once everything is ready.

model: Model,
optimizer: torch.optim.Optimizer,
accelerator: Accelerator,
strategy="all",
@fynnsu (Collaborator):

Add output_dir to Checkpointer init / attributes so it doesn't need to be passed to every method.
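A sketch of what that suggestion could look like (class shape and defaults are illustrative):

```python
class Checkpointer:
    def __init__(self, strategy: str = "all", output_dir: str = "checkpoints/"):
        self.strategy = strategy
        self.output_dir = output_dir  # stored once, not passed to every method

    def checkpoint(self, model, optimizer, accelerator):
        save_path = self.output_dir  # previously a per-call argument
        ...
```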

model: Model,
samples_per_gpu: int,
grad_accum: int,
train_loader: DataLoader,
@fynnsu (Collaborator):

I don't think the train_loader should be attached to Accelerator; these feel like separate concepts to me.

Currently, the only real "use" of train_loader in this class is getting its length for calculating the LR schedule. However, the train_loader is also being accessed directly from the Accelerator attribute in train.py, which feels like an unintuitive way to access the data; we should just pass dataloaders directly to the train fns.
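A sketch of that direction -- the signature below is illustrative, not the PR's actual train fn:

```python
def train(model, optimizer, accelerator, train_loader):
    # The dataloader arrives as an explicit argument rather than being read
    # off accelerator.train_loader.
    for batch in train_loader:
        ...  # training step
```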

Comment on lines +500 to +502
deepspeed_cpu_offload_optimizer: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_pin_memory: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None,
@fynnsu (Collaborator):

Initialize DeepSpeedOptions in main_ds.py and pass it (or None) into Accelerator.__init__. Then you can just store self.deepspeed_cpu_offload_options on the accelerator.
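A hypothetical shape for that suggestion -- the dataclass below is a sketch, not the library's actual DeepSpeedOptions definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeepSpeedOptions:
    cpu_offload_optimizer: bool = False
    cpu_offload_optimizer_pin_memory: bool = False
    cpu_offload_optimizer_ratio: Optional[float] = None


# Inside Accelerator.__init__ the whole object is stored once:
#     self.deepspeed_cpu_offload_options = deepspeed_options  # may be None
```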


def save_dict_accelerate(
self,
accelerator: Accelerator,
@fynnsu (Collaborator):

I'm pretty sure this type is wrong. It should refer to accelerate.Accelerator, which you've aliased to TransformersAccel in this file.

In general, I find this duplicate use of Accelerator quite confusing. It would probably be beneficial to rename our Accelerator to something like DistributedBackend (or whatever) to avoid the name conflict.
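An illustration of the collision being described (sketch; bodies elided):

```python
from accelerate import Accelerator as TransformersAccel  # HF accelerate class


class Accelerator:  # the library's own wrapper -- easy to confuse with the alias
    ...


def save_dict_accelerate(accelerator: TransformersAccel, state_to_save):
    ...  # the annotation should name the HF class, not the library's wrapper
```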

def save_dict_accelerate(
self,
accelerator: Accelerator,
state_to_save,
@fynnsu (Collaborator):

It would also be useful to have type hints for these methods. This is potentially not your fault, as I imagine you mostly just moved this code around, but this method is currently only called with state_to_save as a model's state_dict(). With that type info, setting state_to_save.modules and state_to_save.parameters to a noop doesn't make sense, because those attributes don't exist in the first place.
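A sketch of the suggested annotation, assuming state_to_save is always a model state_dict:

```python
from typing import Dict

import torch
from accelerate import Accelerator as TransformersAccel


class Checkpointer:
    def save_dict_accelerate(
        self,
        accelerator: TransformersAccel,
        state_to_save: Dict[str, torch.Tensor],  # always a model state_dict
    ):
        ...
```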

Comment on lines +869 to +876
epoch: int,
samples_seen: int,
last_epoch: bool = False,
**kwargs,
@fynnsu (Collaborator):

`epoch` and `kwargs` are not used.

output_dir,
epoch: int,
samples_seen: int,
**kwargs,
@fynnsu (Collaborator):

`kwargs` is not used. I also just raised #527 to add checks for unused parameters to our linter.

Comment on lines +709 to +718
return cls(
model=model,
grad_accum=grad_accum,
train_loader=train_loader,
distributed_framework=DistributedBackend.FSDP,
samples_per_gpu=samples_per_gpu,
fsdp_sharding_strategy=fsdp_sharding_strategy,
fsdp_cpu_offload_params=fsdp_cpu_offload_params,
save_samples=save_samples,
)
@fynnsu (Collaborator):

Maybe these two classmethods should just use **kwargs?
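A sketch of the **kwargs variant being suggested -- the classmethod name and default handling below are illustrative:

```python
class Accelerator:
    def __init__(self, **options):
        self.options = options

    @classmethod
    def setup_fsdp(cls, **kwargs):
        # Only the framework is pinned; everything else passes straight through.
        kwargs.setdefault("distributed_framework", "fsdp")
        return cls(**kwargs)
```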

self.model = add_noisy_embeddings(self.model, noise_alpha=self.noise_alpha)

@staticmethod
def supports_flash_attention(device_id=0):
@fynnsu (Collaborator):

I think this staticmethod and the next one, check_flash_attn_enabled, were better as functions in utils.py. They're only loosely model-related -- really, they're system-related -- and they don't use any attributes from Model.
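As a free function in utils.py, that might look like the sketch below (the capability threshold is an assumption -- FlashAttention generally needs Ampere-class hardware or newer):

```python
import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    # System check, not a Model method: no instance state is needed.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_id)
    return major >= 8
```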

mergify bot commented May 6, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@booxter (Contributor) left a comment:

I've looked at the patch from a bird's-eye view; I haven't checked the logic in detail. This is because I believe this PR will have to be split into separately reviewable changes before it can be considered for inclusion -- probably a series of commits, each doing one logical transformation.

I won't give you an immediate recipe for the exact split that would make sense here -- after all, you probably know better than anyone which logical steps this large PR consists of -- but as a generic example, here's what I would expect from a PR of this size and impact (or a set of such PRs):

  • a commit moving a unit of code from module A to B (with no changes to what the code does);
  • a commit cleaning up the unit of code (e.g. cleaning up spurious logging);
  • perhaps another commit cleaning up some other issue in the moved code;
  • another commit introducing a stub class Model that demonstrates the interface but has no implementation, with some basic unit tests creating an object of the type;
  • another commit plugging the previously moved and transformed unit of code into one of the methods of the new class, with additional unit test for this method, ...;
  • ...more commits in between of the similar nature...
  • a separate commit introducing a tox target triggering an integration test (the e2e thingy included here);
  • a separate commit refactoring the existing smoke workflow, moving the code that initializes a virtual environment with the necessary cuda dependencies and ilab into a reusable script;
  • a separate commit adding a CI job triggering the script above + new tox target;
  • another commit with docs...

Each of the commits would then describe in its message what is being done (can be as simple as "Move X function to model Y").

This is just an outline trying to show what this would look like in general terms. These commits can then be lumped into a single PR or -- preferably? -- multiple, so that we can start merging pieces as they shape up, with each PR combining a number of these smaller, separately reviewable commits.

This way, we will be able to give proper attention to each of the changes included here. Without it, we would have to take the refactor's execution on faith, which is a risk for regressions.

(I don't think going the route of a feature branch for this work is feasible either, because it won't make it any easier to review this work in a single piece later, when integration from the branch back to main has to happen. So it will only delay the process -- and exacerbate the problem, because more work will be merged on top of it in the meantime.)


If you need help figuring out how the split should be executed -- from the code-logic or git perspective -- let's discuss. I have some experience dealing with large refactors (from both sides).

@@ -0,0 +1,157 @@
chat:
@booxter (Contributor):

I believe we added support to ilab to pull defaults if not specified. If so, I suggest keeping this file trimmed to the exact changes you need. (It's not clear which those are at the moment.)

export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10
export HF_DATASETS_TRUST_REMOTE_CODE=true

python -c "import sys; sys.path.insert(0, '${SCRIPTDIR}'); from test_sdk import run_test; run_test('${knowledge_data_path}', '${skills_data_path}', nnodes=1, node_rank=0, nproc_per_node=4)"
@booxter (Contributor):

Would it make more sense to use the runner we already use for smoke tests -- tox? It would guarantee that whatever dependencies the current virtual environment picked up from installing ilab etc. will not be present when running the test.

mkdir -p "${E2E_LOG_DIR}"
}

test_train() {
@booxter (Contributor) commented May 8, 2025:

I'd like us to move away from shell test-only scripts and use tox as the test runner. This can be achieved by:

  • making the python run_test a proper pytest test case (or a set of them?);
  • moving setup code into a number of shell helper scripts that are not directly tied to testing (though used to prepare environment), e.g. have scripts/init_ilab.sh, scripts/download_models.sh, scripts/generate_data.sh, and then finally, a thin scripts/prepare_sdk_testenv.sh that may run all of the other scripts above in order. Then you have your workflow steps as: 1) ./scripts/prepare_sdk_testenv.sh 2) tox -e ....

Ultimately, I'd like to check out the repo and be able to call ./scripts/prepare_sdk_testenv.sh once and then iterate locally with tox as usual without repeating all the steps.

import instructlab.training.data_process as dp


def main(args):
@booxter (Contributor):

I think this file should be a (set of) pytest test cases driven through tox (like we do for smokes).

# mypy: disable_error_code="has-type"


class Model:
@booxter (Contributor):

Any particular reason to keep these classes lumped in the same file? Perhaps we should have model.py, accelerator.py, etc. It would simplify usage as well as reading the code.

FusedAdam = None
local_rank = int(os.getenv("LOCAL_RANK", "0"))
if __name__ == "__main__" and (not local_rank or local_rank == 0):
print("DeepSpeed is not available. Some features may be unavailable.")
@booxter (Contributor):

I think this should be handled (gracefully!) when the optimizer is set up, not at import time.
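A sketch of graceful handling at setup time, assuming the import-time fallback stays but goes silent (names like use_fused_adam are illustrative):

```python
import torch

try:
    from deepspeed.ops.adam import FusedAdam
except ImportError:
    FusedAdam = None  # silent at import; handled below when actually needed


def setup_optimizer(model, learning_rate: float, use_fused_adam: bool = False):
    # Fail (or fall back) only when the DeepSpeed optimizer is actually requested.
    if use_fused_adam:
        if FusedAdam is None:
            raise RuntimeError(
                "DeepSpeed is not installed, so FusedAdam is unavailable; "
                "install deepspeed or choose another optimizer."
            )
        return FusedAdam(model.parameters(), lr=learning_rate)
    return torch.optim.AdamW(model.parameters(), lr=learning_rate)
```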

return False


def setup_optimizer(
@booxter (Contributor):

I think you just moved code here from place to place (with minor changes like adding a beta argument and removing print()s)? This is an example of a change that could be separated into its own commit, which would help reviewers immensely. In general, mechanical changes should be separated, as should other logically distinct changes. Then reviewers can review your PR commit by commit and follow the logic of the transformations.

deepcopy(train_loader),
lr_scheduler,
)
# Necessary so that Accelerate does not step once per GPU
@booxter (Contributor) commented May 8, 2025:

This kind of context should not be removed when doing transformations. (I assume this comment makes a useful point that cannot easily be deduced from just reading the code -- otherwise it shouldn't be here -- so removing it when moving the code to prepare_with_optimizer is not desired.)

@cdoern (Contributor, Author) commented May 10, 2025

@booxter thanks for the in-depth comment! I will think about how to split this up; I'd imagine it'd be something like:

  1. move model initialization code from main_ds.py to a new module, add unit tests
  2. move optimizer initialization code to model module, add unit tests
  3. move accelerator code from main_ds.py to new module, add unit tests
  4. create the checkpointer class concept and move associated code from main_ds.py and utils.py into the model module, add unit tests
  5. add a new e2e test which runs the SDK.

I can work next week on splitting those changes into separate PRs.

@mergify mergify bot added the ci-failure label May 10, 2025
Introduce a new design for key components of main_ds.py (commit message; the full text duplicates the PR description above)
Signed-off-by: Charlie Doern <[email protected]>
mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
Labels: CI/CD (Affects CI/CD configuration), needs-rebase, testing (Relates to testing)

4 participants