feat: refactor main_ds.py #498

Open: wants to merge 1 commit into main from refactor

Conversation

@cdoern (Contributor) commented Apr 25, 2025

Introduce a new design for key components of main_ds.py -- namely, splitting model initialization, accelerator initialization, optimizer initialization, and checkpoint saving into classes:

  1. Model
  2. Accelerator
  3. Checkpointer

NOTE: a follow-up to this work will be to introduce classes/structure for the DataLoader, Sampler, etc. This was left out of this PR given the already large scope of the change.

The Model class wraps the various AutoModel classes we support -- it aims to be a lightweight wrapper that improves the usability of the library across different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.
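A minimal sketch of the shape described here -- constructor arguments, defaults, and the optimizer choice are illustrative assumptions, not the PR's exact signatures:

```python
# Illustrative sketch only; argument names and defaults are assumptions.
import torch
from transformers import AutoModelForCausalLM


class Model:
    """Lightweight wrapper around the AutoModel classes the library supports."""

    def __init__(self, model_path: str, lora_config=None, distributed_framework=None):
        self.lora_config = lora_config
        self.distributed_framework = distributed_framework
        self.model = AutoModelForCausalLM.from_pretrained(model_path)

    def setup_optimizer(self, learning_rate: float = 1e-5) -> torch.optim.Optimizer:
        # Returns one of the supported optimizer types; AdamW stands in here.
        return torch.optim.AdamW(self.model.parameters(), lr=learning_rate)
```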

The Accelerator class aims both to store commonly accessed variables associated with the accelerated model and to abstract model/optimizer mutation away from the user, who should only access our Model and Optimizer classes.

The Checkpointer class introduces a unified approach to our various checkpointing techniques. A user passes in their checkpointing style (full_state or hf_format), and the checkpointer, via checkpointer.checkpoint, saves the model using the selected method and any other applicable techniques (LoRA).
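Hypothetical end-to-end wiring of the three classes -- the import path, argument names, and file paths below are assumptions for illustration:

```python
# Assumed import path; the PR may keep these classes in a single module.
from instructlab.training import Accelerator, Checkpointer, Model

model = Model(model_path="./models/granite-7b-lab")  # path is illustrative
optimizer = model.setup_optimizer(learning_rate=1e-5)
accelerator = Accelerator(model=model, optimizer=optimizer)

checkpointer = Checkpointer(strategy="hf_format")  # or "full_state"
checkpointer.checkpoint(
    model=model,
    optimizer=optimizer,
    accelerator=accelerator,
    output_dir="checkpoints/",
)
```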

These classes are one of several steps needed to "SDK-ify" the training library.

Adding structure to code via classes can be either someone's favorite or least favorite thing, so I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code -- especially code meant to be a publicly consumable SDK -- and allow you to associate related objects and methods with one another.

Grouping functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing values like self.distributed_framework and self.lora_config so they are accessible across a class's methods drastically reduces the number of arguments per method, as well as the complexity of return values. Simpler methods with simpler arguments and return values make the code easier to test.

Additionally, this PR adds test_train and test_model unit tests and a new e2e test.

test_train.py mocks `train` and `train_epoch`. test_model.py tests instantiation of all the new classes: Model, Accelerator, and Checkpointer.

A new "Large" e2e PR job will run, testing the SDK and calculating loss. For now it uses ilab data generate to get skills and knowledge data, but it should be converted to use a static dataset from HF.

Loss on the main branch using the existing L40s x4 Large job:

https://github.com/instructlab/training/actions/runs/14767228280/attempts/1#summary-41467828439

Loss on this PR using L40s x4 Large Job:

https://github.com/instructlab/training/actions/runs/14812223441

Loss on this PR using the new L40s x4 Large SDK job (expect slightly different values):

https://github.com/instructlab/training/actions/runs/14812221612

@cdoern cdoern force-pushed the refactor branch 2 times, most recently from f4bf4b2 to c1564c5 on April 25, 2025 at 17:51
@cdoern cdoern mentioned this pull request Apr 25, 2025
@mergify mergify bot added the CI/CD (Affects CI/CD configuration), testing (Relates to testing), and ci-failure labels Apr 25, 2025
@cdoern cdoern force-pushed the refactor branch 6 times, most recently from 746f76a to 355aca0 on April 25, 2025 at 19:36
@cdoern cdoern force-pushed the refactor branch 2 times, most recently from 04c6543 to 3f57e60 on April 25, 2025 at 21:47
github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow failed on this PR: View run, please investigate.

@cdoern cdoern force-pushed the refactor branch 2 times, most recently from cc75fef to f24db49 on May 3, 2025 at 14:20

github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented May 3, 2025: E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented May 3, 2025: e2e workflow succeeded on this PR: View run, congrats!

@RobotSail (Member) commented:
@cdoern I haven't gone through this in depth yet, but I'm wondering if it would make sense to build this into a v1 and do a major version bump to v1.0.0.

I'm not sure how much you're looking to break the APIs we have today, but if we went down this path, it would be a very good time to explore this option. This way we can work it into something nicer and more ergonomic than what we have today.

Since I think it would also be unfair for you to be the only person working on this, with the PR sitting idle while changes are actively being made to the repo, I propose this:

  1. We create a new branch training-v1 where we can merge this in as soon as you're ready. That way I and anyone else can start using it and helping to build on top of it
  2. We can begin cutting dev releases such as instructlab-training-v1.0.0a1, instructlab-training-v1.0.0.preview, etc., so people can consume and use this
  3. We should identify any other bottlenecks in the current library and address them in this branch (only big things that are difficult to change later; we shouldn't hold this up for too long)
  4. We can create compatibility APIs so that any consumers (primarily ilab) aren't affected, similar to what we've done for the data processing refactor and what @booxter was suggesting on the other PR
  5. Once we are ready, we can merge into main and cut a new release

@cdoern (Contributor, Author) commented May 4, 2025

@RobotSail I think that is a great idea! Thank you for looking at this. I really wanted to get the loss gathering working to prove that, at the very least, I wasn't degrading anything -- but a branch other than main makes complete sense to me so we can begin iterating on this! I can open a different PR against that branch instead of the main branch here. Again, thanks for thinking about this and taking a look at the PR :)

@RobotSail (Member) commented:
@cdoern It looks like the loss does work 😄

@fynnsu (Collaborator) left a comment:

Looks good. Added a few suggestions.

if local_rank == 0
else None
)
all_metrics: List[Metrics] = []
@fynnsu (Collaborator):

Do we need this Metrics collection? It seems like it's not currently being used here (values are collected but not returned or used in this fn). I also believe the new logging implementation #500 should handle metric collection, so that we don't have to pass a list of Metrics up the stack.

Member:

Yeah, in general it seems there is a lot of overlapping work you've done, @fynnsu, which would be better built on top of the new version @cdoern is building out here. So I wonder if it would be easier to merge this into a dev branch, merge the other PRs there, then release it as main once everything is ready.

model: Model,
optimizer: torch.optim.Optimizer,
accelerator: Accelerator,
strategy="all",
@fynnsu (Collaborator):

Add output_dir to Checkpointer init / attributes so it doesn't need to be passed to every method.
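A sketch of what that suggestion could look like (class shape and defaults are illustrative):

```python
class Checkpointer:
    def __init__(self, strategy: str = "all", output_dir: str = "checkpoints/"):
        self.strategy = strategy
        self.output_dir = output_dir  # stored once, not passed to every method

    def checkpoint(self, model, optimizer, accelerator):
        save_path = self.output_dir  # previously a per-call argument
        ...
```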

model: Model,
samples_per_gpu: int,
grad_accum: int,
train_loader: DataLoader,
@fynnsu (Collaborator):

I don't think the train_loader should be attached to Accelerator; these feel like separate concepts to me.

Currently, the only real "use" of train_loader in this class is getting its length for calculating the LR schedule. However, the train_loader is also being accessed directly from the Accelerator attribute in train.py, which feels like an unintuitive way to access the data; we should just pass dataloaders directly to the train fns.
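A sketch of that direction -- the signature below is illustrative, not the PR's actual train fn:

```python
def train(model, optimizer, accelerator, train_loader):
    # The dataloader arrives as an explicit argument rather than being read
    # off accelerator.train_loader.
    for batch in train_loader:
        ...  # training step
```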

Comment on lines +500 to +502
deepspeed_cpu_offload_optimizer: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_pin_memory: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None,
@fynnsu (Collaborator):

Initialize DeepSpeedOptions in main_ds.py and pass it (or None) into Accelerator.__init__. Then you can just store self.deepspeed_cpu_offload_options on the accelerator.
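A hypothetical shape for that suggestion -- the dataclass below is a sketch, not the library's actual DeepSpeedOptions definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeepSpeedOptions:
    cpu_offload_optimizer: bool = False
    cpu_offload_optimizer_pin_memory: bool = False
    cpu_offload_optimizer_ratio: Optional[float] = None


# Inside Accelerator.__init__ the whole object is stored once:
#     self.deepspeed_cpu_offload_options = deepspeed_options  # may be None
```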


def save_dict_accelerate(
self,
accelerator: Accelerator,
@fynnsu (Collaborator):

I'm pretty sure this type is wrong. It should refer to accelerate.Accelerator, which you've aliased to TransformersAccel in this file.

In general, I find this duplicate use of Accelerator quite confusing. It would probably be beneficial to rename our Accelerator to something like DistributedBackend (or whatever) to avoid the name conflict.
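An illustration of the collision being described (sketch; bodies elided):

```python
from accelerate import Accelerator as TransformersAccel  # HF accelerate class


class Accelerator:  # the library's own wrapper -- easy to confuse with the alias
    ...


def save_dict_accelerate(accelerator: TransformersAccel, state_to_save):
    ...  # the annotation should name the HF class, not the library's wrapper
```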

def save_dict_accelerate(
self,
accelerator: Accelerator,
state_to_save,
@fynnsu (Collaborator):

It would also be useful to have type hints for these methods. This is potentially not your fault, as I imagine you mostly just moved this code around, but this method is currently only called with state_to_save as a model's state_dict(). With that type info, setting state_to_save.modules and state_to_save.parameters to a noop doesn't make sense, because those attributes don't exist in the first place.
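A sketch of the suggested annotation, assuming state_to_save is always a model state_dict:

```python
from typing import Dict

import torch
from accelerate import Accelerator as TransformersAccel


class Checkpointer:
    def save_dict_accelerate(
        self,
        accelerator: TransformersAccel,
        state_to_save: Dict[str, torch.Tensor],  # always a model state_dict
    ):
        ...
```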

Comment on lines +869 to +876
epoch: int,
samples_seen: int,
last_epoch: bool = False,
**kwargs,
@fynnsu (Collaborator):

`epoch` and `kwargs` are not used.

output_dir,
epoch: int,
samples_seen: int,
**kwargs,
@fynnsu (Collaborator):

`kwargs` is not used. I also just raised #527 to add checks for unused parameters to our linter.

Comment on lines +709 to +718
return cls(
model=model,
grad_accum=grad_accum,
train_loader=train_loader,
distributed_framework=DistributedBackend.FSDP,
samples_per_gpu=samples_per_gpu,
fsdp_sharding_strategy=fsdp_sharding_strategy,
fsdp_cpu_offload_params=fsdp_cpu_offload_params,
save_samples=save_samples,
)
@fynnsu (Collaborator):

Maybe these two classmethods should just use **kwargs?
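A sketch of the **kwargs variant being suggested -- the classmethod name and default handling below are illustrative:

```python
class Accelerator:
    def __init__(self, **options):
        self.options = options

    @classmethod
    def setup_fsdp(cls, **kwargs):
        # Only the framework is pinned; everything else passes straight through.
        kwargs.setdefault("distributed_framework", "fsdp")
        return cls(**kwargs)
```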

self.model = add_noisy_embeddings(self.model, noise_alpha=self.noise_alpha)

@staticmethod
def supports_flash_attention(device_id=0):
@fynnsu (Collaborator):

I think this staticmethod and the next one, check_flash_attn_enabled, were better as functions in utils.py. They're only loosely model-related -- really, they're system-related -- and they don't use any attributes from Model.
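As a free function in utils.py, that might look like the sketch below (the capability threshold is an assumption -- FlashAttention generally needs Ampere-class hardware or newer):

```python
import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    # System check, not a Model method: no instance state is needed.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_id)
    return major >= 8
```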

mergify bot commented May 6, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@booxter (Contributor) left a comment:

I've looked at the patch from a bird's-eye view; I haven't checked the logic in detail. This is because I believe this PR will have to be split into separately reviewable changes before it can be considered for inclusion -- probably a series of commits, each doing one logical transformation.

I won't give you an immediate recipe for the exact split that would make sense here -- after all, you probably know better than anyone which logical steps this large PR consists of -- but as a generic example, here's what I would expect from a PR of this size and impact (or a set of such PRs):

  • a commit moving a unit of code from module A to B (with no changes to what the code does);
  • a commit cleaning up the unit of code (e.g. cleaning up spurious logging);
  • perhaps another commit cleaning up some other issue in the moved code;
  • another commit introducing a stub class Model that demonstrates the interface but has no implementation, with some basic unit tests creating an object of the type;
  • another commit plugging the previously moved and transformed unit of code into one of the methods of the new class, with additional unit test for this method, ...;
  • ...more commits in between of the similar nature...
  • a separate commit introducing a tox target triggering an integration test (the e2e thingy included here);
  • a separate commit refactoring the existing smoke workflow, moving the code that initializes a virtual environment with the necessary cuda dependencies and ilab into a reusable script;
  • a separate commit adding a CI job triggering the script above + new tox target;
  • another commit with docs...

Each of the commits would then describe in its message what is being done (can be as simple as "Move X function to model Y").

This is just an outline trying to show what this would look like in general terms. These commits can then be lumped into a single PR or -- preferably? -- multiple, so that we can start merging pieces as they shape up, with each PR combining a number of these smaller, separately reviewable commits.

This way, we will be able to give proper attention to each of the changes included here. Without it, we would have to take the refactor's execution on faith, which is a risk for regressions.

(I don't think going the route of a feature branch for this work is feasible either, because it won't make it any easier to review this work in a single piece later, when integration from the branch back to main has to happen. So it will only delay the process -- and exacerbate the problem, because more work will be merged on top of it in the meantime.)


If you need help figuring out how the split should be executed -- from the code-logic or git perspective -- let's discuss. I have some experience dealing with large refactors (from both sides).

@@ -0,0 +1,157 @@
chat:
@booxter (Contributor):

I believe we added support to ilab to pull defaults if not specified. If so, I suggest keeping this file trimmed to the exact changes you need. (It's not clear which those are at the moment.)

export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10
export HF_DATASETS_TRUST_REMOTE_CODE=true

python -c "import sys; sys.path.insert(0, '${SCRIPTDIR}'); from test_sdk import run_test; run_test('${knowledge_data_path}', '${skills_data_path}', nnodes=1, node_rank=0, nproc_per_node=4)"
@booxter (Contributor):

Would it make more sense to use the runner we already use for smoke tests -- tox? It would guarantee that whatever dependencies the current virtual environment picked up from installing ilab etc. will not be present when running the test.

mkdir -p "${E2E_LOG_DIR}"
}

test_train() {
@booxter (Contributor) commented May 8, 2025:

I'd like us to move away from shell test-only scripts and use tox as the test runner. This can be achieved by:

  • making the python run_test a proper pytest test case (or a set of them?);
  • moving setup code into a number of shell helper scripts that are not directly tied to testing (though used to prepare environment), e.g. have scripts/init_ilab.sh, scripts/download_models.sh, scripts/generate_data.sh, and then finally, a thin scripts/prepare_sdk_testenv.sh that may run all of the other scripts above in order. Then you have your workflow steps as: 1) ./scripts/prepare_sdk_testenv.sh 2) tox -e ....

Ultimately, I'd like to check out the repo and be able to call ./scripts/prepare_sdk_testenv.sh once and then iterate locally with tox as usual without repeating all the steps.

import instructlab.training.data_process as dp


def main(args):
@booxter (Contributor):

I think this file should be a (set of) pytest test cases driven through tox (like we do for smokes).

# mypy: disable_error_code="has-type"


class Model:
@booxter (Contributor):

Any particular reason to keep these classes lumped in the same file? Perhaps we should have model.py, accelerator.py, etc. It would simplify usage as well as reading the code.

FusedAdam = None
local_rank = int(os.getenv("LOCAL_RANK", "0"))
if __name__ == "__main__" and (not local_rank or local_rank == 0):
print("DeepSpeed is not available. Some features may be unavailable.")
@booxter (Contributor):

I think this should be handled (gracefully!) when the optimizer is set up, not at import time.
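A sketch of graceful handling at setup time, assuming the import-time fallback stays but goes silent (names like use_fused_adam are illustrative):

```python
import torch

try:
    from deepspeed.ops.adam import FusedAdam
except ImportError:
    FusedAdam = None  # silent at import; handled below when actually needed


def setup_optimizer(model, learning_rate: float, use_fused_adam: bool = False):
    # Fail (or fall back) only when the DeepSpeed optimizer is actually requested.
    if use_fused_adam:
        if FusedAdam is None:
            raise RuntimeError(
                "DeepSpeed is not installed, so FusedAdam is unavailable; "
                "install deepspeed or choose another optimizer."
            )
        return FusedAdam(model.parameters(), lr=learning_rate)
    return torch.optim.AdamW(model.parameters(), lr=learning_rate)
```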

return False


def setup_optimizer(
@booxter (Contributor):

I think you just moved code here from place to place (with minor changes like adding a beta argument and removing print()s)? This is an example of a change that could be separated into its own commit, which would help reviewers immensely. In general, mechanical changes should be separated, as should other logically distinct changes. Then reviewers can review your PR commit by commit and follow the logic of the transformations.

deepcopy(train_loader),
lr_scheduler,
)
# Necessary so that Accelerate does not step once per GPU
@booxter (Contributor) commented May 8, 2025:

This kind of context should not be removed when doing transformations. (I assume this comment makes a useful point that cannot easily be deduced from just reading the code -- otherwise it shouldn't be here -- so removing it when moving the code to prepare_with_optimizer is not desired.)

@cdoern (Contributor, Author) commented May 10, 2025

@booxter thanks for the in-depth comment! I will think about how to split this up; I'd imagine it'd be something like:

  1. move model initialization code from main_ds.py to a new module, add unit tests
  2. move optimizer initialization code to model module, add unit tests
  3. move accelerator code from main_ds.py to new module, add unit tests
  4. create the checkpointer class concept and move associated code from main_ds.py and utils.py into the model module, add unit tests
  5. add a new e2e test which runs the SDK.

I can work next week on splitting those changes into separate PRs.

@mergify mergify bot added the ci-failure label May 10, 2025
Introduce a new design for key components of main_ds.py (commit message; the full text duplicates the PR description above)
Signed-off-by: Charlie Doern <[email protected]>
mergify bot commented May 12, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 12, 2025
Labels: CI/CD (Affects CI/CD configuration), needs-rebase, testing (Relates to testing)

4 participants