feat: refactor main_ds.py #498
Conversation
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow failed on this PR: View run, please investigate.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
e2e workflow succeeded on this PR: View run, congrats!
@cdoern I haven't gone through this in-depth yet, but I'm wondering if it would make sense to build this into a v1 and do a major version bump. I'm not sure how much you are looking to break the APIs that we have today, but if we went down this path then it would be a very good time to explore this option. This way we can work it into something nicer and more ergonomic compared to how it is today. Since I think it would also be unfair for you to be the only person working on this, and to have the PR sitting while changes are actively being made in the repo, I propose this:
@RobotSail I think that is a great idea! Thank you for looking at this, I really wanted to get the loss gathering working to prove at the very least I wasn't degrading anything -- but a branch other than main makes complete sense to me so we can begin iterating on this!! I can open a different PR to that branch as opposed to the main branch here. Again, thanks for thinking about this and taking a look at the PR :)
@cdoern It looks like the loss does work 😄
Looks good. Added a few suggestions.
if local_rank == 0
else None
)
all_metrics: List[Metrics] = [] |
Do we need this `Metrics` collection? It seems like it's not currently being used here (values are collected but not returned or used in this fn). I also believe the new logging implementation #500 should handle metric collection, so that we don't have to pass a list of `Metrics` up the stack.
Yeah, in general it seems there is a lot of overlapping work that you've done @fynnsu which is better suited to be built on top of the new version @cdoern is building out here, so I wonder if it would be easier to merge this into a dev branch, merge the other PRs there, and then release it as main once everything is ready.
model: Model,
optimizer: torch.optim.Optimizer,
accelerator: Accelerator,
strategy="all", |
Add `output_dir` to `Checkpointer` init/attributes so it doesn't need to be passed to every method.
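A rough sketch of what that could look like (method names and attributes beyond what the diff shows are assumptions, not the PR's actual API):

```python
from pathlib import Path


class Checkpointer:
    def __init__(self, model, optimizer, accelerator, strategy="all",
                 output_dir="checkpoints"):
        self.model = model
        self.optimizer = optimizer
        self.accelerator = accelerator
        self.strategy = strategy
        # stored once here so individual save methods no longer need an output_dir argument
        self.output_dir = Path(output_dir)

    def checkpoint(self, epoch: int, samples_seen: int) -> None:
        # every save method can derive its target directory from self.output_dir
        save_path = self.output_dir / f"epoch_{epoch}"
        save_path.mkdir(parents=True, exist_ok=True)
        # ... dispatch to the strategy-specific save method here
```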
model: Model,
samples_per_gpu: int,
grad_accum: int,
train_loader: DataLoader, |
I don't think the `train_loader` should be attached to `Accelerator`; these feel like separate concepts to me.
Currently, the only real "use" of `train_loader` in this class is getting its length for calculating the LR schedule. However, the `train_loader` is also being accessed directly from the `Accelerator` attribute in train.py, which feels like an unintuitive way to access the data; we should just pass dataloaders directly to the `train` fns.
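For illustration, a minimal sketch of passing the loader straight into the train function (the names here are hypothetical, not the PR's API):

```python
from torch.utils.data import DataLoader


def train(model, optimizer, accelerator, train_loader: DataLoader, num_epochs: int) -> None:
    # the LR schedule can still be sized from the loader here,
    # without the Accelerator ever owning the data
    total_steps = len(train_loader) * num_epochs
    for epoch in range(num_epochs):
        for batch in train_loader:
            ...  # forward / backward / optimizer step
```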
deepspeed_cpu_offload_optimizer: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_pin_memory: Optional[bool] = False,
deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None, |
Initialize `DeepSpeedOptions` in main_ds.py and pass it (or `None`) into `Accelerator.__init__`. Then you can just store `self.deepspeed_cpu_offload_options` on the accelerator.
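Roughly along these lines, assuming a `DeepSpeedOptions` container (the field names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeepSpeedOptions:
    cpu_offload_optimizer: bool = False
    cpu_offload_optimizer_pin_memory: bool = False
    cpu_offload_optimizer_ratio: Optional[float] = None


class Accelerator:
    def __init__(self, model, samples_per_gpu: int, grad_accum: int,
                 deepspeed_options: Optional[DeepSpeedOptions] = None):
        self.model = model
        self.samples_per_gpu = samples_per_gpu
        self.grad_accum = grad_accum
        # one attribute replaces the three loose cpu-offload arguments
        self.deepspeed_cpu_offload_options = deepspeed_options
```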
def save_dict_accelerate(
self,
accelerator: Accelerator, |
I'm pretty sure this type is wrong. This should be referring to `accelerate.Accelerator`, which you've aliased to `TransformersAccel` in this file.
In general, I find this duplicate use of `Accelerator` to be quite confusing. It would probably be beneficial if we renamed our `Accelerator` to something like `DistributedBackend` or whatever to avoid the name conflict.
def save_dict_accelerate(
self,
accelerator: Accelerator,
state_to_save, |
It would also be useful to have type hints for these methods. This is potentially not your fault, as I imagine you mostly just moved this code around, but this method is currently only called with `state_to_save` as a model's `state_dict()`. With that type info, setting `state_to_save.modules` and `state_to_save.parameters` to a noop doesn't make sense, because those attributes don't exist in the first place.
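For example, a possible typed signature, assuming `state_to_save` is always a model's state_dict (a mapping of parameter names to tensors) and using the accelerate alias mentioned above; the owning class is assumed here:

```python
from typing import Mapping

import torch
from accelerate import Accelerator as TransformersAccel


class Checkpointer:  # placement on Checkpointer is an assumption
    def save_dict_accelerate(
        self,
        accelerator: TransformersAccel,  # accelerate's class, not our own Accelerator wrapper
        state_to_save: Mapping[str, torch.Tensor],  # a model state_dict
    ) -> None:
        ...
```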
epoch: int,
samples_seen: int,
last_epoch: bool = False,
**kwargs, |
`epoch` and `kwargs` are not used.
output_dir,
epoch: int,
samples_seen: int,
**kwargs, |
`kwargs` is not used. I also just raised #527 to add checks for unused parameters to our linter.
return cls(
model=model,
grad_accum=grad_accum,
train_loader=train_loader,
distributed_framework=DistributedBackend.FSDP,
samples_per_gpu=samples_per_gpu,
fsdp_sharding_strategy=fsdp_sharding_strategy,
fsdp_cpu_offload_params=fsdp_cpu_offload_params,
save_samples=save_samples,
) |
Maybe these two classmethods should just use `**kwargs`?
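i.e. something like this (a sketch only; the classmethod name is a guess, and the enum stub stands in for the repo's existing `DistributedBackend`):

```python
from enum import Enum


class DistributedBackend(Enum):  # stand-in for the existing enum in this repo
    FSDP = "fsdp"
    DEEPSPEED = "deepspeed"


class Accelerator:
    def __init__(self, distributed_framework: DistributedBackend, **kwargs):
        self.distributed_framework = distributed_framework

    @classmethod
    def setup_fsdp(cls, **kwargs):
        # forward everything; only the framework choice is pinned here
        return cls(distributed_framework=DistributedBackend.FSDP, **kwargs)
```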
self.model = add_noisy_embeddings(self.model, noise_alpha=self.noise_alpha)

@staticmethod
def supports_flash_attention(device_id=0): |
I think this staticmethod and the next one, `check_flash_attn_enabled`, were better as functions in utils.py. They're only loosely model related; really they're system related, and they don't use any attributes from `Model`.
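e.g. as a plain function in utils.py (a sketch; the exact capability check the current code performs is an assumption here):

```python
import torch


def supports_flash_attention(device_id: int = 0) -> bool:
    """System-level capability check; needs nothing from Model."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_id)
    # flash attention generally requires Ampere (compute capability 8.0) or newer NVIDIA GPUs
    return major >= 8
```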
This pull request has merge conflicts that must be resolved before it can be merged.
I've looked at the patch from a bird's-eye view; I haven't checked the logic in detail. This is because I believe this PR will have to be split into separately reviewable changes before it can be considered for inclusion. Probably a series of commits, each doing one logical transformation.
I won't give you the immediate recipe on the exact split that would make sense here - after all, you probably know better which logical steps this large PR consists of than anyone else - but as a generic example, here's what I would expect from a PR of this size and impact (or a set of such PRs):
- a commit moving a unit of code from module A to B (with no changes to what the code does);
- a commit cleaning up the unit of code (e.g. cleaning up spurious logging);
- perhaps another commit cleaning up some other issue in the moved code;
- another commit introducing a stub class Model that demonstrates the interface but has no implementation, with some basic unit tests creating an object of the type;
- another commit plugging the previously moved and transformed unit of code into one of the methods of the new class, with additional unit test for this method, ...;
- ...more commits in between of the similar nature...
- a separate commit introducing a tox target triggering an integration test (the e2e thingy included here);
- a separate commit refactoring the existing smoke workflow, moving the code that initializes a virtual environment with the necessary CUDA dependencies and ilab into a reusable script;
- a separate commit adding a CI job triggering the script above + new tox target;
- another commit with docs...
Each of the commits would then describe in its message what is being done (can be as simple as "Move X function to model Y").
This is just an outline trying to show what this would look like in general terms. These commits can then be lumped into a single PR or - preferably? - multiple (so that we can start merging pieces that are ready as they shape up; each PR probably combining a number of these smaller separately reviewable commits).
In this way, we will be able to give proper attention to each of the changes included here. Without it, we would have to take the refactor execution on faith, which is a risk for regressions.
(I don't think going the route of a feature branch for this work is feasible either. This is because it won't make it any easier to review this work in a single piece later when integration from the branch back to main would have to happen. So it will only delay the process - and exacerbate the problem because more work will be merged on top of it in the meantime.)
If you need help figuring out how the split should be executed - from the code logic or git perspective - let's discuss this. I have some experience dealing with large refactors (from both sides).
@@ -0,0 +1,157 @@
chat: |
I believe we added support to `ilab` to pull defaults if not specified. If so, I suggest keeping this file trimmed to the exact changes you need. (It's not clear which these are at the moment.)
export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10
export HF_DATASETS_TRUST_REMOTE_CODE=true

python -c "import sys; sys.path.insert(0, '${SCRIPTDIR}'); from test_sdk import run_test; run_test('${knowledge_data_path}', '${skills_data_path}', nnodes=1, node_rank=0, nproc_per_node=4)" |
Would it make more sense to use the runner we already use for smoke - `tox`? It would guarantee that whatever dependencies the current virtual environment has from installing `ilab` etc., they will not be present when running the test.
mkdir -p "${E2E_LOG_DIR}"
}

test_train() { |
I'd like us to move away from shell test-only scripts and use tox as the test runner. This can be achieved by:
- making the python run_test a proper pytest test case (or a set of them?), as sketched below;
- moving setup code into a number of shell helper scripts that are not directly tied to testing (though used to prepare the environment), e.g. have `scripts/init_ilab.sh`, `scripts/download_models.sh`, `scripts/generate_data.sh`, and then finally a thin `scripts/prepare_sdk_testenv.sh` that may run all of the other scripts above in order. Then you have your workflow steps as: 1) `./scripts/prepare_sdk_testenv.sh` 2) `tox -e ...`.

Ultimately, I'd like to check out the repo and be able to call `./scripts/prepare_sdk_testenv.sh` once and then iterate locally with `tox` as usual without repeating all the steps.
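A rough sketch of the pytest side of this (the file location, env var names, and the run_test import path are hypothetical):

```python
# tests/e2e/test_sdk_train.py (hypothetical location)
import os

import pytest

from test_sdk import run_test  # reuse the existing script entry point as a helper


@pytest.mark.e2e
def test_sdk_training_run():
    # heavy environment prep (ilab init, model download, data generation) stays in
    # scripts/prepare_sdk_testenv.sh; tox only needs to invoke this test
    knowledge = os.environ["KNOWLEDGE_DATA_PATH"]
    skills = os.environ["SKILLS_DATA_PATH"]
    run_test(knowledge, skills, nnodes=1, node_rank=0, nproc_per_node=4)
```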
import instructlab.training.data_process as dp

def main(args): |
I think this file should be a (set of) pytest test cases driven through tox (like we do for smokes).
# mypy: disable_error_code="has-type"

class Model: |
Any particular reason to keep these classes lumped under the same file? Perhaps we should have model.py, accelerator.py, etc. It will simplify usage as well as reading the code.
FusedAdam = None
local_rank = int(os.getenv("LOCAL_RANK", "0"))
if __name__ == "__main__" and (not local_rank or local_rank == 0):
print("DeepSpeed is not available. Some features may be unavailable.") |
I think it should be (gracefully!) handled when the optimizer is set up, not on import.
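i.e. keep the import guard, but surface the failure where the optimizer is chosen (a sketch, not the PR's code; the fallback behavior shown is an assumption):

```python
import torch

try:
    from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
except ImportError:
    DeepSpeedCPUAdam = None
    FusedAdam = None


def setup_optimizer(model, learning_rate: float, cpu_offload: bool = False):
    if cpu_offload:
        if DeepSpeedCPUAdam is None:
            # raise (or fall back) with a clear message instead of printing at import time
            raise RuntimeError(
                "DeepSpeed is required for CPU-offloaded optimizers but is not installed."
            )
        return DeepSpeedCPUAdam(model.parameters(), lr=learning_rate)
    if FusedAdam is not None:
        return FusedAdam(model.parameters(), lr=learning_rate)
    return torch.optim.AdamW(model.parameters(), lr=learning_rate)
```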
return False

def setup_optimizer( |
I think you just moved code here from place to place? (With minor changes like adding the `beta` argument and removing print()s?) This is an example of a change that could be separated into its own commit, which would help reviewers immensely. In general, mechanical changes should be separated (as should other logically separate changes). Then reviewers can review your PR commit by commit and follow the logic of the transformations.
deepcopy(train_loader),
lr_scheduler,
)
# Necessary so that Accelerate does not step once per GPU |
This kind of context should not be removed when doing transformations. (I assume this comment makes some useful point that cannot be easily deduced from just reading the code, otherwise it should not be here; so removing it when moving from here to `prepare_with_optimizer` is not desired.)
@booxter thanks for the in-depth comment! I will think about how to split this up; I would imagine it'd be something like:

I can work next week on splitting up those changes into separate PRs.
Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving initialization into classes: Model, Accelerator, Checkpointer.

The Model class wraps the various AutoModel classes we support and aims to be a lightweight wrapper to help with usability of the library with different model types. setup_optimizer resides within the Model class and returns one of the optimizer types we support.

The Accelerator class aims to both store commonly accessed variables associated with the accelerated model and abstract model/optimizer mutation away from the user, who should only access our Model and Optimizer classes.

The Checkpointer class introduces a unified approach to our various checkpointing techniques. A user can pass in their checkpointing style (full_state or hf_format), and the checkpointer, via checkpointer.checkpoint, will save the model using the selected method and other techniques (LoRA).

These classes are one of a few steps needed to "SDK-ify" the training library.

Additionally, add test_train and test_model unit tests and an e2e test: test_train.py mocks `train` and `train_epoch`, and test_model.py tests instantiation of all new classes: Model, Accelerator, Optimizer, Checkpointer.

Signed-off-by: Charlie Doern <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving initialization into classes: Model, Accelerator, and Checkpointer.
NOTE: a follow up to this work will be to introduce classes/structure for the DataLoader, Sampler, etc. This was left out of this PR given the already large scope of change.
The Model class wraps the various AutoModel classes we support -- and aims to be a lightweight wrapper to help with usability of the library with different model types.
setup_optimizer resides within the Model class and returns one of the optimizer types we support.
The Accelerator class aims to both store commonly accessed variables associated with the accelerated model and abstract model/optimizer mutation away from the user, who should only access our Model and Optimizer classes.
The Checkpointer class introduces a unified approach to our various checkpointing techniques. A user can pass in their checkpointing style (full_state or hf_format), and the checkpointer, via checkpointer.checkpoint, will save the model using the selected method and other techniques (LoRA).
These classes are one of a few steps needed to "SDK-ify" the training library
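To make the intended flow concrete, a hypothetical usage sketch of the three classes described above (the import path, constructor arguments, and the classmethod name are assumptions, not the final API):

```python
# hypothetical import path and argument names
from instructlab.training import Accelerator, Checkpointer, Model

# train_loader: a torch DataLoader built elsewhere
model = Model(model_path="ibm-granite/granite-7b-base", flash_enabled=True)
optimizer = model.setup_optimizer(learning_rate=2e-5)

accelerator = Accelerator.setup_fsdp(  # classmethod name assumed
    model=model,
    samples_per_gpu=8,
    grad_accum=4,
    train_loader=train_loader,
    fsdp_sharding_strategy="SHARD_GRAD_OP",
    fsdp_cpu_offload_params=False,
    save_samples=0,
)

checkpointer = Checkpointer(
    model=model,
    optimizer=optimizer,
    accelerator=accelerator,
    strategy="hf_format",  # or "full_state"
)
checkpointer.checkpoint(output_dir="ckpts", epoch=1, samples_seen=10_000)
```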
Adding structure to code via classes can either be someone's favorite or least favorite thing. So I figured I'd explain myself before continuing. Here is my rationale:
Classes provide logical structure to code, especially code meant to be a publicly consumable SDK, and allow you to associate related objects and methods with one another.
Being able to group functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing things like self.distributed_framework, self.lora_config, etc. so that they are accessible across the class's methods drastically reduces the number of arguments per method, as well as complex return values. Simpler methods and argument/return values allow for simpler testing of the code.
Additionally, add test_train and test_model unit tests and a new PR e2e test.

Add test_train.py, which mocks `train` and `train_epoch`. Additionally, add test_model.py, which tests instantiation of all new classes: Model, Accelerator, and Checkpointer.

A new PR Large e2e job will run, testing the SDK and calculating loss. For now it uses `ilab data generate` to get skills and knowledge data, but it should be converted to use a static dataset from HF.

Loss on the main branch using the existing L40s x4 Large Job:
https://github.com/instructlab/training/actions/runs/14767228280/attempts/1#summary-41467828439
Loss on this PR using L40s x4 Large Job:
https://github.com/instructlab/training/actions/runs/14812223441
Loss on this PR using the new L40s x4 Large SDK Job (expect slightly different values):
https://github.com/instructlab/training/actions/runs/14812221612