Enable fine tuning on HPU #660
base: main
Conversation
Signed-off-by: Sergey Plotnikov <[email protected]>
This PR continues the work started by #552. We use #552 for some internal work and decided to keep it as is; I'll close it in the future.
Hi @splotnikv, have you been able to test these changes manually on HPU? If so, could you provide the full run logs and metric logs from a successful run?
src/instructlab/training/model.py
Outdated
def _post_model_init(self):
    """Common initialization steps that should happen after model initialization."""

    if self.device == "hpu" and os.getenv("HPU_ENABLE_TORCH_COMPILE", False):
I think it would be better UX to handle reading this env var differently, because the current behavior isn't intuitive.
For example, HPU_ENABLE_TORCH_COMPILE=False still enables torch.compile, because os.getenv returns the non-empty string "False", which is truthy.
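One way to make this stricter (a rough sketch, not tied to this PR's exact code; the helper name env_flag is made up) is to parse the variable against an explicit set of truthy strings:

import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# HPU_ENABLE_TORCH_COMPILE=False (or unset) now disables torch.compile
enable_hpu_compile = env_flag("HPU_ENABLE_TORCH_COMPILE")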
Agree, but it is a pretty common pattern, widely used in other repositories. In any case, I am going to change it to a command-line option; do you have any objections?
Signed-off-by: Sergey Plotnikov <[email protected]>
About the HPU-specific sampler: it's in my latest commit. It's still a work in progress; there are some TODOs in the code that I'm currently working on, but I decided to push it to get early feedback on this change. Here's how it differs from the existing sampler:
Known limitations: the major one is that small batches are not supported. The exact definition of "small" depends on the sample size distribution in the dataset; I expect that batch sizes below 32 may fail. This will be fixed in the next commit.
I've added small batch support. Now we can train with any batch size, including cases where the number of samples per batch is below the number of ranks. For example, I've tested on an 8-rank system with bs==5; the test passed, and the last three ranks processed dummy samples. This exact use case does not make much sense on its own, but it is crucial for last-batch processing, where we can get any number of samples depending on the dataset size.
Hi @Maxusmusti, see the attached run log. What is a metric log? Please let me know how I can get one, and I'll upload it.
Signed-off-by: Sergey Plotnikov <[email protected]>
I've fixed all the TODOs. The PR is ready for review now.
Signed-off-by: Sergey Plotnikov <[email protected]>
RobotSail
left a comment
Thank you for the PR @splotnikv. Overall this PR seems good, and the changes all seem reasonable for enabling HPU. I left a few comments inline and had a few questions:
- I see that the updated EpochSampler now has the ability to shuffle samples in a way that's biased toward grouping similar lengths. Have you compared this approach to having the same algorithm live in the Collator and operate at the batch level instead? Apologies in advance if this is already documented somewhere in the PR and I missed it.
- Would you be amenable to using device_type instead of device? We use both throughout the codebase, so it's important to prevent confusion wherever possible.
Otherwise, this PR makes sense and would be a great addition. Thank you for your hard work on this!
deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None,
fsdp_cpu_offload_params: Optional[bool] = False,
fsdp_use_orig_params: Optional[bool] = False,
device: Optional[str] = None,
Instead of just str, you might want to create a new type that constrains the choices between cuda and hpu:
from typing import Literal, Optional

AcceleratorDevice = Literal["cuda", "hpu"]
# ...
# safe to infer cuda, since this will be the most common option
device: Optional[AcceleratorDevice] = "cuda"

# save gradients
if self.save_grads and len(batch) > 1:
    self._copy_grads_to_buffer()
    self._zero_model_grads()
@miseshkiwrk How come we need to move the gradients back and forth through a buffer? Does HPU not allow accumulating grads directly in the existing tensors?
If this is the case, could you please leave a block comment somewhere describing the issue and why this is needed? Thanks in advance!
| help="Use Liger kernels for training.", | ||
| ) | ||
| parser.add_argument( | ||
| "--device", |
In the context of PyTorch, we typically assume device refers to a singular device on a node containing several accelerators, e.g.: cuda:0, cuda:1, etc.
Since you're looking to allow for switching between HPU and CUDA style training, maybe we could change this to be --device_type? This way, we can also constrain the options to be either cuda (by default) or hpu, and error out when given an option which doesn't fall in the set of allowed devices.
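For example, something roughly like this (a sketch; the rest of the parser setup is assumed):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device_type",
    choices=["cuda", "hpu"],
    default="cuda",
    help="Accelerator type to train on (not a specific device index such as cuda:0).",
)
# argparse rejects anything outside the allowed set, e.g. --device_type=tpu
args = parser.parse_args()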
    g.manual_seed(self.seed + self._epoch)
    samples = torch.randperm(self.len_data, generator=g).tolist()
else:
    sorted_indices = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i], reverse=True)
@splotnikv It seems like you're shuffling the dataset so that future random samples all share a more similar length. How does this currently compare to the same shuffling but at a batch-level?
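For reference, my reading of the approach is roughly the following (a simplified sketch with assumed names, not the PR's exact logic): sort indices by length, split them into bins, and shuffle within each bin so that neighbouring samples stay similar in length.

import random

def length_biased_shuffle(lengths, num_chunks, seed):
    # sort sample indices from longest to shortest
    rng = random.Random(seed)
    sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # split into contiguous bins of similar-length samples
    chunk_size = (len(sorted_indices) + num_chunks - 1) // num_chunks
    chunks = [sorted_indices[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]
    # shuffle only within each bin, trading randomness for more uniform lengths
    for chunk in chunks:
        rng.shuffle(chunk)
    return [idx for chunk in chunks for idx in chunk]

# usage: indices = length_biased_shuffle(dataset_lengths, num_chunks=4, seed=42)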
else:
    chunks.append(sorted_indices[i * chunk_size:(i + 1) * chunk_size])

import random
This imports random inside the function; could you please move the import to the module level?
self.model = torch.compile(self.model, backend="hpu_backend", dynamic=False)
for layer in self.model.model.layers:
    layer.compile(backend="hpu_backend", dynamic=False)
if os.environ.get("RANK", '0') == '0':
@splotnikv Are you trying to log from the main process on a given node here? The RANK env here corresponds to the rank of the node, not the process. If you just want to log a message on the main process (local rank 0), consider using the log_rank_0 function from instructlab.training.utils.
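For example (a short sketch, assuming log_rank_0 accepts a plain message string):

from instructlab.training.utils import log_rank_0

# emits the message only on the main process, without inspecting RANK here
log_rank_0("Compiled model with hpu_backend (dynamic=False)")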
flash_enabled: bool = False,
lora_config: Optional[LoraConfig] = None,
lora_quant_bits: int = 0,
device: Optional[str] = None,
@splotnikv Is device here intended to be a specific device, or the broader device type (cuda/hpu)? If it's the latter, we should rename this to device_type to be consistent with the expected usage. Otherwise, it will be easy to get this confused with a specific torch.device instance such as cuda:0.
# Initialize the batch loss manager
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank)
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank, args.device, args.device=="hpu" and args.torch_compile)
@splotnikv Since you're setting the save_grads kwarg conditionally and inline, would you mind specifying the current args by their kwargs explicitly? This way it'll be easier to manage and maintain in the future.
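For example, roughly (a sketch; the keyword names for the new parameters are assumed from this PR's fields):

batch_loss_manager = BatchLossManager(
    model,
    accelerator,
    world_size,
    local_rank,
    device=args.device,
    save_grads=(args.device == "hpu" and args.torch_compile),
)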
def set_random_seed(seed):
def set_random_seed(seed, device: str):
Would you mind adjusting this so that it's either using is_hpu or device_type?
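For instance, under the device_type naming (a sketch; I'm leaving the HPU branch as a placeholder since I'm not sure whether any Habana-specific seeding call is required):

import random

import numpy as np
import torch

def set_random_seed(seed: int, device_type: str = "cuda") -> None:
    # seed the Python, NumPy, and Torch CPU RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device_type == "cuda":
        # seed all CUDA devices on this node
        torch.cuda.manual_seed_all(seed)
    # for device_type == "hpu", any Habana-specific seeding would go here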
device: Optional[str] = None
torch_compile: bool = False
num_chunks: int = 1
I think these are good to expose, but what are your thoughts on adjusting the field names like this?
- device --> device_type (a constrained string of cuda | hpu, defaulting to cuda), to be clearer that this describes the underlying hardware rather than a specific device we train on
- num_chunks --> padded_packing_num_sorting_bins; this field represents a very specific implementation detail, so a longer name is fitting. A description should be added to explain that this helps training efficiency when packing has to be used, at the cost of randomness/increased bias.
Would it also make sense to have it be enable_torch_compile or use_torch_compile?
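Putting those together, the fields might look roughly like this (a sketch of the proposed names only, not a final API):

from typing import Literal

device_type: Literal["cuda", "hpu"] = "cuda"
use_torch_compile: bool = False
# number of length-sorted bins used for padded packing; higher values improve
# packing efficiency at the cost of shuffle randomness / increased bias
padded_packing_num_sorting_bins: int = 1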
RobotSail
left a comment
Actually, I think it's fine to have the EpochSampler do length-aware sampling. Could you please just update the parameters & docstrings to make clear that the feature is length-biased shuffling and that it's a variance-bias tradeoff? Thank you.
Also: Would you mind updating the docs to showcase your new additions? That way people are aware of the awesome changes you're adding 😃