
Conversation

@splotnikv

No description provided.

Signed-off-by: Sergey Plotnikov <[email protected]>
@splotnikv
Author

This PR continues the work started in #552. We use #552 for some internal work and decided to keep it as is; I'll close it in the future.

@Maxusmusti
Collaborator

Hi @splotnikv, have you been able to test these changes manually on HPU? If so, could you provide the successful full run logs and metric logs?

def _post_model_init(self):
    """Common initialization steps that should happen after model initialization."""

    if self.device == "hpu" and os.getenv("HPU_ENABLE_TORCH_COMPILE", False):


I think it would be better UX to handle reading this env var differently, because the current behavior isn't intuitive.
For example, HPU_ENABLE_TORCH_COMPILE=False still enables torch.compile.
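For context, os.getenv returns the raw string, and any non-empty string (including "False") is truthy in Python. A minimal sketch of parsing the flag explicitly, with a hypothetical helper name, could look like this:

import os

_TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name: str, default: bool = False) -> bool:
    # Hypothetical helper: parse a boolean-like environment variable explicitly.
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in _TRUTHY

# env_flag("HPU_ENABLE_TORCH_COMPILE") is False for "0", "false", or unset,
# unlike bool(os.getenv(...)), which is True for any non-empty string.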

Author


Agreed, but it's a pretty common pattern used widely in other repositories. In any case, I'm going to change it to a command-line option; do you have any objections?

Signed-off-by: Sergey Plotnikov <[email protected]>
@splotnikv
Author

Regarding the HPU-specific sampler: see my latest commit. It's still a work in progress; there are some TODOs in the code that I'm currently working on, but I decided to push it to get early feedback on this change.

Here's how it differs from the existing sampler:
(1) It distributes samples between ranks based on a cost function, balancing workload complexity across ranks. Think of SDPA complexity as a function of sample length. For now, it's just the number of tokens, but it'll be updated to something more sophisticated in the future.
(2) It breaks the training dataset into several bins of similar sample sizes to get a more even sample size distribution in each batch, then shuffles both bins and samples within bins. This reduces padding overhead and speeds up training.
(3) It adds bucketing, i.e. it aligns the maximum sample size in batches to reduce the number of graph recompilations, which also speeds up training (a rough sketch of (2) and (3) is included below).

Known limitations: The major one is that small batches are not supported. The exact definition of "small" depends on the sample size distribution in the dataset. I expect that batch sizes below 32 may fail. This will be fixed in the next commit.
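To make (2) and (3) concrete, here is a minimal, hypothetical sketch of length-based binning and of bucketing the per-batch maximum length. It is illustrative only; the actual sampler also balances cost across ranks and handles shuffling/seeding differently.

import random

def make_length_bins(lengths, num_bins, seed=0):
    # Sort indices by sample length, split them into bins of similar lengths,
    # then shuffle samples within each bin and shuffle the bins themselves.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bin_size = (len(order) + num_bins - 1) // num_bins
    bins = [order[i:i + bin_size] for i in range(0, len(order), bin_size)]
    rng = random.Random(seed)
    for b in bins:
        rng.shuffle(b)
    rng.shuffle(bins)
    return [i for b in bins for i in b]

def bucket_max_len(batch_lengths, step=128):
    # Round the batch's maximum length up to a multiple of `step`, so the HPU
    # graphs see fewer distinct shapes and recompile less often.
    longest = max(batch_lengths)
    return ((longest + step - 1) // step) * step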

@splotnikv
Author

I've added small batch support. Now we can train with any batch size, including cases where the number of samples per batch is below the number of ranks. E.g., I've tested on an 8-rank system with bs==5; the test passed, and the last three ranks processed dummy samples. This exact use case does not make much sense on its own, but it is crucial for last-batch processing, where we can get any number of samples depending on the dataset size.
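For illustration, the final partial batch can be padded up to a multiple of the world size with dummy indices, roughly along these lines (hypothetical helper, not the PR's actual code):

def pad_to_world_size(indices, world_size, dummy_index=-1):
    # Pad so every rank receives the same number of samples; ranks that get
    # `dummy_index` process a throwaway sample.
    remainder = len(indices) % world_size
    if remainder:
        indices = indices + [dummy_index] * (world_size - remainder)
    return indices

# e.g. 5 samples on an 8-rank system -> the last 3 ranks receive the dummy index
assert pad_to_world_size([0, 1, 2, 3, 4], 8) == [0, 1, 2, 3, 4, -1, -1, -1]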

@splotnikv
Author

> Hi @splotnikv, have you been able to test these changes manually on HPU? If so, could you provide the successful full run logs and metric logs?

Hi @Maxusmusti, see the attached run log. What is a metric log? Please let me know how I can get one, and I'll upload it.
run_log.txt

@splotnikv
Author

I've fixed all the TODOs. The PR is ready for review now.

Signed-off-by: Sergey Plotnikov <[email protected]>
@mergify mergify bot added the ci-failure label Oct 28, 2025
@RobotSail
Member

@coderabbitai

Member

@RobotSail left a comment


Thank you for the PR @splotnikv. Overall this PR seems good, and the changes all seem reasonable for enabling HPU. I left a few comments inline and just had a few questions:

  1. I see that the updated EpochSampler now has the ability to shuffle samples in a way that's biased to be similar in length. Have you compared this approach to having the same algorithm live in the Collator and operate at a batch-level instead? Apologies in advance if this is already documented somewhere in the PR and I missed it.

  2. Would you be amenable to using device_type instead of device? We use both throughout the codebase, so it's important to prevent confusion wherever possible.

Otherwise, this PR makes sense and would be a great addition. Thank you for your hard work on this!

deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None,
fsdp_cpu_offload_params: Optional[bool] = False,
fsdp_use_orig_params: Optional[bool] = False,
device: Optional[str] = None,
Member


Instead of just str, you might want to create a new type that constrains the choices between cuda and hpu:

from typing import Literal, Optional

AcceleratorDevice = Literal["cuda", "hpu"]

# ...

# safe to infer cuda, since this will be the most common option
device: Optional[AcceleratorDevice] = "cuda"

# save gradients
if self.save_grads and len(batch) > 1:
    self._copy_grads_to_buffer()
    self._zero_model_grads()
Member


@miseshkiwrk Why do we need to move the gradients back and forth through the buffer? Does HPU not allow accumulating grads directly in the current objects?

If this is the case, could you please leave a block comment somewhere describing the issue and why this is needed? Thanks in advance!

    help="Use Liger kernels for training.",
)
parser.add_argument(
    "--device",
Member


In the context of PyTorch, we typically assume device refers to a singular device on a node containing several accelerators, e.g.: cuda:0, cuda:1, etc.

Since you're looking to allow for switching between HPU and CUDA style training, maybe we could change this to be --device_type? This way, we can also constrain the options to be either cuda (by default) or hpu, and error out when given an option which doesn't fall in the set of allowed devices.
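Something along these lines (sketch only, continuing the argparse fragment above with the proposed flag name and choices):

parser.add_argument(
    "--device_type",
    type=str,
    choices=["cuda", "hpu"],
    default="cuda",
    help="Accelerator type to train on; argparse errors out for anything else.",
)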

    g.manual_seed(self.seed + self._epoch)
    samples = torch.randperm(self.len_data, generator=g).tolist()
else:
    sorted_indices = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i], reverse=True)
Member


@splotnikv It seems like you're shuffling the dataset so that future random samples all share a more similar length. How does this currently compare to the same shuffling but at a batch-level?

else:
    chunks.append(sorted_indices[i * chunk_size:(i + 1) * chunk_size])

import random
Member


This is a function-level import of random; could you please move it to the module level?

self.model = torch.compile(self.model, backend="hpu_backend", dynamic=False)
for layer in self.model.model.layers:
    layer.compile(backend="hpu_backend", dynamic=False)
if os.environ.get("RANK", '0') == '0':
Member


@splotnikv Are you trying to log from the main process on a given node here? The RANK env here corresponds to the rank of the node, not the process. If you just want to log a message on the main process (local rank 0), consider using the log_rank_0 function from instructlab.training.utils.
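i.e. roughly the following (assuming log_rank_0 accepts a plain message string; please check its actual signature in instructlab.training.utils):

from instructlab.training.utils import log_rank_0

# logs only on the main process instead of checking the RANK env var by hand
log_rank_0("Compiled model with hpu_backend")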

flash_enabled: bool = False,
lora_config: Optional[LoraConfig] = None,
lora_quant_bits: int = 0,
device: Optional[str] = None,
Member


@splotnikv Is device here intended to be a specific device, or the broader device type (cuda/hpu)? If it's the latter, we should rename this to device_type to be consistent with the expected usage. Otherwise, it will be easy to get this confused with a specific torch.device instance such as cuda:0.


# Initialize the batch loss manager
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank)
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank, args.device, args.device=="hpu" and args.torch_compile)
Member


@splotnikv Since you're setting the save_grads kwarg conditionally and inline, would you mind specifying the current args by their kwargs explicitly? This way it'll be easier to manage and maintain in the future.
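i.e. something like this (kwarg names inferred from the call above; adjust to BatchLossManager's actual signature):

batch_loss_manager = BatchLossManager(
    model=model,
    accelerator=accelerator,
    world_size=world_size,
    local_rank=local_rank,
    device=args.device,
    save_grads=(args.device == "hpu" and args.torch_compile),
)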



def set_random_seed(seed):
def set_random_seed(seed, device: str):
Member


Would you mind adjusting this so that it's either using is_hpu or device_type?

Comment on lines +258 to +261
device: Optional[str] = None
torch_compile: bool = False
num_chunks: int = 1

Member


I think these are good to expose, but what are your thoughts if we adjusted the field names like this?

  • device --> device_type (a constrained string of cuda | hpu, defaults to cuda), to be clearer on what the underlying hardware looks like, rather than a specific device we train on
  • num_chunks --> padded_packing_num_sorting_bins: this field represents a very specific implementation detail, so a longer name is fitting. A description should be added to explain that this helps training efficiency when packing has to be used, at the cost of randomness/increased bias (see the sketch below).

Would it also make sense to have it be enable_torch_compile or use_torch_compile?
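A rough sketch of what the renamed fields could look like (illustrative only; field names as proposed above):

from typing import Literal

device_type: Literal["cuda", "hpu"] = "cuda"
use_torch_compile: bool = False
# Number of length-sorted bins used for padded packing; more bins reduce
# padding overhead at the cost of less random (more length-biased) batches.
padded_packing_num_sorting_bins: int = 1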

Member

@RobotSail left a comment


Actually, I think it's fine to have the EpochSampler do length-aware sampling. Could you just please update the parameters & docstrings to be clear that the feature is length-biased shuffling, and is a variance-bias tradeoff? Thank you.

Also: Would you mind updating the docs to showcase your new additions? That way people are aware of the awesome changes you're adding 😃

