
Conversation

@splotnikv

No description provided.

Signed-off-by: Sergey Plotnikov <[email protected]>
@splotnikv
Author

This PR continues the work started in #552. We use #552 for some internal work and decided to keep it as is; I'll close it in the future.

@Maxusmusti
Collaborator

Hi @splotnikv, have you been able to test these changes manually on HPU? If so, could you provide the successful full run logs and metric logs?

def _post_model_init(self):
    """Common initialization steps that should happen after model initialization."""

    if self.device == "hpu" and os.getenv("HPU_ENABLE_TORCH_COMPILE", False):


I think it would be better UX to handle reading this env var differently, because the current behavior isn't intuitive.
For example, HPU_ENABLE_TORCH_COMPILE=False still enables torch.compile.
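For context, os.getenv returns the raw string, and any non-empty string (including "False") is truthy in Python. A minimal sketch of parsing the flag explicitly, with a hypothetical helper name, could look like this:

import os

_TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name: str, default: bool = False) -> bool:
    # Hypothetical helper: parse a boolean-like environment variable explicitly.
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() in _TRUTHY

# env_flag("HPU_ENABLE_TORCH_COMPILE") is False for "0", "false", or unset,
# unlike bool(os.getenv(...)), which is True for any non-empty string.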

Author


Agreed, but it's a pretty common pattern used widely in other repositories. In any case, I'm going to change it to a command-line option; do you have any objections?

Signed-off-by: Sergey Plotnikov <[email protected]>
@splotnikv
Author

Regarding the HPU-specific sampler: see my latest commit. It's still a work in progress; there are some TODOs in the code that I'm currently working on, but I decided to push it to get early feedback on this change.

Here's how it differs from the existing sampler:
(1) It distributes samples between ranks based on a cost function, balancing workload complexity across ranks. Think of SDPA complexity as a function of sample length. For now, it's just the number of tokens, but it'll be updated to something more sophisticated in the future.
(2) It breaks the training dataset into several bins of similar sample sizes to get a more even sample size distribution in each batch, then shuffles both bins and samples within bins. This reduces padding overhead and speeds up training.
(3) It adds bucketing, i.e. it aligns the maximum sample size in batches to reduce the number of graph recompilations, which also speeds up training (a rough sketch of (2) and (3) is included below).

Known limitations: The major one is that small batches are not supported. The exact definition of "small" depends on the sample size distribution in the dataset. I expect that batch sizes below 32 may fail. This will be fixed in the next commit.
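To make (2) and (3) concrete, here is a minimal, hypothetical sketch of length-based binning and of bucketing the per-batch maximum length. It is illustrative only; the actual sampler also balances cost across ranks and handles shuffling/seeding differently.

import random

def make_length_bins(lengths, num_bins, seed=0):
    # Sort indices by sample length, split them into bins of similar lengths,
    # then shuffle samples within each bin and shuffle the bins themselves.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bin_size = (len(order) + num_bins - 1) // num_bins
    bins = [order[i:i + bin_size] for i in range(0, len(order), bin_size)]
    rng = random.Random(seed)
    for b in bins:
        rng.shuffle(b)
    rng.shuffle(bins)
    return [i for b in bins for i in b]

def bucket_max_len(batch_lengths, step=128):
    # Round the batch's maximum length up to a multiple of `step`, so the HPU
    # graphs see fewer distinct shapes and recompile less often.
    longest = max(batch_lengths)
    return ((longest + step - 1) // step) * step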

@splotnikv
Author

I've added small batch support. Now we can train with any batch size, including cases where the number of samples per batch is below the number of ranks. E.g., I've tested on an 8-rank system with bs==5; the test passed, and the last three ranks processed dummy samples. This exact use case does not make much sense on its own, but it is crucial for last-batch processing, where we can get any number of samples depending on the dataset size.
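For illustration, the final partial batch can be padded up to a multiple of the world size with dummy indices, roughly along these lines (hypothetical helper, not the PR's actual code):

def pad_to_world_size(indices, world_size, dummy_index=-1):
    # Pad so every rank receives the same number of samples; ranks that get
    # `dummy_index` process a throwaway sample.
    remainder = len(indices) % world_size
    if remainder:
        indices = indices + [dummy_index] * (world_size - remainder)
    return indices

# e.g. 5 samples on an 8-rank system -> the last 3 ranks receive the dummy index
assert pad_to_world_size([0, 1, 2, 3, 4], 8) == [0, 1, 2, 3, 4, -1, -1, -1]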

@splotnikv
Author

> Hi @splotnikv, have you been able to test these changes manually on HPU? If so, could you provide the successful full run logs and metric logs?

Hi @Maxusmusti, see the attached run log. What is a metric log? Please let me know how I can get one, and I'll upload it.
run_log.txt

@splotnikv
Author

I've fixed all the TODOs. The PR is ready for review now.

Signed-off-by: Sergey Plotnikov <[email protected]>
@mergify mergify bot added the ci-failure label Oct 28, 2025
@RobotSail
Member

@coderabbitai

Member

@RobotSail left a comment


Thank you for the PR @splotnikv. Overall this PR seems good, and the changes all seem reasonable for enabling HPU. I left a few comments inline and just had a few questions:

  1. I see that the updated EpochSampler now has the ability to shuffle samples in a way that's biased to be similar in length. Have you compared this approach to having the same algorithm live in the Collator and operate at a batch-level instead? Apologies in advance if this is already documented somewhere in the PR and I missed it.

  2. Would you be amenable to using device_type instead of device? We use both throughout the codebase, so it's important to prevent confusion wherever possible.

Otherwise, this PR makes sense and would be a great addition. Thank you for your hard work on this!

deepspeed_cpu_offload_optimizer_ratio: Optional[float] = None,
fsdp_cpu_offload_params: Optional[bool] = False,
fsdp_use_orig_params: Optional[bool] = False,
device: Optional[str] = None,
Member


Instead of just str, you might want to create a new type that constrains the choices between cuda and hpu:

from typing import Literal, Optional

AcceleratorDevice = Literal["cuda", "hpu"]

# ...

# safe to infer cuda, since this will be the most common option
device: Optional[AcceleratorDevice] = "cuda"

# save gradients
if self.save_grads and len(batch) > 1:
    self._copy_grads_to_buffer()
    self._zero_model_grads()
Member


@miseshkiwrk Why do we need to move the gradients back and forth through the buffer? Does HPU not allow accumulating grads directly in the current objects?

If this is the case, could you please leave a block comment somewhere describing the issue and why this is needed? Thanks in advance!

    help="Use Liger kernels for training.",
)
parser.add_argument(
    "--device",
Member


In the context of PyTorch, we typically assume device refers to a singular device on a node containing several accelerators, e.g.: cuda:0, cuda:1, etc.

Since you're looking to allow for switching between HPU and CUDA style training, maybe we could change this to be --device_type? This way, we can also constrain the options to be either cuda (by default) or hpu, and error out when given an option which doesn't fall in the set of allowed devices.
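Something along these lines (sketch only, continuing the argparse fragment above with the proposed flag name and choices):

parser.add_argument(
    "--device_type",
    type=str,
    choices=["cuda", "hpu"],
    default="cuda",
    help="Accelerator type to train on; argparse errors out for anything else.",
)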

    g.manual_seed(self.seed + self._epoch)
    samples = torch.randperm(self.len_data, generator=g).tolist()
else:
    sorted_indices = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i], reverse=True)
Member


@splotnikv It seems like you're shuffling the dataset so that future random samples all share a more similar length. How does this currently compare to the same shuffling but at a batch-level?

else:
    chunks.append(sorted_indices[i * chunk_size:(i + 1) * chunk_size])

import random
Member


This is a function-level import of random; could you please move it to the module level?

self.model = torch.compile(self.model, backend="hpu_backend", dynamic=False)
for layer in self.model.model.layers:
    layer.compile(backend="hpu_backend", dynamic=False)
if os.environ.get("RANK", '0') == '0':
Member


@splotnikv Are you trying to log from the main process on a given node here? The RANK env here corresponds to the rank of the node, not the process. If you just want to log a message on the main process (local rank 0), consider using the log_rank_0 function from instructlab.training.utils.
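i.e. roughly the following (assuming log_rank_0 accepts a plain message string; please check its actual signature in instructlab.training.utils):

from instructlab.training.utils import log_rank_0

# logs only on the main process instead of checking the RANK env var by hand
log_rank_0("Compiled model with hpu_backend")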

flash_enabled: bool = False,
lora_config: Optional[LoraConfig] = None,
lora_quant_bits: int = 0,
device: Optional[str] = None,
Member


@splotnikv Is device here intended to be a specific device, or the broader device type (cuda/hpu)? If it's the latter, we should rename this to device_type to be consistent with the expected usage. Otherwise, it will be easy to get this confused with a specific torch.device instance such as cuda:0.


# Initialize the batch loss manager
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank)
batch_loss_manager = BatchLossManager(model, accelerator, world_size, local_rank, args.device, args.device=="hpu" and args.torch_compile)
Member


@splotnikv Since you're setting the save_grads kwarg conditionally and inline, would you mind specifying the current args by their kwargs explicitly? This way it'll be easier to manage and maintain in the future.
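i.e. something like this (kwarg names inferred from the call above; adjust to BatchLossManager's actual signature):

batch_loss_manager = BatchLossManager(
    model=model,
    accelerator=accelerator,
    world_size=world_size,
    local_rank=local_rank,
    device=args.device,
    save_grads=(args.device == "hpu" and args.torch_compile),
)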



def set_random_seed(seed):
def set_random_seed(seed, device: str):
Member


Would you mind adjusting this so that it's either using is_hpu or device_type?

Comment on lines +258 to +261
device: Optional[str] = None
torch_compile: bool = False
num_chunks: int = 1

Member


I think these are good to expose, but what are your thoughts if we adjusted the field names like this?

  • device --> device_type (a constrained string of cuda | hpu, defaults to cuda), to be clearer on what the underlying hardware looks like, rather than a specific device we train on
  • num_chunks --> padded_packing_num_sorting_bins: this field represents a very specific implementation detail, so a longer name is fitting. A description should be added to explain that this helps training efficiency when packing has to be used, at the cost of randomness/increased bias (see the sketch below).

Would it also make sense to have it be enable_torch_compile or use_torch_compile?
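A rough sketch of what the renamed fields could look like (illustrative only; field names as proposed above):

from typing import Literal

device_type: Literal["cuda", "hpu"] = "cuda"
use_torch_compile: bool = False
# Number of length-sorted bins used for padded packing; more bins reduce
# padding overhead at the cost of less random (more length-biased) batches.
padded_packing_num_sorting_bins: int = 1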

Member

@RobotSail left a comment


Actually, I think it's fine to have the EpochSampler do length-aware sampling. Could you just please update the parameters & docstrings to be clear that the feature is length-biased shuffling, and is a variance-bias tradeoff? Thank you.

Also: Would you mind updating the docs to showcase your new additions? That way people are aware of the awesome changes you're adding 😃

