enable fine tuning on HPU #552
Conversation
Thanks for the PR! The block alignment for Gaudi should happen outside of the Multipack logic. Please update your PR so that samples are correctly aligned for Gaudi systems outside of the sampler logic.
@@ -211,11 +213,11 @@ def ffd_check_padding(a: np.ndarray, c: int, n: int):
        not_found = True
        for idx in range(n):
            # Calculate the new capacity if size is added to the bin
-           new_capacity = max(bins_max_lengths[idx], size) * (
+           new_capacity = bucket(max(bins_max_lengths[idx], size)) * (
Why are we using `bucket` in multipack? This affects sampling for NVIDIA and AMD systems as well.
Multipack decides how many samples, and of what size, we can fit in any given batch. Bucketing increases the maximum sample length of the batch and thus changes the batch layout, so moving bucketing out of multipack decreases bucketing efficiency. Regarding the impact on other platforms: `bucket()` is defined as an identity function on anything except HPU, so it should have no impact.
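For context, a minimal sketch of the contract being described here, assuming import-based HPU detection and a fixed bucket size of 128; the PR's actual `bucket()` and `is_torch_hpu_available()` may be implemented differently:

```python
# Minimal sketch, not the PR's code: bucket() behaves as the identity
# everywhere except HPU, where lengths are rounded up to a coarse
# boundary so Gaudi compiles graphs for fewer distinct shapes.
# The detection logic and the bucket size of 128 are assumptions.
import math


def is_torch_hpu_available() -> bool:
    """Assume HPU is present iff the Habana PyTorch plugin can be imported."""
    try:
        import habana_frameworks.torch  # noqa: F401
        return True
    except ImportError:
        return False


def bucket(length: int) -> int:
    """Round a sample length up to the next bucket boundary on HPU;
    act as the identity function on every other platform."""
    if not is_torch_hpu_available() or length <= 0:
        return length
    return math.ceil(length / 128) * 128
```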
@splotnikv Why do you think that bucketing would be ineffective when moved outside of multipack? From multipack's perspective, it's just looking at lengths. If you move the bucketing outside, you should still get the same values, I believe.
Functionally, I'm seeing that `bucket` only returns a non-identity value if Gaudi is available, so I think @splotnikv is correct that this doesn't impact CUDA.
That being said, re: this implementation, the integration would be preferred as:

    new_capacity = max(bins_max_lengths[idx], size) ...
    if is_torch_hpu_available():
        new_capacity = bucket(new_capacity)

rather than obfuscating the CUDA path inside of the `bucket` function.
@JamesKunstle Regardless of `bucket`'s behavior on other platforms, the Multipack implementation is intended as a generic algorithm which should only operate on what it believes are numbers.
It would be equivalent and preferred to simply run `bucket` on all of the lengths ahead of time and pass them into Multipack, rather than updating the algorithm to now also consider what platform it's running on.
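As a sketch of that suggestion, the lengths could be bucketed once before the sampler ever sees them; `bucket()` and `is_torch_hpu_available()` are assumed to be importable from this PR, and the sampler call shown in the comment is illustrative only:

```python
# Hedged sketch of "bucket the lengths ahead of time": apply bucket()
# before constructing the sampler so the Multipack algorithm itself
# only ever deals with plain numbers.
import numpy as np


def prepare_lengths(raw_lengths) -> np.ndarray:
    """Bucket sample lengths once, outside of the sampler logic."""
    lengths = np.asarray(raw_lengths, dtype=np.int64)
    if is_torch_hpu_available():
        lengths = np.vectorize(bucket)(lengths)
    return lengths


# The pre-bucketed lengths would then be handed to the sampler, e.g.:
# sampler = MultipackDistributedBatchSampler(
#     batch_max_length=packing_max_batch_len,
#     lengths=prepare_lengths(raw_lengths),
#     num_replicas=world_size,
#     rank=rank,
# )
```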
@@ -266,11 +268,11 @@ def ffd_with_result_padding(a: np.ndarray, c: int, start_index: int):
        add_new = True
        for idx in range(len(bins_max_lengths)):
            # Calculate the new capacity if size is added to the bin
-           new_capacity = max(bins_max_lengths[idx], size) * (
+           new_capacity = bucket(max(bins_max_lengths[idx], size)) * (
This shouldn't go in the logic for multipack
Please see my reply above.
I've removed bucketing from the multipack sampler for now, but we still have to find a place to implement it. I've left it in the collate function, but that is not enough. Please advise how we can do it; e.g., I could implement an HPU-specific multipack sampler.
This pull request has merge conflicts that must be resolved before it can be merged.
In the next week or so we're going to implement some significantly better testing for this lib that'll give us more confidence in merging this. My ideal requirements to OK merging this are:
In general, this seems like a pretty minimally invasive integration, so I don't have much hesitation about merging it. I'd just like to make sure we've got good performance / correctness benchmarks ahead of time to hedge against regressions.
@RobotSail, I've refactored the bucketing implementation and moved it almost completely out of the multipack sampler; one conditional call still remains. @JamesKunstle, I've done minimal testing on HPU on our side and everything seems to work as expected; more testing is in progress and I'll post the results here when it's finished. I am also working on documentation and will add it in the next commit.
if is_torch_hpu_available():
    bucket_v = np.vectorize(bucket)
    lengths = bucket_v(lengths)
@splotnikv This implementation makes more sense, but it seems like we are still mixing responsibilities with what is expected from the Multipack sampler here.
How can we provide the bucketed lengths before they even enter the sampler in the first place?
Hi @RobotSail, unfortunately this is the best we can do without refactoring multipack. The problem is in how multipack handles sample lengths: it takes them from two different places, (1) from the dataset (see `get_effective_samples_per_minibatch()`) and (2) from the arguments of the `MultipackDistributedBatchSampler` constructor. If I remove bucketing from `generate_batches`, I'll have to move it to `get_effective_samples_per_minibatch` and also add one more call in `setup_dataloader`, i.e. call bucketing in every place where we create a `MultipackDistributedBatchSampler`. Let me know if you prefer this solution and I'll update the PR.
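A sketch of the refactor being weighed here, assuming a single shared helper that both `get_effective_samples_per_minibatch()` and `setup_dataloader()` could call before constructing the sampler (their real signatures are not shown in this thread):

```python
# Hedged sketch: one helper run over the lengths at every call site that
# creates a MultipackDistributedBatchSampler, assuming bucket() and
# is_torch_hpu_available() are importable from this PR's code.
import numpy as np


def maybe_bucket_lengths(lengths) -> np.ndarray:
    """Identity on CUDA/ROCm; element-wise bucket() on HPU."""
    lengths = np.asarray(lengths, dtype=np.int64)
    if is_torch_hpu_available():
        lengths = np.vectorize(bucket)(lengths)
    return lengths
```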
@splotnikv I believe you will need to move this bucketing logic out of multipack entirely and into data processing anyway. Since our data processing script drops samples above a certain length, we assume the data is below this threshold by the time we run the training loop (everything past the `main` function call in `main_ds.py`).
Since `bucket` clips the lengths up to a certain value so that batches can be properly padded on HPUs, there's a chance that your samples will actually exceed this threshold.
@RobotSail, I removed bucketing completely from `MultipackDistributedBatchSampler`, but it is still present in `get_effective_samples_per_minibatch()`. Please have a look.
About sample-length handling: sorry, but I don't understand your concern. Bucketing does not change a sample's length; it remains exactly the same. All batch-creation functionality remains the same and should be based on the actual sample length. Bucketing should be done at the very end, when the batch has been created and is ready to be sent to the model; e.g., samples longer than `max_seq_len` should be dropped regardless of bucketing. What bucketing changes is how samples are stored; think of it as `std::vector::capacity` vs `size`. BTW, if the "capacity" exceeds `packing_max_batch_len`, that is fine.
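A sketch of that "capacity vs. size" view, assuming a simple padding collator; the pad token, the -100 label padding, and the batch layout are illustrative assumptions, not the PR's actual collate function:

```python
# Hedged sketch of bucketing applied only at the end of batch creation:
# each sample keeps its true length ("size"); only the padded storage
# ("capacity") is rounded up via bucket() on HPU. bucket() and
# is_torch_hpu_available() are assumed to come from this PR.
import torch


def collate_with_bucketing(batch, pad_token_id=0):
    max_len = max(len(item["input_ids"]) for item in batch)
    # Pad up to the bucketed length on HPU, to the true max length elsewhere.
    target_len = bucket(max_len) if is_torch_hpu_available() else max_len

    input_ids = torch.full((len(batch), target_len), pad_token_id, dtype=torch.long)
    labels = torch.full((len(batch), target_len), -100, dtype=torch.long)
    for i, item in enumerate(batch):
        n = len(item["input_ids"])
        input_ids[i, :n] = torch.tensor(item["input_ids"], dtype=torch.long)
        labels[i, :n] = torch.tensor(item["labels"], dtype=torch.long)
    return {"input_ids": input_ids, "labels": labels}
```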
This PR enables fine-tuning on HPU. It supports FSDP, torch.compile, and bucketing. HPU is autodetected, and the change doesn't affect existing functionality. The PR passed basic smoke testing on HPU; more testing is in progress.
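A sketch of the kind of autodetection described above, assuming `is_torch_hpu_available()` from this PR; the stand-in model and the "hpu_backend" torch.compile backend name (taken from Intel Gaudi's public documentation) are illustrative, and the PR's actual wiring may differ:

```python
# Hedged sketch: prefer HPU when the Habana plugin is available,
# otherwise fall back to CUDA/CPU, leaving non-HPU behavior unchanged.
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real fine-tuned model

if is_torch_hpu_available():
    device = torch.device("hpu")
    model = torch.compile(model.to(device), backend="hpu_backend")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.compile(model.to(device))
```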