Skip to content

Conversation

@siqi-he
Copy link
Contributor

@siqi-he siqi-he commented Oct 12, 2025

This PR extends the memory reallocation optimization introduced in #11112 to MultiTargetHistBuilder.
In addition, this pr also adds a resize() call to both MultiTargetHistBuilder and HistUpdater as safeguards in situations when number of pages decreases.

I did a quick benchmark using a few datasets from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html that aren't too small while I could fit in my ec2 instance (c7i.24xl), and the changes seems consistent:

speed up
BlogCatalog 1.01
Flickr 1.17
siam-competition2007 1.00
mediamill 1.10
YouTube 1.04

@siqi-he siqi-he marked this pull request as ready for review October 12, 2025 22:32
@siqi-he
Copy link
Contributor Author

siqi-he commented Oct 13, 2025

Failed tests seem to be unrelated to changes in this pr, pls see below:
Screenshot 2025-10-12 at 11 06 20 PM
Screenshot 2025-10-12 at 11 07 12 PM

}
page_idx++;
}
partitioner_.resize(page_idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to resize this after the loop?

Copy link
Contributor Author

@siqi-he siqi-he Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's meant to be a safeguard if page size decreases between different calls. if the number of pages has decreased, LeafPartition which iterates partitioner_ with for (auto const &part : partitioner_) could touch on stale data. this wasn't a problem previously since we were clearing partitioner_ every time we call InitData.
it does seem to me the scenario above is extremely rare, but since adding a resize in most situations won't do anything, i figured it is probably ok to include it. they will be redundant if the number of pages never decreases.

Copy link
Member

@trivialfis trivialfis Oct 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing. In that case, would you like to help add a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. i've added tests for both HistUpdater and MultiTargetHistBuilder that stale partitioners would not write to p_out_position especially when we perform an update using a data with large number of pages first then followed by an update using a data with smaller number of pages. let me know if there is any issue

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the optimization.

@trivialfis trivialfis merged commit b3ed14d into dmlc:master Oct 21, 2025
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants