-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Non-deterministic map operator option in data loading #1088
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Looks good to me. I left a comment about thread pool for future reference, but it is not a blocker.
|
||
namespace fairseq2n::detail { | ||
|
||
class thread_pool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is okay for the time-being, but note that this will clash with TBB's thread pool similar to how it clashes with OpenMP and MKL. Ideally we should leverage TBB here instead of writing our own pool implementation.
std::atomic<bool> finished_{false}; | ||
std::atomic<size_t> tasks_in_flight_{0}; | ||
std::exception_ptr exception_ptr_{}; | ||
// TODO: Use this conditionally on FAIRSEQ2N_USE_TBB instead: tbb::task_arena pool_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree.
What does this PR do? Please describe:
Add the "deterministic" boolean parameter to the map operator in data_pipeline. It follows similar semantics as the tensorflow equivalent: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map , i.e. it trades the matching of output and input order with some execution speed.
Does your PR introduce any breaking changes? If yes, please list them:
No
Check list: