Is possible to restrict number of threads? #2988
Currently not. However, we probably should add this feature. CC @nalimilan |
We had discussed this when we added multithreading support. There are two options:
We could also have both. It might be useful to allow not only choosing between single-threading and multithreading, but also setting the number of threads, as using all threads isn't always faster. This is sometimes easy to implement, but sometimes more difficult. It doesn't have to be supported immediately, but it's important to keep in mind when choosing the name of the keyword argument (so that it can take either a Boolean or an integer). |
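To make the "Boolean or integer" idea concrete, here is a minimal sketch of how such a keyword argument could be normalized. It is not DataFrames.jl code: the names `resolve_ntasks`, `chunked_mapreduce`, and the keyword `threads` are hypothetical and only illustrate the naming concern raised above.

```julia
# Hypothetical helper (not part of DataFrames.jl): turn a `threads` value that is
# either a Bool or an integer into a number of tasks.
# Note that `Bool <: Integer` in Julia, so the Bool case must be checked first.
function resolve_ntasks(threads::Integer)
    threads isa Bool && return threads ? Threads.nthreads() : 1
    threads >= 1 || throw(ArgumentError("threads must be a positive integer"))
    return Int(threads)
end

# Hypothetical example function taking such a keyword argument.
# `xs` is split into at most `ntasks` contiguous chunks, one task per chunk.
function chunked_mapreduce(f, op, xs::AbstractVector; threads::Integer=true)
    isempty(xs) && throw(ArgumentError("xs must be non-empty"))
    ntasks = min(resolve_ntasks(threads), length(xs))
    ntasks == 1 && return mapreduce(f, op, xs)
    chunks = Iterators.partition(eachindex(xs), cld(length(xs), ntasks))
    tasks = [Threads.@spawn mapreduce(f, op, view(xs, idx)) for idx in chunks]
    return reduce(op, fetch.(tasks))
end
```

With this shape, `threads=false` forces the single-task path, `threads=true` uses `Threads.nthreads()` tasks, and `threads=2` asks for two tasks, so the same keyword name covers both use cases.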
@nalimilan - I understand that the plan would be to:
|
Maybe we should use the same strategy as Arrow.jl:
https://github.com/apache/arrow-julia/blob/52bfe1f72fa72b0b2d150efeeeb9edacd853f96f/src/write.jl#L49 The default is |
OK - but we should add a comment that currently only |
Following #3019 the key decision to make is if we:
|
For my use cases I primarily need it in the latter case ( |
After several discussions I am now leaning towards a global switch rather than an argument passed to functions. In particular it will then be easier to handle with What @jpsamaroo suggested is that we could consider using https://github.com/tkf/ContextVariablesX.jl instead of a global switch. @tkf the question is if you consider ContextVariablesX.jl stable/mature enough to be used (and what would be a recommended pattern here). What we need is a switch that will enable or disable multithreading in DataFrames.jl operations. CC @krynju |
ContextVariablesX.jl is a hack and "X" here means "eXperimental" (not, e.g., "eXtended"). It uses the logger as a "payload." It means that

julia> @contextvar CTX = 0
Main.CTX :: ContextVar [3eaa8739-b516-401d-aa2a-f5dd85d0f89c] => 0

julia> with_context(CTX => 1) do
           @show CTX[]
           with_logger(NullLogger()) do
               @show CTX[]
           end
           @show CTX[]
       end;
CTX[] = 1
CTX[] = 0
CTX[] = 1

That said, I don't think people call
FWIW, I don't think this pattern is great since it does not generalize over the amount of data to be processed. I think a much more composable pattern is to specify the per-task problem size (what is called |
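As a rough illustration of specifying a per-task problem size instead of a thread count, here is a divide-and-conquer sketch. The function name `tsum` and the keyword name `basesize` are assumptions made for this example, not an API taken from this thread or from DataFrames.jl.

```julia
# Hypothetical sketch: the caller states how much work one task should handle
# (`basesize` elements); the number of tasks then follows from the input size.
function tsum(f, xs::AbstractVector; basesize::Integer=1024)
    basesize >= 1 || throw(ArgumentError("basesize must be positive"))
    # Small enough: process sequentially in the current task.
    length(xs) <= basesize && return sum(f, xs)
    # Otherwise split in half, run the left half in a new task,
    # and recurse on the right half in the current task.
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = Threads.@spawn tsum(f, view(xs, firstindex(xs):mid); basesize=basesize)
    right = tsum(f, view(xs, (mid + 1):lastindex(xs)); basesize=basesize)
    return fetch(left) + right
end
```

Setting `basesize` to at least `length(xs)` gives a purely sequential run, so the same knob also works as an opt-out, and it scales with the amount of data rather than with the machine.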
@bkamins What is needed exactly with A global variable doesn't seem like the best API if it's intended to be used when passing a function that isn't thread safe (you'd have to set it before and unset it after an operation). Though we could easily have both a global variable, and use its value as the default for the argument that relevant functions would take.
@tkf Actually we already have this kind of mechanism for some operations internally. But the logic to choose the number of tasks is relatively complex so I'm not sure we can easily expose this to users. For example, for grouped operations, we spawn one task per operation (this is cheap so it's probably always worth it), and then one task per thread (as having one task per group would have a large overhead). We could allow specifying the number of groups per task, but the total number of rows can also be a relevant criterion depending on operations. For now, the idea is mainly to have a way to disable multithreading as that's the most common need. |
Here @krynju should confirm this, but to my understanding the requirement is that the call of some operation should not have to specify how many cores should be used. Instead there should be some separate mechanism that would allow the scheduler of the calls to decide if multithreading should be allowed or not (and it is OK to have just two options - do not do threading or use maximum number of threads available). I hope it is clear, but let me try to restate it (so that @krynju can confirm): Having said that I think that the proposal of @nalimilan is OK. Essentially it says (as a temporary solution until context lands in Julia which can take some time):
I think it is an OK approach:
@nalimilan - have I understood what you propose correctly? |
I think that's a nice idea to have a high-level hinting mechanism. If that's the intention, I think it'd be great if it were clear at the API level. For this, it'd be nice if the API talks about the idea and not the implementation details (e.g., number of actual |
OK. So DTable would need the global state to affect all multithreading, including cases where we know it's thread-safe and has very low overhead so that creating as many tasks as possible is most likely a good idea (e.g. when copying columns)? Having that state disable multithreading altogether is easy to implement, but allowing it to determine the number of tasks would require adjusting the code, making it more complex in several places. This may not be a great idea as @tkf notes. Ideally we would start as many tasks as we can and the scheduler would choose how many threads to use to run them. |
The point is that the Dagger.jl scheduler would want to opt out of multithreading not because it is unsafe, but because it knows it does other operations in parallel that DTable does not even have to be aware of. |
Yeah, that's easy to support. But does it also need to specify that DataFrames should use e.g. only 2 threads? |
No - as far as @jpsamaroo explained he thinks that for Dagger.jl it is enough to have a binary switch - either do not use threading or use as many threads as are available. |
See #3030 for an implementation of the global flag to disable multithreading. As a second step we will be able to add keyword arguments to some functions whose value would default to that of the global flag if we want. |
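For readers following along, here is a minimal sketch of the general pattern described above: a global flag plus a keyword argument that defaults to it. It is not the implementation from #3030; every name here (`MULTITHREADING`, `set_multithreading!`, `is_multithreaded`, `with_multithreading`, `apply_columns`) is hypothetical.

```julia
# Hypothetical global switch; an atomic so it can be read safely from any task.
const MULTITHREADING = Threads.Atomic{Bool}(true)

set_multithreading!(flag::Bool) = (MULTITHREADING[] = flag; flag)

# Multithreading is only worthwhile when Julia was started with several threads.
is_multithreaded() = MULTITHREADING[] && Threads.nthreads() > 1

# Temporarily override the flag and restore the old value afterwards,
# even if `f` throws.
function with_multithreading(f, flag::Bool)
    old = MULTITHREADING[]
    MULTITHREADING[] = flag
    try
        return f()
    finally
        MULTITHREADING[] = old
    end
end

# Hypothetical operation: the keyword default is evaluated at call time,
# so it picks up the current value of the global flag.
function apply_columns(f, cols::AbstractVector; threads::Bool=is_multithreaded())
    if threads
        tasks = [Threads.@spawn f(col) for col in cols]
        return fetch.(tasks)
    else
        return map(f, cols)
    end
end
```

A scheduler could call `set_multithreading!(false)` once, while a user with a function that is not thread-safe could pass `threads=false` to a single call and leave everything else untouched.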
Thank you for working on this! |
My hope is that the DTable can be agnostic to how underlying libraries like DataFrames choose to parallelize internal operations, but that Dagger itself can either:
Really, we need some shared API that DataFrames and Dagger can use to indicate when one parallelizable operation is nested within another, enough granularity to communicate how much parallelism is exposed (e.g. single-thread vs. all-threads), and a way to select the desired level of parallelism in a way that doesn't impose burden on user-facing APIs. My hope is that we could communicate this information via context variables, since that prevents burdening user-facing APIs, and the parallelism level can be "magically" propagated through task spawns by Julia and across workers by Dagger. |
Agreed, but this can be achieved only in the long term as far as I understand, and we need some temporary solution until this is possible. Currently In #3030 @nalimilan proposed a global variable - as discussed above. I think it is a reasonable temporary choice. Does it make sense to you? |
👍 to this! I'm just pointing out what, in my mind, is a more ideal future API, but a single global flag would definitely suffice (for most users) for now! |
Is it possible to restrict the number of threads for groupby->transform/combine?
A function I want to use is not thread-safe; however, I do not want to run the entire Julia session on a single thread.