Optimize SubtaskGraph generation #3342
# Note: `dtypes`, `index_value`, and `columns_value` are lazily
# initialized, so we should call property `params` to initialize
# these fields.
[o.params for o in out_chunks]
It's weird. What would happen without this code?
There will be no `columns_value` or `index_value`, which are used in `MainPool`.
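To illustrate the answer above, here is a minimal sketch of how touching a `params` property can force lazily initialized fields into existence. `LazyChunkData` and its private attributes are hypothetical stand-ins, not Mars classes; only the final list comprehension mirrors the line under review.

```python
class LazyChunkData:
    """Hypothetical stand-in for a chunk whose metadata is built on demand."""

    def __init__(self, raw_columns):
        self._raw_columns = raw_columns
        # Lazily initialized fields start out empty.
        self._columns_value = None
        self._index_value = None

    @property
    def params(self):
        # Accessing `params` materializes every lazy field as a side effect.
        if self._columns_value is None:
            self._columns_value = tuple(self._raw_columns)
        if self._index_value is None:
            self._index_value = range(len(self._raw_columns))
        return {
            "columns_value": self._columns_value,
            "index_value": self._index_value,
        }


out_chunks = [LazyChunkData(["a", "b"]), LazyChunkData(["c"])]
# Before `params` is touched, the fields are still missing.
assert all(c._columns_value is None for c in out_chunks)

# Touch `params` on every chunk so the fields exist before the chunks
# are handed off to another component.
[o.params for o in out_chunks]
assert all(c._columns_value is not None for c in out_chunks)
```

Without the list comprehension, a consumer that reads the private fields directly would see `None` in this sketch.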
These fields are lazily initialized, but fields `a` and `b` are both lazily initialized by `params`. Can you make `a` initialized by accessing `a`, and `b` initialized by accessing `b`? Then we can lazily initialize them in the Worker Main Pool.
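One possible shape for this suggestion, as a sketch only: each field computes itself on first access, so nothing has to be forced through `params` up front. `PerFieldLazyChunk`, `init_log`, and the use of `functools.cached_property` are assumptions made for illustration and are not taken from the Mars code base; `a` and `b` are the placeholder field names from the comment.

```python
from functools import cached_property


class PerFieldLazyChunk:
    """Hypothetical chunk where each field initializes itself on first access."""

    def __init__(self, raw_columns):
        self._raw_columns = raw_columns
        self.init_log = []  # records which fields were actually computed

    @cached_property
    def a(self):
        # Computed only when `a` is first read, then cached on the instance.
        self.init_log.append("a")
        return tuple(self._raw_columns)

    @cached_property
    def b(self):
        # Computed only when `b` is first read, independently of `a`.
        self.init_log.append("b")
        return range(len(self._raw_columns))


chunk = PerFieldLazyChunk(["x", "y"])
first_read = chunk.a   # initializes `a` only
second_read = chunk.a  # served from the cache, not recomputed
assert chunk.init_log == ["a"]
```

With this layout, the process that first reads a field pays for its initialization, which is the behavior the comment asks for.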
What do these changes do?
In `gen_subtask_graph`, Mars always creates new out chunks even if the out chunk already exists. This costs a lot of time when there are many chunks.
Related issue number
Fixes #3341
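The reuse described under "What do these changes do?" can be sketched roughly as follows. This is not the actual `gen_subtask_graph` implementation; `Chunk`, `build_out_chunks`, and the `existing` and `created` arguments are hypothetical names chosen to show the idea of looking up an out chunk before creating it.

```python
class Chunk:
    """Hypothetical minimal chunk identified by a key."""

    def __init__(self, key):
        self.key = key


def build_out_chunks(keys, existing, created):
    """Return one chunk per key, reusing chunks already stored in `existing`."""
    result = []
    for key in keys:
        chunk = existing.get(key)
        if chunk is None:
            # Only pay the construction cost for chunks not seen before.
            chunk = Chunk(key)
            existing[key] = chunk
            created.append(key)
        result.append(chunk)
    return result


existing = {}   # key -> chunk, shared across calls
created = []    # keys for which a new chunk was actually constructed
first = build_out_chunks(["c1", "c2"], existing, created)
second = build_out_chunks(["c2", "c3"], existing, created)
assert first[1] is second[0]  # "c2" is reused, not rebuilt
```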
I did a comparison, in which one creates new out chunks and the other does not. The test scripts are:
The time costs of `Subtask` generation are 122.92s and 56.63s.
Check code requirements