Benchmark amount of intermediate data stored #10
base: main
Conversation
-    executor=spec.executor,
+    spec=spec,
Seems redundant
     kwargs['callbacks'] = callbacks

-    with benchmarks(history):
+    with benchmarks(history, spec.work_dir):
The spec.work_dir isn't the right thing to pass - it's not specific enough. I want the directory containing only the intermediate results written during the last cubed run.
I can see that each such directory is uniquely labelled, which is nice, but I'm not sure exactly how to get this label from the computation. Should I import cubed.core.plan.CONTEXT_ID, or is there a simpler way?
The event object passed to on_compute_start (and on_compute_end) in the callback contains the context_id (see https://github.com/cubed-dev/cubed/blob/fda2f1e7bedec5389cae00ee751a54076ccf28e3/cubed/runtime/types.py#L28-L49). The HistoryCallback doesn't capture this - perhaps it could be changed, or maybe it's simpler to write a custom callback?
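A custom callback along those lines could be quite small. The sketch below captures the id from the start event; the method name `on_compute_start` and the `compute_id` attribute are assumed from cubed.runtime.types, so treat it as illustrative rather than tested against cubed itself:

```python
class ComputeIdCallback:
    """Sketch of a custom callback that records the id of the compute
    call. Attribute names (on_compute_start, event.compute_id) are
    assumed from cubed.runtime.types."""

    def __init__(self):
        self.compute_id = None

    def on_compute_start(self, event):
        # The event cubed passes in carries the id for this compute call.
        self.compute_id = event.compute_id

    def on_compute_end(self, event):
        pass
```

After the computation runs, the benchmark code could read `callback.compute_id` to locate the matching directory.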
In cubed.core.plan there is a CONTEXT_ID and a compute_id. The former seems to be used to make temporary directories, whilst the latter is what's stored in the event object. What's the difference between them?
CONTEXT_ID is the name of the directory created for the duration of the client Python process. The compute_id is a recent addition, used to create a directory for history data that is unique to each call to compute.
So you're right - you want CONTEXT_ID, but the intermediate data for all computations run in the pytest session will be there, so it's hard to pick out the arrays for a particular computation.
We should probably make Cubed store its intermediate data in a directory named {CONTEXT_ID}/{compute_id}, but that's a bit more work.
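Under that proposal, the intermediate store path for one compute call would look like this (the id values and bucket path here are hypothetical examples, not real cubed output):

```python
import posixpath

# Hypothetical example ids; real values are generated by cubed.
CONTEXT_ID = "cubed-20240101T120000-abcdef"
compute_id = "compute-20240101T120500-123456"

# Proposed layout: {work_dir}/{CONTEXT_ID}/{compute_id}
intermediate_dir = posixpath.join("s3://bucket/tmp", CONTEXT_ID, compute_id)
```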
Thanks for the explanation!
> We should probably make Cubed store its intermediate data in a directory named {CONTEXT_ID}/{compute_id}, but that's a bit more work.
I would like to do that - I can make a PR to cubed.
(Continuing discussion from cubed-dev/cubed#413)
I think the easier way to solve the original problem in #10 would be to just get the intermediate array paths from the DAG
So then the benchmark context managers need to know about the plan object, right? Or can we add it to what's saved in history.plan?
I tried looking at the plan object, but if this information is stored there I can't seem to find it. I thought looking at target would give me what I need, but for the smallest quad means example I just get None for each node, when clearly something was written.
I'm also wondering if there is any subtlety here with the optimized vs unoptimized plan?
The plan only has the op names now. Optimized vs unoptimized shouldn't make a difference. Try something like (untested):

array_names = [name for name in dag if name.startswith("array-")]

where dag comes from the callback object.
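As a self-contained sketch of that filter: in cubed the dag is a networkx graph whose iteration yields node names, so a plain list of names stands in for it here:

```python
# Stand-in for the DAG from the callback object; iterating a networkx
# graph yields its node names, so a plain list works for illustration.
dag = ["op-001", "array-001", "op-002", "array-002"]

# Intermediate arrays are the nodes whose names start with "array-".
array_names = [name for name in dag if name.startswith("array-")]
```

Those names could then be joined onto the work directory to look up each array's storage footprint.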
     kwargs['callbacks'] = callbacks

-    with benchmarks(history):
+    with benchmarks(history, spec.work_dir):
I can't see a way to avoid passing extra information to the benchmarks context manager - the history does not contain this information.
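A sketch of what the extended context manager could do with that extra argument. Everything here is hypothetical: history is treated as a plain list rather than the real HistoryCallback, and sizes are computed with pathlib, which only covers a local work_dir:

```python
import contextlib
from pathlib import Path

@contextlib.contextmanager
def benchmarks(history, work_dir=None):
    # Run the benchmarked block, then record the total size of the
    # intermediate data left behind in work_dir (if one was given).
    yield
    if work_dir is not None:
        total = sum(
            p.stat().st_size for p in Path(work_dir).rglob("*") if p.is_file()
        )
        history.append({"intermediate_bytes": total})
```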
     # List all files and subdirectories in the directory
     contents = fs.glob(f'{work_dir}/**', detail=True)
This is currently failing to discover any of the contents.
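For comparison, a local-filesystem sketch that walks the tree with pathlib instead of fs.glob. This sidesteps any fsspec glob quirks, but only covers the non-S3 case:

```python
from pathlib import Path

def directory_size(work_dir):
    # Sum the sizes of all regular files under work_dir, recursively.
    return sum(
        p.stat().st_size for p in Path(work_dir).rglob("*") if p.is_file()
    )
```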
     total_size = 0

     if work_dir.startswith('s3://'):
         fs = fsspec.filesystem('s3')
A simpler way to do this (which I only discovered yesterday!) is:

fs, _, _ = fsspec.get_fs_token_paths(work_dir)
WIP: Attempt to record the amount of intermediate data stored, just by looking at the work directory.