update: Rewrite update script #372
Conversation
Force-pushed from 63d1535 to 178226a.
I cannot reproduce the good performance. I compared the original update.py against my own PoC (called
Here is what it looks like:
@tleb On how many tags did you test? I decreased chunksize in my script to 100 (the "1000" argument in

@Daniil159x I'm not sure async would help much here. It might if the threads were blocked on I/O most of the time, but both my script and @tleb's do database I/O on a single thread. There is also I/O related to reading from other processes (a lot of processing happens in script.sh/ctags/git), but again, it's not clear to me whether that's the bottleneck, and async has some overhead of its own AFAIK. On the other hand, I do see that neither script achieves 100% CPU utilization, it always hovers around 99%, so maybe?
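For context, chunksize here refers to the batching argument of the multiprocessing map functions. A minimal sketch, where process_blob is a hypothetical stand-in for the real per-blob work:

    # Minimal sketch of the chunksize knob being discussed; process_blob
    # is a hypothetical stand-in for the real per-blob work.
    from multiprocessing import Pool

    def process_blob(blob):
        return len(blob)  # placeholder for the ctags/git processing

    if __name__ == "__main__":
        blobs = [b"a" * n for n in range(10_000)]
        with Pool(processes=4) as pool:
            # Larger chunks amortize IPC overhead; smaller chunks balance
            # load better when per-item cost varies a lot.
            results = pool.imap_unordered(process_blob, blobs, chunksize=100)
            total = sum(results)
        print(total)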
With the chunksize calculation from @tleb's script (already on my
elixir/update.py
Outdated
    if hash in self.hash_to_idx:
        return self.hash_to_idx[hash]
    else:
        return self.db.blob.get(hash)
Why do you have to look both locally and in the DB? Why not do only one or the other?
Some data (the hash -> idx and idx -> hash/filename mappings, whatever was in vers.db) is not saved into the database until the update process finishes. I was hoping to make interrupting the update process a bit safer thanks to that (although I'm actually not 100% sure anymore that default Berkeley DB can handle interruptions without breaking).
The idea is pretty basic: refs/defs added by an interrupted update won't have entries in the hash/filename/blob databases.
numBlobs is updated first, permanently reserving id space for the blobs currently being processed. An interrupted update might leave entries with unknown blob ids, but AFAIK this is handled gracefully by the backend (and it definitely could be, if it's not). The unknown entries could be garbage collected later.
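A rough sketch of that scheme as described above (class and attribute names follow this discussion, not necessarily the PR's actual code):

    # Rough sketch of the in-memory hash -> idx cache with database
    # fallback and up-front id reservation. Names are illustrative.
    class UpdatePartialState:
        def __init__(self, db, num_blobs):
            self.db = db
            self.hash_to_idx = {}
            self.next_idx = num_blobs  # first free blob id

        def get_blob_idx(self, hash):
            # Blobs added during this run live only in memory until the
            # update finishes; older blobs are already in the database.
            if hash in self.hash_to_idx:
                return self.hash_to_idx[hash]
            return self.db.blob.get(hash)

        def add_blob(self, hash):
            # numBlobs in the database is bumped before this point, so an
            # interrupted update leaves at most unused ids, never ids that
            # collide with a later run.
            idx = self.next_idx
            self.hash_to_idx[hash] = idx
            self.next_idx += 1
            return idx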
Can we do the processing on a version-by-version basis? That requires storing all key-value pairs to be updated somewhere: either in memory or in an append-only file. Then, once we are done indexing the version, we do an "update database" step that does all the writes.
That would avoid any database issue caused by indexing raising an error. It also removes all DB ops from the indexing functions, which stay purely focused on calling ctags or whatever.
So pseudocode would be like:

    for version_name in versions_todo:
        all_blobs = ...
        new_blobs = ...
        new_defs = compute_new_defs_multithreaded(version_name, new_blobs)
        list_of_defs_in_version = find_all_defs_in_all_blobs(all_blobs)
        new_refs = find_new_refs_multithreaded(version_name, list_of_defs_in_version)
        # same thing for all types of values we have

        # OK, we are done, we can update the database with new_defs, new_refs, etc.
        save_defs(new_defs)
        save_refs(new_refs)
        # ...
elixir/update.py
Outdated
    for idx, path in buf:
        obj.append(idx, path)

    state.db.vers.put(state.tag, obj, sync=True)
Why is part of the "add to databases" done in UpdatePartialState and another part is done here?
I separate the parts that can have garbage entries left behind by an unfinished update from the parts that cannot. vers is used to tell whether a tag was already indexed. That's also (partially) why using a database while it's being updated was such a problem.
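In other words, something like the following sketch, where the database API names are assumptions based on the diff above:

    # Sketch: vers is written last, so its presence implies every other
    # entry for the tag is complete. API names are assumptions.
    def tag_needs_indexing(db, tag):
        return db.vers.get(tag) is None

    def finish_tag(state, obj):
        # defs/refs/blobs (which may leave garbage) were written earlier;
        # committing vers last makes the tag visible "atomically".
        state.db.vers.put(state.tag, obj, sync=True)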
Overall, my complaint is that the code is not linear. That makes it really hard to understand, and I say this having read and written Elixir indexing code fairly recently. I can't imagine how myself or others will fare in two years' time.
As said in some comments, I'd expect the code to be more like:

    x = get_required_values()
    y = compute_y(x)
    z = compute_z(x, y)
    cleanup_foobar()
If things need to be done in parallel, it should be a single call to one of the pool.*map*() functions. That makes execution flow (what happens when) and resource management (what is used/freed when) easy to understand. Futures & co are nice features, but they make code obtuse.
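Concretely, the kind of flow I mean would look like this sketch (every name is a placeholder, not the PR's actual API):

    # Sketch of the linear flow with one pool.map call per stage.
    from multiprocessing import Pool

    def compute_defs(blob):
        return (blob, "defs")  # stand-in for the ctags work

    def compute_refs(blob):
        return (blob, "refs")  # stand-in for the tokenizer work

    def save(values):
        for key, value in values:
            print(key, value)  # stand-in for the database writes

    def index_version(blobs):
        with Pool() as pool:
            # When each map call returns, that stage is entirely done:
            # execution flow and resource lifetime are obvious.
            defs = pool.map(compute_defs, blobs)
            refs = pool.map(compute_refs, blobs)
        save(defs)
        save(refs)

    if __name__ == "__main__":
        index_version(["a.c", "b.c"])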
Something else not touched on in the comments (but discussed in real life :-) ) is that the logs are not useful. I haven't run this version (please rebase on master, which has the utils/index script, to make testing easier), but the code says it prints not-so-useful things.
A good starting point for the logs would be a summary at the end of each indexed tag. Something like (ideally aligned):

    tag v2.6: N blobs, M defs, O refs, P compatibles, S seconds
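For illustration, such a summary could be produced with something like this hypothetical helper:

    # Hypothetical helper producing the aligned per-tag summary.
    def log_tag_summary(tag, blobs, defs, refs, compatibles, seconds):
        print(f"tag {tag:>8}: {blobs:8} blobs, {defs:9} defs, "
              f"{refs:10} refs, {compatibles:6} compatibles, "
              f"{seconds:7.1f} seconds")

    log_tag_summary("v2.6", 23912, 141033, 9012345, 412, 381.4)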
One last thing about logging: make sure not to lose errors! There are try/except Exception blocks that swallow the exception and print a generic error message. That is not useful for debugging purposes. The script must do what it can (maybe stop all processing of the current version and try indexing the other versions), but the original error has to end up in the logs.
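A minimal sketch of error handling along those lines (function names hypothetical):

    # Keep the full traceback in the logs, then move on to the remaining
    # versions instead of aborting the whole update.
    import logging
    import traceback

    logger = logging.getLogger("update")

    def index_version(version):
        raise RuntimeError(f"ctags failed on {version}")  # stand-in failure

    def index_all(versions):
        for version in versions:
            try:
                index_version(version)
            except Exception:
                logger.error("indexing %s failed:\n%s",
                             version, traceback.format_exc())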
Thanks for the review!
Rebased.
I explained why the code is not linear in one of the review comments. I thought I would also state the design goals; we should've discussed them earlier. I think we agree about most of this, but some of it is up for discussion. From the current script:
If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.
Force-pushed from ec97c4c to e32aeef.
About scheduling, we know the processing tree we want.
Yes! That is exactly what the
That makes one assumption: do it version after version, task after task. I don't see how that constraint could slow down the computation.
Yes! A summary per tag is enough and will be much more useful than the current logs.
Yes, that is a major limitation. It is the main justification for "the one main thread does all the database operations".
Idea brought up in the code review comments above: we could do all the writes related to a version at the end of that version's processing. We would store the key-value pairs to be updated either in memory or in an append-only file. That way, we write to the database only if indexing was successful.
The above strategy is nice because it means we can do anything we want and don't depend on features from our database (like concurrent write support).
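A minimal sketch of that collect-then-commit strategy, assuming an illustrative database handle with per-table put methods:

    # Buffer every key-value pair while a version is indexed; touch the
    # database only once indexing succeeded. All names are illustrative.
    def extract_defs(blob):
        return [(blob, "def")]  # stand-in for ctags output

    def extract_refs(blob):
        return [(blob, "ref")]  # stand-in for tokenizer output

    def index_version_buffered(db, blobs):
        pending_defs, pending_refs = [], []
        for blob in blobs:
            pending_defs.extend(extract_defs(blob))
            pending_refs.extend(extract_refs(blob))
        # Reached only if nothing above raised: a failed run leaves the
        # database exactly as it was before the version started.
        for key, value in pending_defs:
            db.defs.put(key, value)
        for key, value in pending_refs:
            db.refs.put(key, value)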
We could fix #292 with this PR. For that, we must:
That is what was done in my
Avoid calling the parse-docs script when it cannot be useful; it is expensive. This heuristic avoids running it on most files and is almost free.
Signed-off-by: Théo Lebrun <[email protected]>
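The commit's actual heuristic is not shown in this thread; purely as an illustration of the pattern (a cheap marker check before the expensive subprocess), with the "DOC:" marker being a hypothetical choice:

    # Illustration only: cheap pre-check that skips the expensive
    # parse-docs subprocess for files that cannot contain documentation.
    import subprocess

    def run_parse_docs(path):
        # Stand-in for the expensive script.sh/parse-docs invocation.
        return subprocess.run(["true", path], check=True)

    def maybe_parse_docs(path):
        with open(path, "rb") as f:
            data = f.read()
        # Hypothetical marker check: only files containing the marker can
        # have doc comments worth parsing.
        if b"DOC:" not in data:
            return None  # almost free: no subprocess spawned
        return run_parse_docs(path)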
OK, so I wrote some more versions of the update script and benchmarked them a couple of times. Tests were done on a 4-core machine with 8 GB of RAM, indexing Linux v6.12.6. https://github.com/fstachura/elixir/tree/faster-update-variants
It seems that:
Based on that, I would just go with the "simplified" version, with the berkeleydb cache. https://github.com/fstachura/elixir/blob/faster-update-variants/elixir/update_simplified.py
By default ctags sorts its entries. This is not useful to the update script, but it takes time. User time for `update.py 16` on musl v1.2.5 went from 1m21.613s to 1m11.849s.
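For illustration, disabling the sort pass would look roughly like this; the flags besides --sort=no are assumptions about how the script invokes ctags:

    # Skip the ctags sort pass; the update script does not rely on entry
    # order, so this is pure time saved.
    import subprocess

    def run_ctags(path):
        result = subprocess.run(
            ["ctags", "-x", "--sort=no", path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.splitlines()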
The new update script divides work into tasks scheduled across a constant number of processes, instead of statically assigning a single long-running task to each thread. This results in better CPU saturation. Database handles are no longer shared between threads; instead, the main thread commits the results of the other processes into the database. This trades locking on database access for serialization costs: since multiprocessing is used, values returned from futures are pickled.
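A minimal sketch of that scheme (function and handle names are placeholders, not the PR's actual API):

    # Worker processes never touch the database; the main process
    # consumes their (pickled) results and commits them.
    from concurrent.futures import ProcessPoolExecutor, as_completed

    def index_chunk(chunk):
        # Runs in a worker process and returns plain picklable data.
        return [(path, path.upper()) for path in chunk]

    def update(db, chunks, workers=4):
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(index_chunk, c) for c in chunks]
            for future in as_completed(futures):
                # Only the main process holds a database handle, so no
                # locking is needed; the cost is pickling each result.
                for key, value in future.result():
                    db.put(key, value)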
Closes #289
Closes #292