
update: Rewrite update script #372


Open
wants to merge 4 commits into master

Conversation

fstachura
Collaborator

@fstachura fstachura commented Dec 29, 2024

Closes #289
Closes #292

@fstachura fstachura marked this pull request as draft December 29, 2024 22:43
@fstachura
Collaborator Author

fstachura commented Dec 29, 2024

I noticed that deduplicating definitions from references doesn't work properly.

@fstachura fstachura marked this pull request as ready for review December 30, 2024 00:20
@Daniil159x

Hi, I think these changes are good and work better than update.py on master.

But I see the CPU going idle while futures are processed in the main thread.
(image)

Maybe async would be better at utilizing the CPU?

@tleb
Member

tleb commented Feb 1, 2025

I cannot reproduce the good performance. I compared the original update.py against my own PoC (called update-ng.py below) and against yours (called update-franek.py below). Everything is in a single branch to simplify testing (sorry for the crappy commit messages).

Command             Mean [s]           Min [s]    Max [s]    Relative
update.py           40.472 ± 0.196     40.250     40.617     4.72 ± 0.04
update-ng.py         8.578 ± 0.055      8.531      8.639     1.00
update-franek.py    80.363 ± 0.164     80.204     80.531     9.37 ± 0.06

Here is what it looks like:

⟩ hyperfine --min-runs 3 --export-markdown benchmark-table.md \
--parameter-list update update.py,update-ng.py,update-franek.py \
--prepare 'rm -rf data/musl/data/*' \
'TLEB_UPDATE={update} TLEB_NO_FETCH=1 ./utils/index ./data musl'
Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     40.472 s ±  0.196 s    [User: 71.356 s, System: 39.680 s]
  Range (min … max):   40.250 s … 40.617 s    3 runs

Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):      8.578 s ±  0.055 s    [User: 72.419 s, System: 38.537 s]
  Range (min … max):    8.531 s …  8.639 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     80.363 s ±  0.164 s    [User: 78.747 s, System: 49.339 s]
  Range (min … max):   80.204 s … 80.531 s    3 runs

Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    4.72 ± 0.04 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    9.37 ± 0.06 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  • script.sh is limited to the first 10 tags because the laptop is on battery.
  • TLEB_NO_FETCH=1 makes sure that utils/index does not try doing Git fetches (avoids network ops in the benchmark).
  • Not run in a Docker container so that hyperfine reports valid usr and sys timings.
  • I have a weird reproducible issue with my update-ng.py that has those timings: wallclock 7.040s, usr 18.589s, sys 178.334s. On the same system, update.py does wallclock 22.012s, usr 23.667s, sys 46.578s. Notice the massive sys, which I cannot understand.

@fstachura
Collaborator Author

fstachura commented Feb 6, 2025

@tleb On how many tags did you test? I decreased the chunksize in my script to 100 (the "1000" argument in the update_version call) and it seems to be running much faster now, without much performance difference between our scripts. But I also ran it only on a single tag. The chunksize calculation in yours makes much more sense, though.
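For context, a common chunksize heuristic (an illustrative sketch of the general idea, not necessarily what either script does) is to size chunks so that each worker gets several of them, which avoids one long-running chunk leaving the other workers idle at the end:

import os

def compute_chunksize(num_items, num_workers=None, chunks_per_worker=4):
    # Several chunks per worker means a slow chunk near the end does not
    # leave the remaining workers with nothing to do.
    num_workers = num_workers or os.cpu_count() or 1
    return max(1, num_items // (num_workers * chunks_per_worker))

print(compute_chunksize(5000, num_workers=8))   # -> 156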

@Daniil159x I'm not sure if async would help much here. It maybe would if threads were blocked on I/O most of the time, but both my script and @tleb's do database I/O on a single thread. There is also I/O related to reading from other processes (a lot of processing happens in script.sh/ctags/git), but again, it's not clear to me if that's the bottleneck. async also has some overhead AFAIK. On the other hand, I do see that neither of the scripts achieves 100% CPU utilization, it always hovers around 99%, so maybe?

@fstachura
Collaborator Author

With the chunksize calculation from @tleb's script (already on my faster-update branch), the performance of the scripts is very similar, at least on my machine:

Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     82.743 s ±  0.109 s    [User: 125.653 s, System: 88.404 s]
  Range (min … max):   82.623 s … 82.835 s    3 runs
 
Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     41.873 s ±  0.280 s    [User: 154.509 s, System: 132.848 s]
  Range (min … max):   41.685 s … 42.195 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     45.776 s ±  0.854 s    [User: 154.594 s, System: 138.161 s]
  Range (min … max):   45.131 s … 46.745 s    3 runs
 
Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    1.09 ± 0.02 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    1.98 ± 0.01 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl

elixir/update.py Outdated
if hash in self.hash_to_idx:
    return self.hash_to_idx[hash]
else:
    return self.db.blob.get(hash)
Member

Why do you have to look both locally and in the DB? Why not do only one or the other?

Collaborator Author

Some data, like the hash -> idx and idx -> hash/filename mappings, and whatever was in vers.db, is not saved into the database until the update process finishes. I was hoping to make interrupting the update process a bit safer thanks to that (although I'm actually not 100% sure anymore that default Berkeley DB can handle interruptions without breaking).

The idea is pretty basic: refs/defs added in an interrupted update won't have entries in the hash/filename/blob databases.
numBlobs is updated first, permanently reserving id space for the blobs currently being processed. An interrupted update might leave entries with unknown blob ids, but AFAIK this is handled gracefully by the backend (and it definitely could be if it's not). The unknown entries could be garbage collected later.
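A minimal sketch of that deferred-write idea (the names here are illustrative, modelled on the snippet above rather than taken from the PR):

class PendingBlobIndex:
    """Illustrative only: new hash -> idx mappings are kept in memory and
    only written out once the whole update has finished."""

    def __init__(self, blob_db):
        self.blob_db = blob_db        # persistent hash -> idx mapping
        self.hash_to_idx = {}         # pending entries, not yet persisted

    def get_blob_idx(self, blob_hash):
        # Check pending entries first, then fall back to whatever a
        # previous, completed update already persisted.
        if blob_hash in self.hash_to_idx:
            return self.hash_to_idx[blob_hash]
        return self.blob_db.get(blob_hash)

    def flush(self):
        # Called only after indexing succeeded, so an interrupted run
        # leaves the persistent database untouched.
        self.blob_db.update(self.hash_to_idx)
        self.hash_to_idx.clear()

# A plain dict stands in for the Berkeley DB handle in this sketch.
index = PendingBlobIndex({"abc123": 0})
index.hash_to_idx["def456"] = 1
print(index.get_blob_idx("abc123"), index.get_blob_idx("def456"))   # 0 1
index.flush()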

Member

@tleb tleb Feb 14, 2025

Can we do the processing on a version-per-version basis? That requires storing all key-value pairs to be updated somewhere: either in memory or in an append-only file. Then, once we are done indexing the file, we do an "update database" step that does all the writes.

That would avoid any database issue caused by indexing raising an error. Also, it removes all DB ops from the indexing functions, which can then focus purely on calling ctags or whatever.

So pseudocode would be like:

for version_name in versions_todo:
	all_blobs = ...
	new_blobs = ...
	new_defs = compute_new_defs_multithreaded(version_name, new_blobs)
	list_of_defs_in_version = find_all_defs_in_all_blobs(all_blobs)
	new_refs = find_new_refs_multithreaded(version_name, list_of_defs_in_version)
	# same thing for all types of values we have

	# OK, we are done, we can update the database with new_defs, new_refs, etc.
	save_defs(new_defs)
	save_refs(new_refs)
	# ...

elixir/update.py Outdated
for idx, path in buf:
    obj.append(idx, path)

state.db.vers.put(state.tag, obj, sync=True)
Member

Why is part of the "add to databases" done in UpdatePartialState and another part is done here?

Collaborator Author

I separate the parts that can have garbage entries left by an unfinished update from the parts that cannot. vers is used to tell whether a tag was already indexed. That's also (partially) why using a database while it's being updated was such a problem.

Member

@tleb tleb left a comment

Overall, my complaint is that the code is not linear. That makes it really hard to understand, and I've been reading and writing Elixir indexing code fairly recently. I can't imagine how I or others will fare in two years' time.

As said in some comments, I'd expect the code to look more like:

x = get_required_values()

y = compute_y(x)
z = compute_z(x, y)

cleanup_foobar()

If things need to be done in parallel, it should be a single call to a pool.*map*() function. That makes the execution flow (what happens when) and resource management (what is used/freed when) easy to understand. Futures & co are nice features, but they make the code obtuse.
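A minimal sketch of that shape, with a toy token scan standing in for the real ctags/script.sh work (function names and data are made up for illustration, not taken from the PR):

from multiprocessing import Pool

def parse_blob(blob):
    # Pure worker: takes one blob and returns a computed value. There is
    # no database access and no scheduling knowledge in here.
    filename, contents = blob
    defs = [tok for tok in contents.split() if tok.isidentifier()]
    return filename, defs

def index_version(blobs, processes=4):
    results = {}
    with Pool(processes) as pool:
        # One map call per step: the pool handles scheduling, and only
        # the main process consumes the results, so only it would ever
        # touch the database (a dict stands in for it here).
        for filename, defs in pool.imap_unordered(parse_blob, blobs, chunksize=16):
            results[filename] = defs
    return results

if __name__ == "__main__":
    blobs = [("a.c", "static int foo"), ("b.c", "void bar")]
    print(index_version(blobs, processes=2))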

Something else not touched on in the comments (but discussed in real life :-) ) is that the logs are not useful. I haven't run this version (please rebase on master, which has the utils/index script, to make testing easier), but the code says it prints not-so-useful things.

A good starting point for logs would be a summary at the end of each indexed tag. Something like (ideally aligned):

tag v2.6: N blobs, M defs, O refs, P compatibles, S seconds

One last thing about logging: make sure not to lose errors! There are try/except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. The script must do what it can (maybe stop all processing of the current version and try indexing other versions).

@fstachura
Collaborator Author

fstachura commented Feb 12, 2025

Thanks for the review!

please rebase on master, which has the utils/index script, to make testing easier

Rebased.

One last thing about logging: make sure not to lose errors! There are try/except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. The script must do what it can (maybe stop all processing of the current version and try indexing other versions).

logging.exception also prints the exception, including its traceback. I also don't like that the exception is not explicitly passed as an argument.
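For illustration (a generic example, not the PR's code): logging.exception logs the message at ERROR level together with the traceback of the exception currently being handled, so the error is not lost even though the exception object is never passed explicitly:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("update")

def index_tag(tag):
    try:
        raise RuntimeError("ctags failed")   # stand-in for the real work
    except Exception:
        # Logs the message *and* the active exception's traceback, then
        # lets the caller carry on with the next tag.
        logger.exception("indexing %s failed, skipping tag", tag)

index_tag("v1.2.5")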

I explained why the code is not linear in one of the review comments.

I thought I would state the design goals. We should've discussed that earlier. I think we agree on most of this, but some of it is up for discussion.

From the current script:

  • Avoid threads (GIL)
  • Make scheduling more effective (tag locks, a long tail that runs on only a single thread)
  • Separate scheduling from computation (this also refers to your ideas to compute an index update, and apply it later, maybe with another program)
  • Make logs more meaningful
  • Do not write to the database from multiple threads
  • Make update script crashes/interruptions safer

If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.

@tleb
Member

tleb commented Feb 14, 2025

I thought I would state the design goals. We should've discussed that earlier. I think we agree on most of this, but some of it is up for discussion.

From the current script:

  • Avoid threads (GIL)
  • Make scheduling more effective (tag locks, a long tail that runs on only a single thread)

About scheduling, we know the processing tree we want.

  • One main thread is responsible for spawning everything.
  • It has child processes (or threads but I don't see the point) that do all the CPU work.
  • Those should output to some IPC (like stdout) that gets retrieved by the main process to update the database.
  • Separate scheduling from computation (this also refers to your ideas to compute an index update, and apply it later, maybe with another program)

Yes! That is exactly what the pool.*map*() abstractions give us. One pure function takes a blob and returns the computed value. It cannot be bothered by scheduling because it doesn't know about it. It should have access to no (or almost no) shared resources.

That makes one assumption: we go version after version, task after task. I don't see how that constraint could slow down the computation.

  • Make logs more meaningful

Yes! A summary per tag is enough and will be much more useful than what we have currently.

  • Do not write to the database from multiple threads

Yes, that is a major limitation. It is the main justification for "the one main thread does all the database operations".

  • Make update script crashes/interruptions safer

An idea brought up in the code review comments above: we could do all the writes related to a version at the end of processing that version. We would store the key-value pairs to update either in memory or in an append-only file. That way, we write to the database only if indexing was successful.
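A rough sketch of the in-memory variant (names and the toy per-blob computation are made up for illustration):

def compute_entries(version, blob):
    # Toy stand-in for the real per-blob work (ctags, refs, docs, ...).
    filename, contents = blob
    return [((version, filename, tok), "def") for tok in contents.split()]

def index_version_buffered(db, version, blobs):
    pending = {}   # key -> value pairs to write, kept in memory for now

    for blob in blobs:
        # Accumulate results only: the database is not touched here, so
        # an error or interruption at this point leaves it unchanged.
        for key, value in compute_entries(version, blob):
            pending[key] = value

    # Indexing succeeded: apply all the writes in one final step.
    db.update(pending)

db = {}   # a dict stands in for the real database in this sketch
index_version_buffered(db, "v1.2.5", [("a.c", "foo bar")])
print(db)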

If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.

The above strategy is nice because it means we can do anything we want and don't depend on features from our database (like concurrent write support).

@tleb
Member

tleb commented Feb 25, 2025

We could fix #292 with this PR. For that, we must:

  • Do versions one after the other.
  • Store all defs inside the current version in memory.
  • Then pick from that memory set when we check if a ref should be added.
  • Have a procedure to extract all defs for a version N when we do iterative indexing.

That is what was done in my update.py PoC. It works but uses a lot of memory. It does fix indexing, though, and I think it is worth it!
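A minimal illustration of that approach (toy data structures, not the PoC's actual code): the defs of the version are gathered or extracted first and kept in memory, and a token is only recorded as a reference if it belongs to that set:

def collect_refs(blobs, defs_in_version):
    # A token only counts as a reference if it is defined somewhere in
    # the same version; unknown tokens are dropped, which keeps the ref
    # count consistent whether one tag or many are indexed in a run.
    refs = []
    for filename, tokens in blobs:
        refs.extend((tok, filename) for tok in tokens if tok in defs_in_version)
    return refs

defs_in_version = {"foo", "bar"}
blobs = [("a.c", ["foo", "qux"]), ("b.c", ["bar", "foo"])]
print(collect_refs(blobs, defs_in_version))   # qux is not a def, so no ref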

Avoid calling the parse-docs script, which is expensive. This heuristic
avoids running it on most files, and is almost free.

Signed-off-by: Théo Lebrun <[email protected]>
@fstachura
Collaborator Author

OK, so I wrote some more versions of the update script and benchmarked them a couple of times. Tests were done on a 4-core machine with 8 GB of RAM, indexing Linux v6.12.6.

https://github.com/fstachura/elixir/tree/faster-update-variants

Command                                   Mean [s]             Min [s]     Max [s]     Relative
ELIXIR_CACHE=1 python3 -m elixir.update   2545.016 ± 29.337    2515.303    2573.962    1.00
python3 -m elixir.gen_update              2780.758 ± 37.764    2749.192    2822.595    1.09 ± 0.02
python3 -m elixir.update                  2849.570 ± 17.191    2829.777    2860.770    1.12 ± 0.01
python3 -m elixir.update_simplified       2852.024 ± 63.585    2793.141    2919.447    1.12 ± 0.03
python3 -m elixir.file_update             3335.096 ± 13.836    3326.338    3351.047    1.31 ± 0.02
python3 update.py 128                     6676.816 ± 29.255    6659.263    6710.587    1.00

It seems that:

  • Explicitly adding a cache to the bsddb speeds up the update process (sketched below)
  • Using a generator in scriptLines, instead of reading the output of the script into an array of lines, seems to have some performance benefits
  • My version of the file-based update script (add definitions/references to an append-only file, later sort and merge keys, then add to the database to reduce database writes) does not seem to perform well. Using tmpfs usually caused the system to go OOM, and writing to network storage performed worse.

Based on that I would just go with the "simplified" version, with the berkeleydb cache.

https://github.com/fstachura/elixir/blob/faster-update-variants/elixir/update_simplified.py
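For reference, here is roughly what enabling an explicit Berkeley DB cache looks like with the berkeleydb bindings (the cache size is an arbitrary example, and this is not necessarily how the ELIXIR_CACHE=1 variant is wired up in that branch):

import berkeleydb.db as bdb

def open_btree(filename, cache_mb=256):
    handle = bdb.DB()
    # set_cachesize(gbytes, bytes, ncache) must be called before open().
    # A larger cache keeps more pages in memory during a big update,
    # cutting down on disk I/O per write.
    handle.set_cachesize(0, cache_mb * 1024 * 1024, 1)
    handle.open(filename, None, bdb.DB_BTREE, bdb.DB_CREATE)
    return handle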

fstachura added 3 commits May 30, 2025 14:23
By default ctags sorts entries. This is not useful to the update script,
but takes time.
User time for `update.py 16` on musl v1.2.5 went from 1m21.613s to
1m11.849s.
The new update script divides work into tasks scheduled across
a constant number of processes, instead of statically assigning
a single long-running task to each thread.
This results in better CPU saturation.

Database handles are not shared between threads anymore; instead,
the main thread is used to commit the results of other processes into
the database.
This trades locking on database access for serialization costs: since
multiprocessing is used, values returned from futures are pickled.
Development

Successfully merging this pull request may close these issues.

  • Inconsistent number of references if update job is ran for multiple tags at once
  • Improve indexing performance (update.py)