
update: Rewrite update script #372


Open
wants to merge 4 commits into master

Conversation

fstachura
Collaborator

@fstachura fstachura commented Dec 29, 2024

Closes #289
Closes #292

@fstachura fstachura marked this pull request as draft December 29, 2024 22:43
@fstachura
Collaborator Author

fstachura commented Dec 29, 2024

I noticed that deduplicating definitions from references doesn't work properly.

@fstachura fstachura marked this pull request as ready for review December 30, 2024 00:20
@Daniil159x

Hi, I think these changes are good and work better than update.py on master.

But I see the CPU going idle while futures are processed in the main thread.
(image)

Maybe async would be better at utilizing the CPU?

@tleb
Member

tleb commented Feb 1, 2025

I cannot reproduce the good performance. I compared the original update.py against my own PoC (called update-ng.py below) and against yours (called update-franek.py below). Everything is in a single branch to simplify testing (sorry for the crappy commit messages).

Command             Mean [s]           Min [s]    Max [s]    Relative
update.py           40.472 ± 0.196     40.250     40.617     4.72 ± 0.04
update-ng.py         8.578 ± 0.055      8.531      8.639     1.00
update-franek.py    80.363 ± 0.164     80.204     80.531     9.37 ± 0.06

Here is what it looks like:

⟩ hyperfine --min-runs 3 --export-markdown benchmark-table.md \
--parameter-list update update.py,update-ng.py,update-franek.py \
--prepare 'rm -rf data/musl/data/*' \
'TLEB_UPDATE={update} TLEB_NO_FETCH=1 ./utils/index ./data musl'
Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     40.472 s ±  0.196 s    [User: 71.356 s, System: 39.680 s]
  Range (min … max):   40.250 s … 40.617 s    3 runs

Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):      8.578 s ±  0.055 s    [User: 72.419 s, System: 38.537 s]
  Range (min … max):    8.531 s …  8.639 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     80.363 s ±  0.164 s    [User: 78.747 s, System: 49.339 s]
  Range (min … max):   80.204 s … 80.531 s    3 runs

Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    4.72 ± 0.04 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    9.37 ± 0.06 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  • script.sh is limited to the first 10 tags because the laptop is on battery.
  • TLEB_NO_FETCH=1 makes sure that utils/index does not try doing Git fetches (avoids network ops in the benchmark).
  • Not run in a Docker container so that hyperfine reports valid usr and sys timings.
  • I have a weird reproducible issue with my update-ng.py that has those timings: wallclock 7.040s, usr 18.589s, sys 178.334s. On the same system, update.py does wallclock 22.012s, usr 23.667s, sys 46.578s. Notice the massive sys, which I cannot understand.

@fstachura
Collaborator Author

fstachura commented Feb 6, 2025

@tleb On how many tags did you test? I decreased the chunksize in my script to 100 (the "1000" argument in the update_version call) and it seems to be running much faster now, without much performance difference between our scripts. But I also ran it only on a single tag. The chunksize calculation in yours makes much more sense, though.
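For context, a common chunksize heuristic (an illustrative sketch of the general idea, not necessarily what either script does) is to size chunks so that each worker gets several of them, which avoids one long-running chunk leaving the other workers idle at the end:

import os

def compute_chunksize(num_items, num_workers=None, chunks_per_worker=4):
    # Several chunks per worker means a slow chunk near the end does not
    # leave the remaining workers with nothing to do.
    num_workers = num_workers or os.cpu_count() or 1
    return max(1, num_items // (num_workers * chunks_per_worker))

print(compute_chunksize(5000, num_workers=8))   # -> 156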

@Daniil159x I'm not sure if async would help much here. It maybe would if threads were blocked on I/O most of the time, but both my script and @tleb's do database I/O on a single thread. There is also I/O related to reading from other processes (a lot of processing happens in script.sh/ctags/git), but again, it's not clear to me if that's the bottleneck. async also has some overhead AFAIK. On the other hand, I do see that neither of the scripts achieves 100% CPU utilization, it always hovers around 99%, so maybe?

@fstachura
Collaborator Author

With the chunksize calculation from @tleb's script (already on my faster-update branch), the performance of the scripts is very similar, at least on my machine:

Benchmark 1: TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     82.743 s ±  0.109 s    [User: 125.653 s, System: 88.404 s]
  Range (min … max):   82.623 s … 82.835 s    3 runs
 
Benchmark 2: TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     41.873 s ±  0.280 s    [User: 154.509 s, System: 132.848 s]
  Range (min … max):   41.685 s … 42.195 s    3 runs

Benchmark 3: TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
  Time (mean ± σ):     45.776 s ±  0.854 s    [User: 154.594 s, System: 138.161 s]
  Range (min … max):   45.131 s … 46.745 s    3 runs
 
Summary
  TLEB_UPDATE=update-ng.py TLEB_NO_FETCH=1 ./utils/index ./data musl ran
    1.09 ± 0.02 times faster than TLEB_UPDATE=update-franek.py TLEB_NO_FETCH=1 ./utils/index ./data musl
    1.98 ± 0.01 times faster than TLEB_UPDATE=update.py TLEB_NO_FETCH=1 ./utils/index ./data musl

elixir/update.py Outdated
if hash in self.hash_to_idx:
    return self.hash_to_idx[hash]
else:
    return self.db.blob.get(hash)
Member

Why do you have to look both locally and in the DB? Why not do only one or the other?

Collaborator Author

Some data, like the hash -> idx and idx -> hash/filename mappings, and whatever was in vers.db, is not saved into the database until the update process finishes. I was hoping to make interrupting the update process a bit safer thanks to that (although I'm actually not 100% sure anymore that default Berkeley DB can handle interruptions without breaking).

The idea is pretty basic: refs/defs added in an interrupted update won't have entries in the hash/filename/blob databases.
numBlobs is updated first, permanently reserving id space for the blobs currently being processed. An interrupted update might leave entries with unknown blob ids, but AFAIK this is handled gracefully by the backend (and it definitely could be if it's not). The unknown entries could be garbage collected later.
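A minimal sketch of that deferred-write idea (the names here are illustrative, modelled on the snippet above rather than taken from the PR):

class PendingBlobIndex:
    """Illustrative only: new hash -> idx mappings are kept in memory and
    only written out once the whole update has finished."""

    def __init__(self, blob_db):
        self.blob_db = blob_db        # persistent hash -> idx mapping
        self.hash_to_idx = {}         # pending entries, not yet persisted

    def get_blob_idx(self, blob_hash):
        # Check pending entries first, then fall back to whatever a
        # previous, completed update already persisted.
        if blob_hash in self.hash_to_idx:
            return self.hash_to_idx[blob_hash]
        return self.blob_db.get(blob_hash)

    def flush(self):
        # Called only after indexing succeeded, so an interrupted run
        # leaves the persistent database untouched.
        self.blob_db.update(self.hash_to_idx)
        self.hash_to_idx.clear()

# A plain dict stands in for the Berkeley DB handle in this sketch.
index = PendingBlobIndex({"abc123": 0})
index.hash_to_idx["def456"] = 1
print(index.get_blob_idx("abc123"), index.get_blob_idx("def456"))   # 0 1
index.flush()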

Member

@tleb tleb Feb 14, 2025

Can we do the processing on a version-per-version basis? That requires storing all key-value pairs to be updated somewhere: either in memory or in an append-only file. Then, once we are done indexing the file, we do an "update database" step that does all the writes.

That would avoid any database issue caused by indexing raising an error. Also, it removes all DB ops from the indexing functions, which can then focus purely on calling ctags or whatever.

So pseudocode would be like:

for version_name in versions_todo:
	all_blobs = ...
	new_blobs = ...
	new_defs = compute_new_defs_multithreaded(version_name, new_blobs)
	list_of_defs_in_version = find_all_defs_in_all_blobs(all_blobs)
	new_refs = find_new_refs_multithreaded(version_name, list_of_defs_in_version)
	# same thing for all types of values we have

	# OK, we are done, we can update the database with new_defs, new_refs, etc.
	save_defs(new_defs)
	save_refs(new_refs)
	# ...

elixir/update.py Outdated
for idx, path in buf:
    obj.append(idx, path)

state.db.vers.put(state.tag, obj, sync=True)
Member

Why is part of the "add to databases" done in UpdatePartialState and another part is done here?

Collaborator Author

I separate the parts that can have garbage entries left by an unfinished update from the parts that cannot. vers is used to tell whether a tag was already indexed. That's also (partially) why using a database while it's being updated was such a problem.

Member

@tleb tleb left a comment

Overall, my complaint is that the code is not linear. That makes it really hard to understand, and I've been reading and writing Elixir indexing code fairly recently. I can't imagine how I or others will fare in two years' time.

As said in some comments, I'd expect the code to look more like:

x = get_required_values()

y = compute_y(x)
z = compute_z(x, y)

cleanup_foobar()

If things need to be done in parallel, it should be a single call to a pool.*map*() function. That makes the execution flow (what happens when) and resource management (what is used/freed when) easy to understand. Futures & co are nice features, but they make the code obtuse.
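A minimal sketch of that shape, with a toy token scan standing in for the real ctags/script.sh work (function names and data are made up for illustration, not taken from the PR):

from multiprocessing import Pool

def parse_blob(blob):
    # Pure worker: takes one blob and returns a computed value. There is
    # no database access and no scheduling knowledge in here.
    filename, contents = blob
    defs = [tok for tok in contents.split() if tok.isidentifier()]
    return filename, defs

def index_version(blobs, processes=4):
    results = {}
    with Pool(processes) as pool:
        # One map call per step: the pool handles scheduling, and only
        # the main process consumes the results, so only it would ever
        # touch the database (a dict stands in for it here).
        for filename, defs in pool.imap_unordered(parse_blob, blobs, chunksize=16):
            results[filename] = defs
    return results

if __name__ == "__main__":
    blobs = [("a.c", "static int foo"), ("b.c", "void bar")]
    print(index_version(blobs, processes=2))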

Something else not touched on in the comments (but discussed in real life :-) ) is that the logs are not useful. I haven't run this version (please rebase on master, which has the utils/index script, to make testing easier), but the code says it prints not-so-useful things.

A good starting point for logs would be a summary at the end of each indexed tag. Something like (ideally aligned):

tag v2.6: N blobs, M defs, O refs, P compatibles, S seconds

One last thing about logging: make sure not to lose errors! There are try/except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. The script must do what it can (maybe stop all processing of the current version and try indexing other versions).

@fstachura
Collaborator Author

fstachura commented Feb 12, 2025

Thanks for the review!

please rebase on master, which has the utils/index script, to make testing easier

Rebased.

One last thing about logging: make sure not to lose errors! There are try/except Exception blocks that ignore the exception and print a generic error message. That is not useful for debugging purposes. The script must do what it can (maybe stop all processing of the current version and try indexing other versions).

logging.exception also prints the exception, including its traceback. I also don't like that the exception is not explicitly passed as an argument.
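For illustration (a generic example, not the PR's code): logging.exception logs the message at ERROR level together with the traceback of the exception currently being handled, so the error is not lost even though the exception object is never passed explicitly:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("update")

def index_tag(tag):
    try:
        raise RuntimeError("ctags failed")   # stand-in for the real work
    except Exception:
        # Logs the message *and* the active exception's traceback, then
        # lets the caller carry on with the next tag.
        logger.exception("indexing %s failed, skipping tag", tag)

index_tag("v1.2.5")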

I explained why the code is not linear in one of the review comments.

I thought I would state the design goals. We should've discussed that earlier. I think we agree on most of this, but some of it is up for discussion.

From the current script:

  • Avoid threads (GIL)
  • Make scheduling more effective (tag locks, a long tail that runs on only a single thread)
  • Separate scheduling from computation (this also refers to your ideas to compute an index update, and apply it later, maybe with another program)
  • Make logs more meaningful
  • Do not write to the database from multiple threads
  • Make update script crashes/interruptions safer

If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.

@tleb
Member

tleb commented Feb 14, 2025

I thought I would state the design goals. We should've discussed that earlier. I think we agree on most of this, but some of it is up for discussion.

From the current script:

  • Avoid threads (GIL)
  • Make scheduling more effective (tag locks, a long tail that runs on only a single thread)

About scheduling, we know the processing tree we want.

  • One main thread is responsible for spawning everything.
  • It has child processes (or threads but I don't see the point) that do all the CPU work.
  • Those should output to some IPC (like stdout) that gets retrieved by the main process to update the database.
  • Separate scheduling from computation (this also refers to your ideas to compute an index update, and apply it later, maybe with another program)

Yes! That is exactly what the pool.*map*() abstractions give us. One pure function takes a blob and returns the computed value. It cannot be bothered by scheduling because it doesn't know about it. It should have access to no (or almost no) shared resources.

That makes one assumption: we go version after version, task after task. I don't see how that constraint could slow down the computation.

  • Make logs more meaningful

Yes! A summary per tag is enough and will be much more useful than what we have currently.

  • Do not write to the database from multiple threads

Yes, that is a major limitation. It is the main justification for "the one main thread does all the database operations".

  • Make update script crashes/interruptions safer

An idea brought up in the code review comments above: we could do all the writes related to a version at the end of processing that version. We would store the key-value pairs to update either in memory or in an append-only file. That way, we write to the database only if indexing was successful.
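A rough sketch of the in-memory variant (names and the toy per-blob computation are made up for illustration):

def compute_entries(version, blob):
    # Toy stand-in for the real per-blob work (ctags, refs, docs, ...).
    filename, contents = blob
    return [((version, filename, tok), "def") for tok in contents.split()]

def index_version_buffered(db, version, blobs):
    pending = {}   # key -> value pairs to write, kept in memory for now

    for blob in blobs:
        # Accumulate results only: the database is not touched here, so
        # an error or interruption at this point leaves it unchanged.
        for key, value in compute_entries(version, blob):
            pending[key] = value

    # Indexing succeeded: apply all the writes in one final step.
    db.update(pending)

db = {}   # a dict stands in for the real database in this sketch
index_version_buffered(db, "v1.2.5", [("a.c", "foo bar")])
print(db)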

If not for the clunkiness of berkeleydb, the last two points would be unnecessary. But I'm assuming we are staying with berkeleydb for now.

The above strategy is nice because it means we can do anything we want and don't depend on features from our database (like concurrent write support).

@tleb
Member

tleb commented Feb 25, 2025

We could fix #292 with this PR. For that, we must:

  • Do versions one after the other.
  • Store all defs inside the current version in memory.
  • Then pick from that memory set when we check if a ref should be added.
  • Have a procedure to extract all defs for a version N when we do iterative indexing.

That is what was done in my update.py PoC. It works but uses a lot of memory. It does fix indexing, though, and I think it is worth it!
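A minimal illustration of that approach (toy data structures, not the PoC's actual code): the defs of the version are gathered or extracted first and kept in memory, and a token is only recorded as a reference if it belongs to that set:

def collect_refs(blobs, defs_in_version):
    # A token only counts as a reference if it is defined somewhere in
    # the same version; unknown tokens are dropped, which keeps the ref
    # count consistent whether one tag or many are indexed in a run.
    refs = []
    for filename, tokens in blobs:
        refs.extend((tok, filename) for tok in tokens if tok in defs_in_version)
    return refs

defs_in_version = {"foo", "bar"}
blobs = [("a.c", ["foo", "qux"]), ("b.c", ["bar", "foo"])]
print(collect_refs(blobs, defs_in_version))   # qux is not a def, so no ref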

Avoid calling the parse-docs script, which is expensive. This heuristic
avoids running it on most files, and is almost free.

Signed-off-by: Théo Lebrun <[email protected]>
@fstachura
Collaborator Author

OK, so I wrote some more versions of the update script and benchmarked them a couple of times. Tests were done on a 4-core machine with 8 GB of RAM, indexing Linux v6.12.6.

https://github.com/fstachura/elixir/tree/faster-update-variants

Command                                   Mean [s]             Min [s]     Max [s]     Relative
ELIXIR_CACHE=1 python3 -m elixir.update   2545.016 ± 29.337    2515.303    2573.962    1.00
python3 -m elixir.gen_update              2780.758 ± 37.764    2749.192    2822.595    1.09 ± 0.02
python3 -m elixir.update                  2849.570 ± 17.191    2829.777    2860.770    1.12 ± 0.01
python3 -m elixir.update_simplified       2852.024 ± 63.585    2793.141    2919.447    1.12 ± 0.03
python3 -m elixir.file_update             3335.096 ± 13.836    3326.338    3351.047    1.31 ± 0.02
python3 update.py 128                     6676.816 ± 29.255    6659.263    6710.587    1.00

It seems that:

  • Explicitly adding a cache to the bsddb speeds up the update process (sketched below)
  • Using a generator in scriptLines, instead of reading the output of the script into an array of lines, seems to have some performance benefits
  • My version of the file-based update script (add definitions/references to an append-only file, later sort and merge keys, then add to the database to reduce database writes) does not seem to perform well. Using tmpfs usually caused the system to go OOM, and writing to network storage performed worse.

Based on that I would just go with the "simplified" version, with the berkeleydb cache.

https://github.com/fstachura/elixir/blob/faster-update-variants/elixir/update_simplified.py
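For reference, here is roughly what enabling an explicit Berkeley DB cache looks like with the berkeleydb bindings (the cache size is an arbitrary example, and this is not necessarily how the ELIXIR_CACHE=1 variant is wired up in that branch):

import berkeleydb.db as bdb

def open_btree(filename, cache_mb=256):
    handle = bdb.DB()
    # set_cachesize(gbytes, bytes, ncache) must be called before open().
    # A larger cache keeps more pages in memory during a big update,
    # cutting down on disk I/O per write.
    handle.set_cachesize(0, cache_mb * 1024 * 1024, 1)
    handle.open(filename, None, bdb.DB_BTREE, bdb.DB_CREATE)
    return handle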

fstachura added 3 commits May 30, 2025 14:23
By default ctags sorts entries. This is not useful to the update script,
but takes time.
User time for `update.py 16` on musl v1.2.5 went from 1m21.613s to
1m11.849s.
The new update script divides work into tasks scheduled across
a constant number of processes, instead of statically assigning
a single long-running task to each thread.
This results in better CPU saturation.

Database handles are not shared between threads anymore; instead,
the main thread is used to commit the results of other processes into
the database.
This trades locking on database access for serialization costs: since
multiprocessing is used, values returned from futures are pickled.
Development

Successfully merging this pull request may close these issues.

  • Inconsistent number of references if update job is ran for multiple tags at once
  • Improve indexing performance (update.py)