refactor(websoc-scraper): always scrape by chunks #320
Open
laggycomputer wants to merge 15 commits into main from
Conversation

This was linked to issues Feb 27, 2026
idk why index example won't typecheck otherwise
# Conflicts: # apps/data-pipeline/websoc-scraper/tsconfig.json
# Conflicts: # packages/db/migrations/meta/_journal.json
sanskarm7 (Collaborator) requested changes on Mar 20, 2026:
need to update writeup in some places to match actual behavior (eg section codes 00001 to 97999 and not 00000-99999)
otherwise testing is looking good for now
sanskarm7 (Collaborator) requested changes on Mar 20, 2026:
pulling a dante on dante (im scared for my life)
sanskarm7 (Collaborator) approved these changes on Mar 20, 2026:
lets get it (the money)
We simplify the Websoc scraper's data acquisition strategy.
## Current scraper strategy
The Websoc scraper currently operates on a state machine, controlled by persistent state in the `websoc_meta` table. Depending on the state stored there, one of several modes takes effect over multiple scrapes of the same term. They happen in this order:

1. The `last_scraped` column is `NULL`, indicating that a scrape of this term has never been done. The scraper makes a (pretty good) guess at the valid departments and enumerates all sections in each department, in lexicographical order of the department names. The transaction upserting each department ends with the `last_scraped` and `last_dept_scraped` columns being set appropriately. In that case, if collection of a department fails, the columns stored at intermediate steps in the previous state are sufficient information to pick up where we left off.
2. Once all departments have been collected, we update `last_scraped` and set `last_dept_scraped` to `NULL`, indicating that we are done discovering departments.
3. In feat: websoc scraper enhancements #83, we introduced a new scraping strategy, giving a new mode of operation for every scrape after the previous states are completed. Consider the sublist of elements at indices `i` of the list of known section codes `S`, sorted in increasing order, where `i` is either `0` or `1` modulo 891. Each pair `(a, b)` from this sublist forms a closed interval `[a, b]` in which there are known to be at least 891 valid section codes. We scrape this range directly, without respect to departments, because we expect this code range to contain near the maximum of 900 section codes. Scraping in this way takes fewer requests overall than scraping by department would.
If the interval `[a, b]` cannot be scraped, it is possible that this was due to more than 9 sections (1% of the limit of 900) being added in that code range between our last scrape and this current attempt. In #114, another step was added to retry this chunk by recursively bisecting it, with one of two outcomes: the smaller sub-chunks eventually succeed, or bisecting stops because failure was encountered when scraping a section code range which contains fewer than 900 valid section codes, in which case the encountered failure cannot possibly be a "too many sections" error.
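The retry step can be sketched as follows. This is a simplified, synchronous model, not the implementation: `tryScrape` and `knownCodesIn` are hypothetical stand-ins for the real scraper internals.

```typescript
// Hedged sketch of the recursive bisection retry added in #114 (simplified
// and synchronous; the real scraper works asynchronously against Websoc).
type Range = [number, number];

function bisectAndScrape(
  lo: number,
  hi: number,
  tryScrape: (lo: number, hi: number) => boolean, // one scrape attempt of [lo, hi]
  knownCodesIn: (lo: number, hi: number) => number, // known valid codes in [lo, hi]
  out: Range[] = [],
): Range[] {
  if (tryScrape(lo, hi)) {
    out.push([lo, hi]); // chunk scraped successfully as-is
    return out;
  }
  // A range with fewer than 900 known section codes cannot fail with a
  // "too many sections" error, so the failure is real: stop bisecting.
  if (knownCodesIn(lo, hi) < 900) {
    throw new Error(`scrape of [${lo}, ${hi}] failed for a non-size reason`);
  }
  const mid = Math.floor((lo + hi) / 2);
  // Visit the low half first, so chunks are scraped in increasing order.
  bisectAndScrape(lo, mid, tryScrape, knownCodesIn, out);
  bisectAndScrape(mid + 1, hi, tryScrape, knownCodesIn, out);
  return out;
}
```

For example, a range of 4000 codes under a 1000-code-per-request cap would bisect twice into four equal sub-chunks before every attempt succeeds.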
These two PRs, implemented several weeks apart, contained subtle correctness bugs:

- The sublist construction over `S` fails to consider that section codes may newly appear before the first known section code. The chunk `[0, min(S))` is never considered. The chunk `(max(S), 97999]` is also not considered if the sublist contains an even number of codes. This is issue websoc-scraper: failure to discover sections at code range extrema #317.
- If the algorithm wants to scrape chunks `[a, b]` and `[c, d]` for consecutive `a, b, c, d` in the sublist, then the chunk `(b, c)` will never be scraped. This is issue websoc-scraper: failure to discover sections between known chunks #318.
#311 will become obsolete because that code has been removed.
## New strategy
The department-wise discovery strategy is removed. The list of departments probably doesn't include departments which are now obsolete but were relevant in earlier Websoc terms, so we shouldn't have been using it anyway. If we can't be confident that the list is authoritative and exhaustive (and we can't), then we shouldn't assume that all meaningful sections belong to one of those departments.
We simplify the chunking algorithm described above to simply take every 891st element of `S` as a new upper bound. We also track the upper bound of the highest-numbered chunk we have scraped so far, and use it to form the next lower bound, instead of using an element from `S`. When all elements of the sublist are exhausted, we also scrape the remaining range after the elements of the sublist and before the maximum section code of 97999. (Note that chunks are always scraped in increasing order, even under bisection.)
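The new chunking might be sketched as follows. This is an illustration under stated assumptions, not the implementation: the starting code (00001) and the exact index arithmetic are guesses, and `newChunks` is a hypothetical name.

```typescript
// Hedged sketch of the new chunking strategy: every 891st element of the
// sorted known section codes becomes an upper bound, and each lower bound is
// one past the previous chunk's upper bound, so the chunks tile the whole
// code space with no gaps.
const CHUNK_SIZE = 891;
const MAX_SECTION_CODE = 97999;

function newChunks(knownCodes: number[]): [number, number][] {
  const sorted = [...knownCodes].sort((x, y) => x - y);
  const chunks: [number, number][] = [];
  let lo = 1; // next lower bound: one past the highest upper bound so far
  for (let i = CHUNK_SIZE - 1; i < sorted.length; i += CHUNK_SIZE) {
    chunks.push([lo, sorted[i]]);
    lo = sorted[i] + 1;
  }
  // When the sublist is exhausted, also scrape the remaining range up to the
  // maximum section code.
  if (lo <= MAX_SECTION_CODE) {
    chunks.push([lo, MAX_SECTION_CODE]);
  }
  return chunks;
}
```

With no known codes this degenerates to a single chunk covering the entire code range, and since each lower bound abuts the previous upper bound, the inter-chunk gaps of #318 cannot occur by construction.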
These changes make the `last_dept_scraped` column obsolete, so it is dropped. We also make some minor internal API changes, mainly around whether section codes are passed as strings or numbers.
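For context on the string/number point: Websoc section codes are five-digit, zero-padded strings (the review above notes the valid range 00001 to 97999). A minimal sketch of the conversion involved, with hypothetical helper names:

```typescript
// Websoc section codes are five-digit, zero-padded strings ("00001".."97999").
// These helper names are illustrative; they only show the conversion that the
// internal API change is concerned with.
function sectionCodeToNumber(code: string): number {
  return Number.parseInt(code, 10); // e.g. "00034" -> 34
}

function sectionCodeToString(code: number): string {
  return code.toString().padStart(5, "0"); // e.g. 34 -> "00034"
}
```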
Let's examine how the earlier cases in the old scraper map onto this new strategy: the initial scrape of a term, with no known section codes, now produces a single chunk spanning the whole code range. This almost always fails, but bisecting will eventually collect all sections.
## Test plan
- Verify that all section codes are collected.
- Verify that on restart, all section codes are collected.
- Verify that the correct chunks are used (there should be little or no bisecting), and that there are no codes between chunks which don't get scraped.