
refactor(websoc-scraper): always scrape by chunks#320

Open
laggycomputer wants to merge 15 commits into main from websoc-scrape-strategy

Conversation

@laggycomputer
Member

@laggycomputer laggycomputer commented Feb 27, 2026

We simplify the Websoc scraper's data acquisition strategy.

Current scraper strategy

The Websoc scraper currently operates on a state machine, controlled by persistent state in the websoc_meta table.
Depending on the state stored there, one of several modes takes effect, over multiple scrapes of the same term.
They happen in this order:

  1. The last_scraped column is NULL, indicating a scrape of this term has never been done.
    The scraper makes a (pretty good) guess at the valid departments and enumerates all sections in each department, in lexicographical order of the department names.
    The transaction upserting each department ends with the last_scraped and last_dept_scraped columns being set appropriately.
  2. The scraper occasionally fails to enumerate all departments in the time allotted.
    In that case, if collection of a department fails, the columns stored at intermediate steps in the previous state are sufficient information to pick up where we left off.
  3. If the list of departments to newly scrape for this term has been exhausted, we bump last_scraped and set last_dept_scraped to NULL, indicating that we are done discovering departments.
  4. Previously, the scraper would jump from step 3 back to step 1, enumerating all departments once more from the beginning of the list.
    In feat: websoc scraper enhancements #83, we introduced a new scraping strategy, giving a new mode of operation for every scrape after the previous states are completed:
  • We have a list of section codes known to correspond to valid sections in this term, call it S, sorted in increasing order.
  • If the elements are indexed from 1, take every element whose index i is either 0 or 1 modulo 891.
  • Each pair of elements (a, b) from this sublist forms a closed interval [a, b] in which there are known to be at least 891 valid section codes.
    We scrape this range directly, without respect to departments, because we expect this code range to contain close to the maximum of 900 section codes.
    Scraping this way requires fewer requests overall than scraping by department would.
  • If there were an odd number of elements in the sublist, then the last element is paired with 97999 to form a final range, which is then scraped.

If the interval [a, b] cannot be scraped, it is possible that this was due to more than 9 sections (1% of the limit of 900) being added in that code range between our last scrape and this current attempt.
In #114, another step was added to retry this chunk by recursively bisecting it, with one of two outcomes:

  • There is some other reason for failure, and all attempts to get any section code range, even one whose response would contain under 900 sections, will fail.
    Bisecting stops if failure is encountered when scraping a section code range which contains fewer than 900 valid section codes, because, for that chunk, the encountered failure cannot possibly be a "too many sections" error.
  • After one or more bisections, fewer than 900 sections are returned, a chunk is successfully scraped, and we may continue.

These two PRs, implemented several weeks apart, contained subtle correctness bugs:

  1. Observe how the algorithm which computes chunks from S fails to consider that section codes may newly appear before the first known section code.
    The chunk [0, min(S)) is never considered.
    The chunk (max(S), 97999] is also not considered if the sublist contains an even number of codes.
    This is issue websoc-scraper: failure to discover sections at code range extrema #317.
  2. The algorithm also ignores section codes which may appear between elements of the sublist which are not paired together.
    If the algorithm wants to scrape chunks [a, b] and [c, d] for consecutive a, b, c, d in the sublist, then chunk (b, c) will never be scraped.
    This is issue websoc-scraper: failure to discover sections between known chunks #318.
  3. There is a minor performance issue in Save one database round-trip in websoc-scraper #312 because the sublist is computed too eagerly, before we have even determined that we need it.

#311 will become obsolete because that code has been removed.

New strategy

The department-wise discovery strategy is removed.
The list of departments probably doesn't include departments which are now obsolete but were relevant in earlier Websoc terms, so we shouldn't have used it, anyway.
If we can't be confident that it's authoritative and exhaustive (which we can't), then we shouldn't assume that all meaningful sections belong to one of those departments.

We simplify the chunking algorithm described above: every 891st element of S becomes a new chunk upper bound.
We also track the upper bound of the highest-numbered chunk we have scraped so far, and use it to form the next lower bound, instead of using an element from S.
When the sublist is exhausted, we also scrape the remaining range between its last element and the maximum section code of 97999.
(Note that chunks are always scraped in increasing order, even under bisection.)
These changes make the last_dept_scraped column obsolete, so it is dropped.

We also make some minor internal API changes, mainly around whether section codes are passed as strings or numbers.

Let's examine how the earlier cases in the old scraper map onto this new strategy:

  • If no section codes are known for this term, then the sublist will be empty, and the scraper will attempt to snapshot chunk 00001-97999.
    This almost always fails, but bisecting will eventually collect all sections.
  • Even if we failed to acquire all section codes on the first run, the new chunking strategy is exhaustive so we would eventually manage to snapshot, potentially with bisection, the missing range.
  • Once almost all section codes are known, we can snapshot nearly perfectly sized chunks without issue.

Test plan

  • Scrape a term with no data initially stored.
    Verify that all section codes are collected.
  • Do the same, but interrupt the scraper midway.
    Verify that on restart, all section codes are collected.
  • After all section codes are known, scrape again.
    Verify that the correct chunks are used (there should be little or no bisecting), and that there are no codes between chunks which don't get scraped.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code involves a change to the database schema.
  • My code requires a change to the documentation.

@laggycomputer laggycomputer linked an issue Feb 28, 2026 that may be closed by this pull request
Collaborator

@sanskarm7 sanskarm7 left a comment

need to update writeup in some places to match actual behavior (eg section codes 00001 to 97999 and not 00000-99999)

otherwise testing is looking good for now

Collaborator

@sanskarm7 sanskarm7 left a comment
pulling a dante on dante (im scared for my life)

Collaborator

@sanskarm7 sanskarm7 left a comment
lets get it (the money)
