
refactor(websoc-scraper): always scrape by chunks#320

Open
laggycomputer wants to merge 15 commits into main from websoc-scrape-strategy

Conversation

@laggycomputer
Member

@laggycomputer laggycomputer commented Feb 27, 2026

We simplify the Websoc scraper's data acquisition strategy.

Current scraper strategy

The Websoc scraper currently operates on a state machine, controlled by persistent state in the websoc_meta table.
Depending on the state stored there, one of several modes takes effect, over multiple scrapes of the same term.
They happen in this order:

  1. The last_scraped column is NULL, indicating a scrape of this term has never been done.
    The scraper makes a (pretty good) guess at the valid departments and enumerates all sections in each department, in lexicographical order of the department names.
    The transaction upserting each department ends with the last_scraped and last_dept_scraped columns being set appropriately.
  2. The scraper occasionally fails to enumerate all departments in the time allotted.
    In that case, if collection of a department fails, the columns stored at intermediate steps in the previous state are sufficient information to pick up where we left off.
  3. If the list of departments to newly scrape for this term has been exhausted, we bump last_scraped and set last_dept_scraped to NULL, indicating that we are done discovering departments.
  4. Previously, the scraper would jump from step 3 back to step 1, enumerating all departments once more from the beginning of the list.
    In feat: websoc scraper enhancements #83, we introduced a new scraping strategy, giving a new mode of operation for every scrape after the previous states are completed:
  • We have a list of section codes known to correspond to valid sections in this term, call it S, sorted in increasing order.
  • If the elements are indexed from 1, take every element whose index i is either 0 or 1 modulo 891.
  • Each pair of elements (a, b) from this sublist forms a closed interval [a, b] in which there are known to be at least 891 valid section codes.
    We scrape this range directly, without respect to departments, because we expect this code range to contain close to the maximum of 900 section codes.
    Scraping this way requires fewer requests overall than scraping by department would.
  • If there were an odd number of elements in the sublist, then the last element is paired with 97999 to form a final range, which is then scraped.

If the interval [a, b] cannot be scraped, it is possible that this was due to more than 9 sections (1% of the limit of 900) being added in that code range between our last scrape and this current attempt.
In #114, another step was added to retry this chunk by recursively bisecting it, with one of two outcomes:

  • There is some other reason for failure, and all attempts to get any section code range, even one whose response would contain under 900 sections, will fail.
    Bisecting stops if failure is encountered when scraping a section code range which contains fewer than 900 valid section codes, because, for that chunk, the encountered failure cannot possibly be a "too many sections" error.
  • After one or more bisections, fewer than 900 sections are returned, a chunk is successfully scraped, and we may continue.

These two PRs, implemented several weeks apart, contained subtle correctness bugs:

  1. Observe how the algorithm which computes chunks from S fails to consider that section codes may newly appear before the first known section code.
    The chunk [0, min(S)) is never considered.
    The chunk (max(S), 97999] is also not considered if the sublist contains an even number of codes.
    This is issue websoc-scraper: failure to discover sections at code range extrema #317.
  2. The algorithm also ignores section codes which may appear between elements of the sublist which are not paired together.
    If the algorithm wants to scrape chunks [a, b] and [c, d] for consecutive a, b, c, d in the sublist, then chunk (b, c) will never be scraped.
    This is issue websoc-scraper: failure to discover sections between known chunks #318.
  3. There is a minor performance issue in Save one database round-trip in websoc-scraper #312 because the sublist is computed too eagerly, before we have even determined that we need it.

#311 will become obsolete because that code has been removed.

New strategy

The department-wise discovery strategy is removed.
The list of departments probably doesn't include departments which are now obsolete but were relevant in earlier Websoc terms, so we shouldn't have used it, anyway.
If we can't be confident that it's authoritative and exhaustive (which we can't), then we shouldn't assume that all meaningful sections belong to one of those departments.

We simplify the chunking algorithm described above: every 891st element of S becomes a new chunk upper bound.
We also track the upper bound of the highest-numbered chunk we have scraped so far, and use it to form the next lower bound, instead of using an element from S.
When the sublist is exhausted, we also scrape the remaining range between its last element and the maximum section code of 97999.
(Note that chunks are always scraped in increasing order, even under bisection.)
These changes make the last_dept_scraped column obsolete, so it is dropped.

We also make some minor internal API changes, mainly around whether section codes are passed as strings or numbers.

Let's examine how the earlier cases in the old scraper map onto this new strategy:

  • If no section codes are known for this term, then the sublist will be empty, and the scraper will attempt to snapshot chunk 00001-97999.
    This almost always fails, but bisecting will eventually collect all sections.
  • Even if we failed to acquire all section codes on the first run, the new chunking strategy is exhaustive so we would eventually manage to snapshot, potentially with bisection, the missing range.
  • Once almost all section codes are known, we can snapshot nearly perfectly sized chunks without issue.

Test plan

  • Scrape a term with no data initially stored.
    Verify that all section codes are collected.
  • Do the same, but interrupt the scraper midway.
    Verify that on restart, all section codes are collected.
  • After all section codes are known, scrape again.
    Verify that the correct chunks are used (there should be little or no bisecting), and that there are no codes between chunks which don't get scraped.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code involves a change to the database schema.
  • My code requires a change to the documentation.

@laggycomputer laggycomputer linked an issue Feb 28, 2026 that may be closed by this pull request
Collaborator

@sanskarm7 sanskarm7 left a comment

need to update writeup in some places to match actual behavior (eg section codes 00001 to 97999 and not 00000-99999)

otherwise testing is looking good for now

Collaborator

@sanskarm7 sanskarm7 left a comment
pulling a dante on dante (im scared for my life)

Collaborator

@sanskarm7 sanskarm7 left a comment
lets get it (the money)
