fix: missing or confused instructor data by ParzivalPerhaps · Pull Request #303 · icssc/anteater-api

ParzivalPerhaps · 2026-02-09T23:59:23Z

Description

This fix responds to #184, which reports that some instructors are confused for one another and that others are missing.

A brief overview of the problems with our previous approach

The problem comes down to issues in the two scrapers primarily responsible for updating instructor data: the websoc scraper and the instructor scraper. The websoc scraper previously did not distinguish between people who share the same UCI-normalized name (LAST, FIRST_INITIAL), of which there are quite a few (about 3,551 untracked entries). It was also failing to track some instructors for classes because of an odd use of Sets.

The next part of the problem is how the instructor scraper assigns a UCINetID to these normalized names. Previously we tracked only the name, so that was all we had to go on, and common names among instructors were almost never reliably matched to an entry in the instructor table. A number of titles were also missing from the search terms, such as the Unit 18 Faculty title.

The last part of the problem is that we need to retroactively reattribute classes to the professors who are missing them.

Solution

I addressed these problems in a few steps. The first is changing how we store websoc instructor names: instead of only the normalized name, the name column now holds a delimited string that also includes the academic school. It looks like this:

Before (`websoc_instructor` row as JSON)

{
  "name": "JONES, E.G.",
  "updated_at": "2024-10-16 03:06:25.759624+00"
}

After (`websoc_instructor` row as JSON)

{
  "name": "THORNTON, A.&|*Donald Bren School of Information and Computer Sciences&|*",
  "updated_at": "2026-02-05 10:47:32.04+00",
  "identifier": "THORNTON, A.",
  "school": "Donald Bren School of Information and Computer Sciences",
  "department": "Information and Computer Science"
}
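
As a rough illustration of the format above, the following sketch builds and parses the delimited `name` value. The helper names (`toDelimitedName`, `parseName`) and the treatment of legacy rows are assumptions made for this example, not the PR's actual code; only the `&|*` delimiter and the field names come from the rows shown.

```typescript
// Delimiter as it appears in the "After" row above.
const DELIMITER = "&|*";

// Hypothetical type for the two parts packed into the `name` column.
interface WebsocInstructorKey {
  identifier: string; // UCI-normalized name, e.g. "THORNTON, A."
  school: string; // academic school the instructor teaches under
}

// Build the delimited string: identifier, delimiter, school, delimiter.
function toDelimitedName(key: WebsocInstructorKey): string {
  return `${key.identifier}${DELIMITER}${key.school}${DELIMITER}`;
}

// Parse a `name` value. Legacy rows contain no delimiter, so the whole
// string is treated as the identifier and the school is reported as null.
function parseName(name: string): { identifier: string; school: string | null } {
  const parts = name.split(DELIMITER);
  if (parts.length < 2) {
    return { identifier: name, school: null };
  }
  return { identifier: parts[0] ?? name, school: parts[1] ?? null };
}
```

Under these assumptions, a legacy value such as "JONES, E.G." parses to an identifier with a null school, which is one way a consumer could tell the two formats apart.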

The reason I didn't drop the primary key for the websoc_instructor table is that we have many active foreign key constraints involving the name column, and we don't want to lose valuable historical data in the websoc_section_to_instructor table.

I ended up replacing the Set-based implementation for collecting the unique instructors of each section with a less elegant array-filter implementation that behaves far more robustly. In my testing, the websoc scraper's performance did not appear degraded or otherwise changed.
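
The description does not say exactly how the Set was misused. One common pitfall is that a JavaScript `Set` compares objects by reference, so structurally equal objects are not collapsed. The sketch below shows an array-filter deduplication keyed on identifier and school; the `SectionInstructor` type and `uniqueInstructors` function are hypothetical names for illustration, not the scraper's code.

```typescript
// Hypothetical shape of an instructor attached to a section.
interface SectionInstructor {
  identifier: string; // UCI-normalized name
  school: string; // academic school
}

// Keep only the first occurrence of each (identifier, school) pair.
// findIndex returns the position of the first structural match, so an
// element survives the filter only if it is that first match.
function uniqueInstructors(all: SectionInstructor[]): SectionInstructor[] {
  return all.filter(
    (candidate, index) =>
      all.findIndex(
        (other) =>
          other.identifier === candidate.identifier &&
          other.school === candidate.school,
      ) === index,
  );
}
```

This is quadratic in the number of instructors passed in, which is one reason to check performance as described above.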

For the instructor scraper, the primary change is that it now uses the newly tracked department information along with a list of titles. Besides the previously mentioned Unit 18 and Faculty, the list includes more niche titles such as Graduate, for instructors with a "Graduate xyz Office" title who teach certain classes. The internal logic is otherwise fairly similar; the delay between requests has been reduced from 1000 ms to 800 ms.
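
To make the matching idea concrete, here is a minimal sketch of choosing a UCINetID from directory results using both title keywords and department. The `DirectoryEntry` shape, the `pickUcinetid` function, and the rule of returning null on ambiguity are assumptions for illustration; only the three title keywords are taken from the description, and the scraper's full list is not shown there.

```typescript
// Title keywords named in the description above.
const TITLE_KEYWORDS: readonly string[] = ["Faculty", "Unit 18", "Graduate"];

// Hypothetical shape of one directory search result.
interface DirectoryEntry {
  ucinetid: string;
  title: string;
  department: string;
}

// Return a UCINetID only when exactly one entry has a matching title
// keyword and belongs to the department recorded for the instructor.
function pickUcinetid(entries: DirectoryEntry[], department: string): string | null {
  const matches = entries.filter(
    (entry) =>
      entry.department === department &&
      TITLE_KEYWORDS.some((keyword) => entry.title.includes(keyword)),
  );
  const [only] = matches;
  return matches.length === 1 && only !== undefined ? only.ucinetid : null;
}
```

Returning null when zero or several entries match is a choice of this sketch: it leaves the name unassigned instead of guessing between people who share a normalized name.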

Some small adjustments had to be made to the websoc service and the instructor view schema so that they use the delimited name in some cases. I suspect more such cases exist.

The last step in this solution is the instructor-association-resolver, which rewrites legacy websoc_section_to_instructor rows to the new naming scheme and thereby resolves the confused courses between, for example, berga1 and bergac. **IMPORTANT NOTE:** Instructors who have not appeared in this quarter's websoc will not have the new name format in websoc_instructor, so legacy and newly formatted instructors will both appear in the database. All API functionality should still work as expected (as far as I have found during testing).
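
One way such a resolution step could work is sketched below. Everything here is hypothetical: the row shape, the `sectionSchool` lookup, and the `resolveAssociation` function are assumptions for illustration, and the actual resolver may use different inputs and rules. The sketch also shows why rows for instructors without a new-format entry stay in the legacy form.

```typescript
const DELIMITER = "&|*";

// Hypothetical shape of a websoc_section_to_instructor row.
interface SectionAssociation {
  sectionId: string;
  instructorName: string;
}

// Re-point a legacy row at the new-format name when the section's school is
// known and a matching new-format instructor exists; otherwise leave it as-is.
function resolveAssociation(
  row: SectionAssociation,
  sectionSchool: Map<string, string>, // assumed lookup: section id -> school
  knownNames: Set<string>, // assumed set of new-format names already stored
): SectionAssociation {
  if (row.instructorName.includes(DELIMITER)) {
    return row; // already in the new format
  }
  const school = sectionSchool.get(row.sectionId);
  if (school === undefined) {
    return row; // nothing to disambiguate with
  }
  const candidate = `${row.instructorName}${DELIMITER}${school}${DELIMITER}`;
  return knownNames.has(candidate) ? { ...row, instructorName: candidate } : row;
}
```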

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code involves a change to the database schema.
My code requires a change to the documentation.

apps/data-pipeline/instructor-association-resolver/src/index.ts

laggycomputer

merge conflict

laggycomputer

it is not normative to leave personal authorship messages and metadata

ParzivalPerhaps · 2026-03-04T05:09:22Z

> it is not normative to leave personal authorship messages and metadata

I don't think it's that deep, but maybe that's just me. After working through the websoc scraper, I found myself wishing there was higher-level documentation or an explanation of the overall idea, to shorten the time to productivity for new people working on it.

apps/data-pipeline/instructor-scraper/src/index.ts

apps/data-pipeline/websoc-scraper/src/lib.ts

packages/db/migrations/meta/_journal.json

laggycomputer

we discussed on call

ParzivalPerhaps added 3 commits December 25, 2025 04:19

Worked on websoc_instructor to instructor mismatch by improving instr…

f46507d

…uctor scraper and relevant parts of websoc scraper. Updated .gitignore to ignore SQL sessions for those who use the VSCode SQL plugin. Temporarily committing schema changes while finishing up improvements

Merge remote-tracking branch 'origin/main' into fix/missing-or-confus…

6e0e4ca

…ed-instructor-data Merged changes from main

websoc is punishment for the sins of man

7b83756

ParzivalPerhaps self-assigned this Feb 9, 2026

laggycomputer linked an issue Feb 10, 2026 that may be closed by this pull request

[websoc] Intermittent loss of GE columns #302

Open

laggycomputer removed a link to an issue Feb 10, 2026

[websoc] Intermittent loss of GE columns #302

Open

laggycomputer linked an issue Feb 10, 2026 that may be closed by this pull request

Inaccurate/missing instructor data #184

Open

laggycomputer reviewed Feb 17, 2026

apps/data-pipeline/instructor-association-resolver/src/index.ts (outdated)

ParzivalPerhaps marked this pull request as ready for review February 17, 2026 01:42

laggycomputer requested changes Feb 17, 2026

migrate changed from main, we still hate websoc

7247aae

ParzivalPerhaps had a problem deploying to staging-303 March 4, 2026 03:20 — with GitHub Actions Error

finished migrating changes

81305ba

ParzivalPerhaps temporarily deployed to staging-303 March 4, 2026 03:21 — with GitHub Actions Inactive

ParzivalPerhaps requested a review from laggycomputer March 4, 2026 05:05

laggycomputer requested changes Mar 4, 2026

Added requested documentation and formatting fixes

5ca2196

ParzivalPerhaps temporarily deployed to staging-303 March 5, 2026 22:30 — with GitHub Actions Inactive

Changed let variable iterator to const

45abe59

ParzivalPerhaps temporarily deployed to staging-303 March 5, 2026 22:34 — with GitHub Actions Inactive

fixed overwritten drizzle journal timestamps

3ae81a1

ParzivalPerhaps deployed to staging-303 March 5, 2026 22:37 — with GitHub Actions View deployment

ParzivalPerhaps requested a review from laggycomputer March 5, 2026 22:38

laggycomputer requested changes Mar 12, 2026

ParzivalPerhaps wants to merge 8 commits into main from fix/missing-or-confused-instructor-data
