fix: missing or confused instructor data #303
Open
ParzivalPerhaps wants to merge 8 commits into main from
…uctor scraper and relevant parts of websoc scraper. Updated .gitignore to ignore SQL sessions for those who use the VSCode SQL plugin. Temporarily committing schema changes while finishing up improvements
…ed-instructor-data Merged changes from main
apps/data-pipeline/instructor-association-resolver/src/index.ts
laggycomputer
requested changes
Mar 4, 2026
Member
laggycomputer
left a comment
it is not normative to leave personal authorship messages and metadata
Contributor
Author
I don't think it's that deep, but maybe that's just me. After working through the websoc scraper, I found myself wishing there were more high-level documentation explaining the overall idea, to shorten the time to productivity for new people working on it.
Description
This fix comes as a response to #184, which reports multiple instructors confused for one another and others that are missing.
A brief overview of the problems with our previous approach
This problem comes down to issues in the two primary scrapers responsible for updating instructor data: the websoc scraper and the instructor scraper.

The websoc scraper previously did not distinguish between people with the same UCI-normalized name (`LAST, FIRST_INITIAL`), of which there are quite a few (about 3,551 untracked entries). It was also failing to track some instructors for classes because of the way it used Sets.

The next part of the problem is how we use the instructor scraper to assign a UCINetID to these normalized names. Previously we tracked only the name, so that was all we had to go on. As a result, common names among instructors were almost never reliably assigned to an entry in the instructor table. A number of titles were also missing from the search terms, such as the `Unit 18 Faculty` title.

The last part of the problem is that we need to retroactively attribute classes to the professors who are missing them.
Solution
I solved these problems in a few steps. The first is changing the format in which we store websoc instructor name data: instead of only the normalized name, we now store a delimited string that also includes the academic school. It looks like this:

Before (`websoc_instructor` row as JSON):

    { "name": "JONES, E.G.", "updated_at": "2024-10-16 03:06:25.759624+00" }

After (`websoc_instructor` row as JSON):

    {
      "name": "THORNTON, A.&|*Donald Bren School of Information and Computer Sciences&|*",
      "updated_at": "2026-02-05 10:47:32.04+00",
      "identifier": "THORNTON, A.",
      "school": "Donald Bren School of Information and Computer Sciences",
      "department": "Information and Computer Science"
    }

The reason I didn't drop the primary key for the `websoc_instructor` table is that we have a lot of active foreign key constraints involving the `name` column, and we don't want to lose any valuable historical data in the `websoc_section_to_instructor` table.

I ended up replacing the Set-based implementation for getting all the unique instructors for a section with a less elegant array-filter implementation that is far more robust. The websoc scraper's performance did not seem degraded or changed in my testing.
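To make the new name format concrete, here is a minimal sketch of how such a delimited name could be built and parsed, plus a filter-based dedupe. Only the `&|*` delimiter and the example values come from the rows above. The helper names (`buildDelimitedName`, `parseInstructorName`, `uniqueInstructors`) and the `SectionInstructor` shape are hypothetical and are not taken from the scraper's code.

```typescript
// Delimiter as it appears in the example row above.
const DELIMITER = "&|*";

interface ParsedInstructorName {
  identifier: string; // UCI-normalized name, e.g. "THORNTON, A."
  school: string | null; // null for legacy rows that only stored the name
}

// Hypothetical helper: join identifier and school the way the example row does.
function buildDelimitedName(identifier: string, school: string): string {
  return `${identifier}${DELIMITER}${school}${DELIMITER}`;
}

// Hypothetical helper: accept both legacy ("JONES, E.G.") and new-format names.
function parseInstructorName(name: string): ParsedInstructorName {
  if (!name.includes(DELIMITER)) return { identifier: name, school: null };
  const [identifier = "", school = ""] = name.split(DELIMITER);
  return { identifier, school: school === "" ? null : school };
}

// Hypothetical shape of an instructor attached to a section.
interface SectionInstructor {
  identifier: string;
  school: string;
}

// Filter-based dedupe: keep the first occurrence of each (identifier, school)
// pair. Fields are compared by value, so equal-looking objects collapse into one.
function uniqueInstructors(instructors: SectionInstructor[]): SectionInstructor[] {
  return instructors.filter(
    (inst, index) =>
      instructors.findIndex(
        (other) => other.identifier === inst.identifier && other.school === inst.school,
      ) === index,
  );
}
```

Comparing fields explicitly is quadratic in the number of instructors per section, which should be negligible at that size. A name with no delimiter parses as a legacy entry, so both formats can coexist in the table.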
For the instructor scraper, the primary change is that it now uses the newly tracked department information along with a list of titles that includes more niche ones, such as `Graduate` for instructors with the "Graduate xyz Office" title who teach certain classes, along with the previously mentioned `Unit 18` and `Faculty`. The internal logic is fairly similar; the delay between requests has been reduced from `1000ms` to `800ms`.

Some small adjustments had to be made to the websoc service and the instructor view schema to use the delimited name in some cases. I suspect more of these cases may exist.
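As a rough illustration of the instructor scraper change described above, the sketch below narrows directory results by title and department, then paces requests at 800 ms. It is not the scraper's actual code. The `DirectoryCandidate` shape, the `search` callback, the function names, and the rule of only accepting a single unambiguous match are all assumptions. Only the three title terms and the delay value come from this description.

```typescript
// Only the three terms named in this PR; the real list is assumed to be longer.
// Word boundaries keep "Graduate" from matching "Undergraduate" (a choice made here).
const TITLE_PATTERNS = [/\bfaculty\b/i, /\bunit 18\b/i, /\bgraduate\b/i];

// Hypothetical shape of one directory search result.
interface DirectoryCandidate {
  ucinetid: string;
  title: string;
  department: string;
}

function hasInstructorTitle(title: string): boolean {
  return TITLE_PATTERNS.some((pattern) => pattern.test(title));
}

// Accept a candidate only when exactly one result has an instructor-like title
// and the expected department; ambiguous or empty results resolve to null.
function pickCandidate(
  candidates: DirectoryCandidate[],
  department: string,
): DirectoryCandidate | null {
  const wanted = department.trim().toLowerCase();
  const [first, ...rest] = candidates.filter(
    (c) => hasInstructorTitle(c.title) && c.department.trim().toLowerCase() === wanted,
  );
  return first !== undefined && rest.length === 0 ? first : null;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// `search` stands in for whatever directory lookup the scraper performs.
async function resolveAll(
  queries: { identifier: string; department: string }[],
  search: (identifier: string) => Promise<DirectoryCandidate[]>,
  delayMs = 800,
): Promise<Map<string, string>> {
  const resolved = new Map<string, string>();
  for (const query of queries) {
    const match = pickCandidate(await search(query.identifier), query.department);
    if (match !== null) {
      resolved.set(`${query.identifier}|${query.department}`, match.ucinetid);
    }
    await sleep(delayMs); // fixed pause between consecutive lookups
  }
  return resolved;
}
```

Returning `null` on ambiguity leaves a name unassigned rather than attaching it to the wrong UCINetID, which is the failure mode this PR is trying to remove.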
The last step in this solution was the `instructor_assocation_resolver`, which resolves legacy `websoc_section_to_instructor` rows to the new naming scheme and will subsequently untangle confused courses, for example between `berga1` and `bergac`.

**IMPORTANT NOTE:** Instructors that have not appeared in this quarter's websoc will not have the new name format in `websoc_instructor`, so legacy and newly formatted instructors will both appear in the database. All API functionality should still work as expected (as far as I have found during testing).
Checklist: