
fix: missing or confused instructor data#303

Open
ParzivalPerhaps wants to merge 8 commits into main from fix/missing-or-confused-instructor-data

Conversation

@ParzivalPerhaps
Contributor

Description

This fix comes as a response to #184, which reports multiple instructors confused for one another and others that are missing.

A brief overview of the problems with our previous approach

This problem comes down to issues in the two primary scrapers responsible for updating instructor data: the websoc scraper and the instructor scraper. The websoc scraper previously did not distinguish between people who share the same UCI-normalized name (LAST, FIRST_INITIAL), of which there are quite a few (about 3,551 untracked entries). It was also failing to track some instructors for classes because of a weird usage of Sets. The next part of the problem is how we use the instructor scraper to assign a UCINetID to these normalized names. Before, we only tracked the name, so that was all we had to go off of; as a result, instructors with common names were almost never reliably assigned to an entry in the instructor table. A number of titles were also missing from the search terms, such as the Unit 18 Faculty title. The last part of the problem is that we need to retroactively reattribute classes to professors who are missing their correct classes.

Solution

I solved these problems in a few steps. The first is changing the format in which we store websoc instructor name data: instead of using only the normalized name, we now store a delimited string that also includes the academic school. It looks like this:

Before (websoc_instructor row as JSON)

{
  "name": "JONES, E.G.",
  "updated_at": "2024-10-16 03:06:25.759624+00"
}

After (websoc_instructor row as JSON)

{
  "name": "THORNTON, A.&|*Donald Bren School of Information and Computer Sciences&|*",
  "updated_at": "2026-02-05 10:47:32.04+00",
  "identifier": "THORNTON, A.",
  "school": "Donald Bren School of Information and Computer Sciences",
  "department": "Information and Computer Science"
}
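
As a rough illustration of how the delimited name could be split back into its parts, here is a minimal sketch. The `&|*` delimiter comes from the example row above; `parseInstructorName` and `ParsedInstructorName` are hypothetical names for illustration, not the code in this PR. The example ends with a trailing delimiter whose purpose is not described here, so the sketch simply ignores anything after the school.

```typescript
// Delimiter as it appears in the "After" example row.
const DELIMITER = "&|*";

// Hypothetical shape for illustration only; not the PR's actual type.
interface ParsedInstructorName {
  identifier: string; // UCI-normalized name, e.g. "THORNTON, A."
  school: string | null; // null for legacy rows that only store the name
}

function parseInstructorName(name: string): ParsedInstructorName {
  // Legacy rows contain no delimiter and hold only the normalized name.
  if (!name.includes(DELIMITER)) {
    return { identifier: name, school: null };
  }
  // split() with a string separator treats "&|*" literally (no regex escaping needed).
  const [identifier = name, school = ""] = name.split(DELIMITER);
  // Anything after the school segment is ignored in this sketch.
  return { identifier, school: school === "" ? null : school };
}
```

Because legacy rows fall through to the first branch, the same helper could be applied to both old and new rows without a separate format check.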

The reason I didn't drop the primary key for the websoc_instructor table is that we have a lot of active foreign key constraints involving the name column, and we don't want to lose any valuable historical data in the websoc_section_to_instructor table.

I ended up replacing the Set-based implementation for collecting the unique instructors of a section with a less elegant array filter implementation that is far more robust. The websoc scraper's performance did not seem degraded or otherwise changed in my testing.
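
For readers unfamiliar with the pattern, an array-filter deduplication might look roughly like the sketch below. `SectionInstructor` and `uniqueInstructors` are hypothetical names, and comparing on identifier plus school is an assumption based on the new name format, not a copy of the scraper's code.

```typescript
// Hypothetical record shape; the scraper's real type may differ.
interface SectionInstructor {
  identifier: string; // e.g. "THORNTON, A."
  school: string;
}

// Keep only the first occurrence of each (identifier, school) pair.
// Values are compared field by field, so two distinct objects with the
// same contents count as the same instructor.
function uniqueInstructors(list: SectionInstructor[]): SectionInstructor[] {
  return list.filter(
    (candidate, index) =>
      list.findIndex(
        (other) =>
          other.identifier === candidate.identifier &&
          other.school === candidate.school,
      ) === index,
  );
}
```

This is quadratic in the number of instructors per call, which should be negligible for the handful of instructors attached to a section.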

For the instructor scraper, the primary change is that it now uses the newly tracked department information along with an expanded list of titles. The list includes more niche ones like Graduate (for instructors with the "Graduate xyz Office" title who teach certain classes), along with the previously mentioned Unit 18 and Faculty. The internal logic is fairly similar. The delay between requests has been reduced from 1000ms to 800ms.
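
To make the idea concrete, here is a minimal sketch of narrowing directory results by department and title keyword. The three keywords are the ones named above; `DirectoryEntry`, `pickCandidate`, and the rule of accepting only a single unambiguous match are assumptions for illustration and may not reflect the scraper's actual logic.

```typescript
// Title keywords named in this description; the scraper's full list is not reproduced here.
const TITLE_KEYWORDS = ["Faculty", "Unit 18", "Graduate"];

// Hypothetical shape of one directory search result.
interface DirectoryEntry {
  ucinetid: string;
  name: string;
  title: string;
  department: string;
}

// Return the single entry whose department matches and whose title contains
// one of the keywords; return null when there are zero or several matches.
function pickCandidate(
  entries: DirectoryEntry[],
  department: string,
): DirectoryEntry | null {
  const wanted = department.toLowerCase();
  const matches = entries.filter(
    (entry) =>
      entry.department.toLowerCase() === wanted &&
      TITLE_KEYWORDS.some((keyword) =>
        entry.title.toLowerCase().includes(keyword.toLowerCase()),
      ),
  );
  return matches.length === 1 ? matches[0] ?? null : null;
}
```

Refusing to guess when several entries match trades coverage for correctness, which seems in keeping with the goal of not confusing instructors for one another.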

Some small adjustments had to be made to the websoc service and instructor view schema to use the delimited name in some cases. I suspect more of these cases may exist.

The last step in this solution is the instructor_assocation_resolver, which resolves legacy websoc_section_to_instructor rows to the new naming scheme and will subsequently resolve confused courses, for example between berga1 and bergac. **IMPORTANT NOTE:** Instructors that have not appeared in this quarter's websoc will not have the new name format in websoc_instructor, so legacy and newly formatted instructors will both appear in the database. All API functionality should still work as expected (as far as I have found during testing).
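
The general shape of that resolution step can be sketched as follows. This is not the resolver's code: `LegacyAssociation`, `resolveAssociation`, the section-to-school lookup, and the set of known new-format names are all hypothetical inputs chosen for illustration. Rows that cannot be matched are left untouched, which mirrors the note above about legacy names remaining in the database.

```typescript
// Delimiter as it appears in the new name format.
const DELIMITER = "&|*";

// Hypothetical row shape; the real table's columns may differ.
interface LegacyAssociation {
  sectionId: number;
  instructorName: string;
}

function resolveAssociation(
  row: LegacyAssociation,
  schoolBySection: Map<number, string>, // assumed lookup: section -> school
  knownNames: Set<string>, // assumed: new-format names present in websoc_instructor
): LegacyAssociation {
  // Rows already in the new format need no work.
  if (row.instructorName.includes(DELIMITER)) return row;
  const school = schoolBySection.get(row.sectionId);
  if (school === undefined) return row;
  // Rebuild the delimited name in the same layout as the example row.
  const candidate = `${row.instructorName}${DELIMITER}${school}${DELIMITER}`;
  // Only rewrite when a matching new-format instructor actually exists.
  return knownNames.has(candidate) ? { ...row, instructorName: candidate } : row;
}
```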

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code involves a change to the database schema.
  • My code requires a change to the documentation.

…uctor scraper and relevant parts of websoc scraper. Updated .gitignore to ignore SQL sessions for those who use the VSCode SQL plugin.

Temporarily committing schema changes while finishing up improvements
@ParzivalPerhaps ParzivalPerhaps self-assigned this Feb 9, 2026
@laggycomputer laggycomputer linked an issue Feb 10, 2026 that may be closed by this pull request
@ParzivalPerhaps ParzivalPerhaps marked this pull request as ready for review February 17, 2026 01:42
Member

@laggycomputer laggycomputer left a comment

merge conflict

Member

@laggycomputer laggycomputer left a comment

it is not normative to leave personal authorship messages and metadata

@ParzivalPerhaps
Contributor Author

> it is not normative to leave personal authorship messages and metadata

I don't think it's that deep, but maybe that's just me. After working through the websoc scraper, I found myself wishing there was some higher-level documentation or an explanation of the overall idea, to shorten the time to productivity for new people working on it.

Member

@laggycomputer laggycomputer left a comment

we discussed on call

Development

Successfully merging this pull request may close these issues.

Inaccurate/missing instructor data

2 participants