Refactor scraper to capture section-specific details #46

ramonechen · 2025-12-26T20:50:38Z

What?

This is a really big pull request; probably bigger than it should be.

Closes #17, closes #45, closes #34, closes #29.

This pull request introduces:

- Reorganization of the output JSON to display all class/section details instead of aggregating into courses.
  - CRNs
  - Section numbers
  - Section-specific titles, descriptions, credits, restrictions, faculty, etc.
- Addition of meeting block information to the output JSON.
  - Start and end hours
  - Meeting credit hours
  - Campus codes and descriptions
  - Building codes and descriptions
  - Building room
  - Meeting category (lecture, test, lab/recitation, etc.)
  - Start and end dates
  - Days of the week
- Adjustment of faculty data to display the meetings they are a part of as well as the meetings they are the primary instructor for.
  - Denoted by allMeetings and primaryMeetings lists filled with integer IDs that correspond to meeting blocks.
  - Each meeting block is assigned a local id that counts up from 1.
    - This was my solution for avoiding the data duplication that would occur if every meeting block had its own list of faculty.
- Addition of waitlist data to complement enrollment data.
- New functionality to scrape courses that are normally hidden from SIS class search by discovering them through crosslists.
- Improved logging and verbosity.
- Improved scraping/network performance.
  - The biggest change was using a single, shared TCPConnector for all client sessions instead of creating a new one for every session.
    - This means all sessions draw from one shared connection pool, while each session keeps its own set of cookies.
    - This greatly cut down on the volume of TCP handshakes and connection teardowns, which previously wasted a considerable amount of time.
  - This change also imposes a global maximum number of simultaneous requests instead of a local maximum per client session.
    - This effectively makes the flow of requests to the SIS server much more constant, as the old architecture created a "thundering herd" problem in which one session could blast the server with many requests at once and cause it to throttle responses.
    - Client sessions for each term now fire requests in a more round-robin fashion, so all processing terms gradually approach 100% in sync.
    - As a side effect, scraper logs are now very quiet during the scraping process and all writing happens in a burst at the end of the process.
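
For reviewers who want a picture of the meeting-ID scheme described in the list above, here is a minimal sketch. Only the allMeetings and primaryMeetings names and the ids counting up from 1 come from this PR; the input shape (a `faculty` list with a `primary` flag on each raw meeting), the `id` key, and the helper name `link_faculty_to_meetings` are assumptions made up for illustration.

```python
from typing import Any


def link_faculty_to_meetings(raw_meetings: list[dict[str, Any]]) -> dict[str, Any]:
    """Assign local ids (1, 2, ...) to meeting blocks and reference them from faculty.

    Hypothetical input shape: each raw meeting carries a "faculty" list of
    {"name": str, "primary": bool} entries.
    """
    meetings: list[dict[str, Any]] = []
    faculty: dict[str, dict[str, list[int]]] = {}

    for meeting_id, raw in enumerate(raw_meetings, start=1):
        # Copy everything except the per-meeting faculty list into the block.
        block = {key: value for key, value in raw.items() if key != "faculty"}
        block["id"] = meeting_id
        meetings.append(block)

        for person in raw.get("faculty", []):
            entry = faculty.setdefault(
                person["name"], {"allMeetings": [], "primaryMeetings": []}
            )
            entry["allMeetings"].append(meeting_id)
            if person.get("primary"):
                entry["primaryMeetings"].append(meeting_id)

    return {"meetings": meetings, "faculty": faculty}


# Made-up section with a lecture block and a lab block.
section = link_faculty_to_meetings([
    {"category": "L", "days": ["M", "R"],
     "faculty": [{"name": "Doe, Jane", "primary": True}]},
    {"category": "B", "days": ["W"],
     "faculty": [{"name": "Doe, Jane", "primary": False},
                 {"name": "Roe, Sam", "primary": True}]},
])
```

Each instructor is stored once and points at meeting blocks by id, so the faculty details are not repeated inside every meeting block.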

Why?

In a nutshell: more granular data is good. Aggregating section data into courses is bad because it omits a lot of important information about each section.

See linked issues for more details.

How?

New functions such as get_class_details(), get_class_enrollment(), and get_class_faculty_meetings() were added to sis_api.py to facilitate scraping of hidden classes.

A major refactor to the code in sis_scraper.py was needed to support the new section-level scraping and especially the hidden class scraping.

Testing?

The code has been tested incrementally and has been shown to produce correct results several times in a row on all available terms at once. Other than that, trust me bro.

Anything else?

Have a nice day.

This refactor is the first step in reworking the SIS scraper to fetch more granular information about every section (class) within a course. I plan on expanding on this refactor by scraping more data from SIS that would help generate more insights on courses.

The previous approach was a very experimental one that I tried for fun; after thinking about it, it's probably not great for data consistency over time, as changes in the names of the fields on SIS would cause a crash in the SIS scraper.

Renamed get_class_meetings to get_class_faculty_meetings and updated its return format to include both faculty and meeting details. Refactored the processing logic to associate faculty with meetings, assign unique meeting IDs, and provide a more structured output. Improved error handling and logging for missing data.

Refactored process_class_details to support fetching class details by CRN and term when SIS class entry is not available. The course data structure is now keyed by subject description during processing and converted to subject code before output. Added logic to detect and process hidden crosslisted CRNs not present in the main class search. Improved parallelization and data consistency throughout the scraping process.
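
The hidden-class discovery mentioned above can be pictured as a graph walk over crosslists. The sketch below is an illustration under assumptions, not the code in sis_scraper.py: `discover_hidden_crns` and `fetch_crosslists` are hypothetical names, the crosslist data is made up, and following crosslists transitively is this sketch's choice rather than something the PR states.

```python
from collections import deque
from typing import Callable, Iterable


def discover_hidden_crns(
    searched_crns: Iterable[str],
    fetch_crosslists: Callable[[str], list[str]],
) -> set[str]:
    """Return CRNs reachable through crosslists but absent from the search results."""
    known = set(searched_crns)
    hidden: set[str] = set()
    queue = deque(known)

    while queue:
        crn = queue.popleft()
        for linked in fetch_crosslists(crn):
            if linked not in known and linked not in hidden:
                hidden.add(linked)
                # A hidden class may itself crosslist further hidden classes.
                queue.append(linked)
    return hidden


# Made-up crosslist graph standing in for SIS responses.
fake_crosslists = {
    "10001": ["10002", "90001"],
    "10002": ["10001"],
    "90001": ["10001", "90002"],
    "90002": ["90001"],
}
hidden = discover_hidden_crns(
    ["10001", "10002"], lambda crn: fake_crosslists.get(crn, [])
)
```

In this toy graph, 90001 and 90002 never appear in the search results and are only found because 10001 crosslists with 90001.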

I didn't know that, by default, a session takes ownership of the TCPConnector given to it and automatically closes it when the session closes. This caused every other running session to crash because the shared connector had been closed. I fixed this by adding the connector_owner=False kwarg to every session instantiation.
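
A minimal sketch of the ownership behavior, assuming aiohttp is installed. It makes no network requests, and the function name and the limit value are illustrative rather than taken from the scraper.

```python
import asyncio

import aiohttp


async def demo_shared_connector() -> bool:
    """Return True if the shared connector survives a session being closed."""
    # One connection pool for the whole run; the limit value is illustrative.
    connector = aiohttp.TCPConnector(limit_per_host=20)
    try:
        # connector_owner=False: closing this session must not close the pool.
        async with aiohttp.ClientSession(connector=connector, connector_owner=False):
            pass  # one term's requests would go here

        survived = not connector.closed

        # A second session, with its own cookie jar, can still use the pool.
        async with aiohttp.ClientSession(connector=connector, connector_owner=False):
            pass
        return survived
    finally:
        # Whoever created the connector is responsible for closing it.
        await connector.close()


connector_survived = asyncio.run(demo_shared_connector())
```

With the default connector_owner=True, the first `async with` block would close the connector on exit and any other session sharing it would fail on its next request.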

Replaces individual codify functions and mapping logic with a new CodeMapper class that manages code-name mappings for subjects, attributes, restrictions, and instructors. Updates the main post-processing flow to use process_term and CodeMapper, improving maintainability, normalization, and consistency of code mapping and generation. Mapping files are now updated and saved after processing.
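
To make the idea of one object owning all code-name mappings concrete, here is a deliberately small sketch. It is not the CodeMapper class from this PR: the class name `CodeMapperSketch`, its methods, and the one-JSON-file-per-category layout are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class CodeMapperSketch:
    """Hypothetical stand-in for a code-name mapper; not the class from this PR."""

    def __init__(self) -> None:
        # One code -> name table per category (subjects, attributes, ...).
        self._tables: dict[str, dict[str, str]] = {}

    def register(self, category: str, code: str, name: str) -> None:
        self._tables.setdefault(category, {})[code] = name

    def code_for(self, category: str, name: str) -> Optional[str]:
        """Reverse lookup: find the code whose name matches exactly."""
        for code, known_name in self._tables.get(category, {}).items():
            if known_name == name:
                return code
        return None

    def save(self, directory: Path) -> None:
        """Write one JSON mapping file per category (assumed layout)."""
        for category, table in self._tables.items():
            path = directory / f"{category}.json"
            path.write_text(json.dumps(table, indent=2, sort_keys=True))


mapper = CodeMapperSketch()
mapper.register("subjects", "CSCI", "Computer Science")
mapper.register("subjects", "MATH", "Mathematics")

with tempfile.TemporaryDirectory() as tmp:
    mapper.save(Path(tmp))
    saved = json.loads((Path(tmp) / "subjects.json").read_text())
```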

ramonechen added 22 commits December 1, 2025 13:37

Print traceback on term error

7fee7a4

Allow null values for creditMin and creditMax

14dc824

As far as we know, creditMin should never be null in the SIS data. But creditMax can be null, depending on whether the class has a credit range or not.

Add meeting info to output JSON data

e3b6027

This data will probably be expanded on later once we understand better what all of the data from SIS really means.

Add waitlist metrics to class data

e4c905c

Add meeting categories to class data

733aba9

From what I understand after a brief look, L typically means lecture, T is test, and B is lab/recitation. I'm not sure if there are others, and SIS doesn't seem to provide any legend for this either.
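
A small sketch of how these codes could be labeled while staying safe against codes nobody has seen yet. The three entries come from the guess above; the dictionary name, function name, and fallback format are assumptions.

```python
# Meanings inferred from a brief look at SIS data; SIS provides no legend.
MEETING_CATEGORIES = {
    "L": "lecture",
    "T": "test",
    "B": "lab/recitation",
}


def describe_category(code: str) -> str:
    """Return a readable label, keeping unknown codes visible instead of failing."""
    return MEETING_CATEGORIES.get(code, f"unknown ({code})")
```

Falling back to a label that still contains the raw code means a new category shows up in the output rather than crashing the scraper.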

Add return type to get_class_crosslists

ed235e6

Add _process_class_meetings()

b6a2490

This is a first step in centralizing the logic for processing class meeting information.

Add get_class_meetings()

b36289f

This function will be necessary for classes that don't appear in SIS's main class search.

Add get_class_details()

035395c

See last commit description.

Add get_class_enrollment()

b2a143d

See last commit description.

Fix credit parsing in get_class_details()

aad1010

Previously, the credit values would end up as the name of the next <span> tag in the HTML data (which was often "Grade Modifiers:"). This has been fixed by replacing find_next_sibling() with the next_sibling attribute.
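
The difference is easy to reproduce, assuming the scraper parses HTML with Beautiful Soup (the method names suggest it). The markup below is made up to mimic a label followed by a bare text value; it is not copied from SIS.

```python
from bs4 import BeautifulSoup

# Made-up markup: each label <span> is followed by a bare text value.
html = (
    "<section>"
    "<span>Credit Hours:</span> 4 "
    "<span>Grade Modifiers:</span> Standard"
    "</section>"
)
soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", string="Credit Hours:")

# find_next_sibling("span") looks for the next sibling <span> tag,
# so it skips the text node that actually holds the credit value.
wrong = label.find_next_sibling("span").get_text(strip=True)

# .next_sibling is the very next node, here the text node with the value.
right = str(label.next_sibling).strip()
```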

Change some error logs to fatal logs

3f5321d

In these cases, fatal makes more sense as the scraper is designed to not recover after these errors.

Create global TCPConnector for aiohttp sessions

5bbcfd6

Replaces per-call TCPConnector creation with a single shared aiohttp.TCPConnector instance, passed to all relevant functions. This reduces resource usage and allows for better connection pooling across parallel tasks.

Improve error logging with full tracebacks

cc96208

Enhanced error logging in sis_scraper.py by including full exception tracebacks in log messages instead of printing them separately. This provides more detailed context for debugging and consolidates error information in the logs.
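
One standard way to get the traceback into the log record itself is `logger.exception()`. This sketch uses a made-up logger name, term code, and failing function, and the scraper may do it differently (for example with `exc_info=True`).

```python
import io
import logging

# Capture log output in memory so the result can be inspected.
log_stream = io.StringIO()
logger = logging.getLogger("sis_scraper_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(log_stream))


def process_term_stub(term: str) -> None:
    """Hypothetical failing step, standing in for real term processing."""
    raise ValueError(f"no classes returned for term {term}")


try:
    process_term_stub("202601")
except Exception:
    # logger.exception() logs at ERROR level and appends the active traceback.
    logger.exception("Failed to process term %s", "202601")

log_text = log_stream.getvalue()
```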

Raise max simultaneous connections to 20

09c2dbb

Renamed the 'semaphore_val' parameter to 'max_concurrent_sessions' for clarity and increased 'limit_per_host' from 5 to 20 to allow more simultaneous connections per host. Updated related docstrings and variable usage accordingly.

Improve retry logging with URL and params

d4b79d3

Enhanced the retry warning log to include the request URL, parameters, and exception details for better debugging of failed requests.
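
A rough sketch of a retry wrapper whose warning carries the URL, parameters, and exception. The helper name `with_retries`, its signature, the example URL, and the simulated failure are all assumptions; the real retry logic lives in the scraper and may look different.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("sis_api_sketch")


async def with_retries(
    request: Callable[[], Awaitable[Any]],
    url: str,
    params: dict[str, str],
    attempts: int = 3,
    delay: float = 0.0,
) -> Any:
    """Await request() up to `attempts` times, logging context on each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await request()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            logger.warning(
                "Request failed (attempt %d/%d): url=%s params=%s error=%r",
                attempt, attempts, url, params, exc,
            )
            await asyncio.sleep(delay)


# Simulated request that fails twice before succeeding.
calls = {"count": 0}


async def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated drop")
    return "ok"


result = asyncio.run(
    with_retries(flaky_request, "https://example.invalid/search", {"term": "202601"})
)
```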

Increase concurrency and update TCPConnector settings

e24a2a4

Raised max_concurrent_sessions to 25 and limit_per_host to 75 for improved parallelism. Added keepalive_timeout and force_close options to aiohttp.TCPConnector for better connection management.
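
To illustrate what a global cap on in-flight requests does, the sketch below swaps in a plain asyncio.Semaphore shared by every term. In this PR the cap comes from the shared connector's limits together with max_concurrent_sessions, so treat this as an analogy; the function names, term codes, and numbers are made up.

```python
import asyncio


async def scrape_all(terms: list[str], requests_per_term: int, max_concurrent: int) -> int:
    """Run fake requests for every term under one shared cap; return peak concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)  # shared by all terms
    in_flight = 0
    peak = 0

    async def fake_request(term: str, index: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stands in for network I/O
            in_flight -= 1

    await asyncio.gather(
        *(fake_request(term, i) for term in terms for i in range(requests_per_term))
    )
    return peak


peak = asyncio.run(scrape_all(["202601", "202605", "202609"], 10, max_concurrent=5))
```

Because the cap is shared, no single term can burst past it, which is what keeps the request flow to the server steady.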

ramonechen added the Scraper label Dec 26, 2025

ramonechen added 2 commits December 26, 2025 16:11

Reorder "subjectDescription" field to be above course data

b6cfeec

Putting the subject description right underneath the subject code in the output JSON is much nicer to read. It was also the original plan.
