What?

This is a really big pull request; probably bigger than it should be.

Closes #17, closes #45, closes #34, closes #29.

This pull request introduces:

  • Reorganization of the output JSON to display all class/section details instead of aggregating into courses.
    • CRNs
    • Section numbers
    • Section-specific titles, descriptions, credits, restrictions, faculty, etc.
  • Addition of meeting block information to the output JSON.
    • Start and end hours
    • Meeting credit hours
    • Campus codes and descriptions
    • Building codes and descriptions
    • Building room
    • Meeting category (lecture, test, lab/recitation, etc.)
    • Start and end dates
    • Days of the week
  • Adjustment of faculty data to list both the meetings each instructor is a part of and the meetings they are the primary instructor for.
    • Denoted by allMeetings and primaryMeetings lists filled with integer IDs that correspond to meeting blocks (see the sketch after this list).
    • Each meeting block is assigned a local id that counts up from 1.
      • This was my solution for avoiding the data duplication that would occur if every meeting block carried its own list of faculty.
  • Addition of waitlist data to help complement enrollment data.
  • New functionality to scrape courses that are normally hidden from SIS class search, discovered via their crosslists.
  • Improved logging and verbosity.
  • Improved scraping/network performance.
    • The biggest change was using a single, shared TCPConnector for all client sessions instead of creating a new one for every session.
      • This means that a single connection pool to the server is maintained and shared by the individual sessions, each of which keeps its own set of cookies.
      • This greatly cut down on the volume of TCP handshakes and connection teardowns, which previously wasted a considerable amount of time.
    • This change also imposes a global maximum number of simultaneous requests instead of a local maximum per client session.
      • This effectively makes the flow of requests to the SIS server much more constant, as the old architecture created a "thundering herd" problem in which one session could blast the server with many requests at once and cause it to throttle responses.
      • Client sessions for each term now fire requests in a more round-robin fashion, so all terms being processed approach 100% completion roughly in sync.
      • As a side effect, scraper logs are now very quiet during the scraping process and all writing happens in a burst at the end of the process.
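
For illustration, here is a rough sketch of what one section entry in the new output JSON might look like, written as a Python literal. Apart from allMeetings, primaryMeetings, creditMin, and creditMax (which are named in this PR), the key names and values are placeholders, not the exact fields the scraper emits.

```python
# Rough sketch of a section entry; key names other than allMeetings/primaryMeetings
# and creditMin/creditMax are illustrative placeholders, not the scraper's real keys.
example_section = {
    "crn": "12345",
    "sectionNumber": "01",
    "title": "Section-specific title",
    "creditMin": 4,          # as far as we know, never null in the SIS data
    "creditMax": None,       # null unless the class has a credit range
    "waitlist": {"capacity": 10, "actual": 2, "remaining": 8},
    "meetings": [
        {
            "id": 1,                         # local id, counting up from 1
            "category": "L",                 # L = lecture, T = test, B = lab/recitation
            "days": ["M", "R"],
            "startTime": "10:00",
            "endTime": "11:50",
            "startDate": "2025-01-06",
            "endDate": "2025-04-23",
            "creditHours": 4,
            "campus": {"code": "X", "description": "Example Campus"},
            "building": {"code": "BLDG", "description": "Example Hall"},
            "room": "308",
        },
    ],
    "faculty": [
        {
            "name": "Example Instructor",
            "allMeetings": [1],        # ids of every meeting block they are part of
            "primaryMeetings": [1],    # ids of meeting blocks they are primary instructor for
        },
    ],
}
```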

Why?

In a nutshell: more granular data is good; aggregating section data into courses is bad because it omits a lot of important information about each section.

See linked issues for more details.

How?

New functions such as get_class_details(), get_class_enrollment(), and get_class_faculty_meetings() were added to sis_api.py to facilitate scraping of hidden classes.

A major refactor of the code in sis_scraper.py was needed to support the new section-level scraping, and especially the hidden class scraping.
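
Very roughly, the hidden-class path works by pulling crosslist information out of the details of known sections and then processing, by CRN and term, any CRN that never showed up in the main class search. The sketch below only shows that shape; the signature of get_class_details() and the crosslistedCrns key are assumptions for illustration, not the actual interfaces in sis_api.py.

```python
# Hypothetical sketch of the crosslist discovery flow; the signature of
# get_class_details() and the "crosslistedCrns" key are assumed, not real.
from sis_api import get_class_details

async def discover_hidden_crns(session, term: str, known_crns: set[str]) -> set[str]:
    """Return crosslisted CRNs that never appeared in the main SIS class search."""
    hidden = set()
    for crn in known_crns:
        details = await get_class_details(session, term, crn)
        for linked_crn in details.get("crosslistedCrns", []):
            if linked_crn not in known_crns:
                hidden.add(linked_crn)
    return hidden
```

Each hidden CRN would then be handed to the class-details processing step by CRN and term, since there is no class-search entry for it to start from.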

Testing?

The code has been tested incrementally and has been shown to produce correct results several times in a row on all available terms at once. Other than that, trust me bro.

Anything else?

Have a nice day.

This refactor is the first step in reworking the SIS scraper to fetch more granular information about every section (class) within a course. I plan on expanding on this refactor by scraping more data from SIS that would help generate more insights on courses.
As far as we know, creditMin should never be null in the SIS data. But creditMax can be null, depending on whether the class has a credit range or not.
This data will probably be expanded on later once we understand better what all of the data from SIS really means.
From what I understand after a brief look, L typically means lecture, T is test, and B is lab/recitation. I'm not sure if there are others, and SIS doesn't seem to provide any legend for this either.
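
For reference, a minimal best-guess mapping of those codes might look like this (any code outside these three is passed through unchanged rather than guessed at):

```python
# Best-guess labels for SIS meeting category codes; SIS provides no legend,
# so unknown codes fall through as-is.
MEETING_CATEGORY_LABELS = {
    "L": "Lecture",
    "T": "Test",
    "B": "Lab/Recitation",
}

def meeting_category_label(code: str) -> str:
    return MEETING_CATEGORY_LABELS.get(code, code)
```
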
This is a first step in centralizing the logic for processing class meeting information.
This function will be necessary for classes that don't appear in SIS's main class search.
See last commit description.
See last commit description.
Previously, the credit values would end up as the text of the next <span> tag in the HTML data (which was often "Grade Modifiers:"). This has been fixed by replacing find_next_sibling() with the next_sibling attribute.
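
The difference is that find_next_sibling() skips bare text nodes and returns the next tag, while the next_sibling attribute returns whatever node immediately follows, including the text node that actually holds the credit value. A minimal illustration (the markup here is made up, not the actual SIS HTML):

```python
from bs4 import BeautifulSoup

# Made-up markup shaped like the problem: the credit value is a bare text node
# sitting between two <span> labels.
html = "<span>Credit Hours:</span> 4.000 <span>Grade Modifiers:</span>"
soup = BeautifulSoup(html, "html.parser")
label = soup.find("span", string="Credit Hours:")

print(label.find_next_sibling())  # <span>Grade Modifiers:</span> -- skips the text node
print(label.next_sibling)         # " 4.000 " -- the text node holding the credits
```
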
The previous approach was a very experimental one that I tried for fun; after thinking about it, it's probably not great for data consistency over time, as changes in the names of the fields on SIS would cause a crash in the SIS scraper.
Renamed get_class_meetings to get_class_faculty_meetings and updated its return format to include both faculty and meeting details. Refactored the processing logic to associate faculty with meetings, assign unique meeting IDs, and provide a more structured output. Improved error handling and logging for missing data.
In these cases, fatal makes more sense, as the scraper is designed not to recover from these errors.
Replaces per-call TCPConnector creation with a single shared aiohttp.TCPConnector instance, passed to all relevant functions. This reduces resource usage and allows for better connection pooling across parallel tasks.
Refactored process_class_details to support fetching class details by CRN and term when SIS class entry is not available. The course data structure is now keyed by subject description during processing and converted to subject code before output. Added logic to detect and process hidden crosslisted CRNs not present in the main class search. Improved parallelization and data consistency throughout the scraping process.
Enhanced error logging in sis_scraper.py by including full exception tracebacks in log messages instead of printing them separately. This provides more detailed context for debugging and consolidates error information in the logs.
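
A minimal sketch of that pattern with the standard logging module (the function here is just a stand-in for whatever call fails):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("sis_scraper")

def parse_credit_hours(raw: str) -> float | None:
    try:
        return float(raw)
    except ValueError:
        # logger.exception appends the full traceback to the message, so the
        # context and the stack trace end up in the same log record.
        logger.exception("Could not parse credit hours from %r", raw)
        return None

parse_credit_hours("Grade Modifiers:")  # logs the message together with the traceback
```
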
I didn't know that a session takes ownership of the TCPConnector given to it by default and automatically closes it at the end of the session. This caused every other running session to crash once the connector was closed. I fixed this by adding the connector_owner=False kwarg to every session instantiation.
Renamed the 'semaphore_val' parameter to 'max_concurrent_sessions' for clarity and increased 'limit_per_host' from 5 to 20 to allow more simultaneous connections per host. Updated related docstrings and variable usage accordingly.
Enhanced the retry warning log to include the request URL, parameters, and exception details for better debugging of failed requests.
Raised max_concurrent_sessions to 25 and limit_per_host to 75 for improved parallelism. Added keepalive_timeout and force_close options to aiohttp.TCPConnector for better connection management.
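
Taken together, the connection setup now looks roughly like the sketch below. The numbers (25 concurrent sessions, limit_per_host=75) come from the commit messages; the keepalive value, the URL, and the way the semaphore wraps each request are assumptions about the shape of the code, not copies of it.

```python
import asyncio
import aiohttp

MAX_CONCURRENT_SESSIONS = 25   # renamed from semaphore_val in this PR
LIMIT_PER_HOST = 75

async def scrape_all_terms(terms: list[str]) -> None:
    # One shared connector for every term's session: a single connection pool to the
    # SIS server instead of a fresh connector (and TCP handshake churn) per session.
    # (The commit also mentions a force_close option; it is left out here because
    # force_close=True cannot be combined with keepalive_timeout.)
    connector = aiohttp.TCPConnector(limit_per_host=LIMIT_PER_HOST, keepalive_timeout=30)
    # A single semaphore shared by all sessions caps simultaneous requests globally.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_SESSIONS)
    try:
        await asyncio.gather(*(scrape_term(term, connector, semaphore) for term in terms))
    finally:
        await connector.close()

async def scrape_term(term: str, connector: aiohttp.TCPConnector,
                      semaphore: asyncio.Semaphore) -> str:
    # connector_owner=False keeps the session from closing the shared connector on exit,
    # which would otherwise crash every other running session.
    async with aiohttp.ClientSession(connector=connector, connector_owner=False) as session:
        async with semaphore:
            async with session.get("https://sis.example.edu/classSearch",
                                   params={"term": term}) as resp:
                return await resp.text()
```
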
Putting the subject description right underneath the subject code in the output JSON is much nicer to read. It was also the original plan.
Replaces individual codify functions and mapping logic with a new CodeMapper class that manages code-name mappings for subjects, attributes, restrictions, and instructors. Updates the main post-processing flow to use process_term and CodeMapper, improving maintainability, normalization, and consistency of code mapping and generation. Mapping files are now updated and saved after processing.
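
Since the commit message only describes CodeMapper's responsibilities, here is a hypothetical sketch of the shape such a class could take; the method names and file layout are guesses, not the actual implementation.

```python
import json
from pathlib import Path

class CodeMapper:
    """Hypothetical sketch: one object holding code -> name mappings for subjects,
    attributes, restrictions, and instructors, persisted after processing."""

    CATEGORIES = ("subjects", "attributes", "restrictions", "instructors")

    def __init__(self, path: Path):
        self.path = path
        if path.exists():
            self.mappings = json.loads(path.read_text())
        else:
            self.mappings = {category: {} for category in self.CATEGORIES}

    def record(self, category: str, code: str, name: str) -> None:
        """Remember that `code` maps to `name` within a category."""
        self.mappings[category][code] = name

    def name_for(self, category: str, code: str) -> str | None:
        return self.mappings[category].get(code)

    def save(self) -> None:
        # Mapping files are updated and saved after processing, per the commit above.
        self.path.write_text(json.dumps(self.mappings, indent=2, sort_keys=True))
```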