Refactor scraper to capture section-specific details #46
Draft: ramonechen wants to merge 24 commits into main from scrape-section-details
This refactor is the first step in reworking the SIS scraper to fetch more granular information about every section (class) within a course. I plan on expanding on this refactor by scraping more data from SIS that would help generate more insights on courses.
As far as we know, creditMin should never be null in the SIS data. But creditMax can be null, depending on whether the class has a credit range or not.
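As a minimal sketch of that nullability rule, the credit fields could be modeled as below. Only the names creditMin and creditMax come from this PR; the Credits class and the parse_credits helper are hypothetical and are not taken from the scraper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credits:
    # Hypothetical container; only the two field names come from the PR.
    creditMin: float
    creditMax: Optional[float] = None  # None when the class has no credit range


def parse_credits(credit_min, credit_max=None) -> Credits:
    """Build a Credits record, rejecting a missing minimum."""
    if credit_min is None:
        # Assumption from the PR text: SIS should always supply a minimum.
        raise ValueError("creditMin is unexpectedly null")
    maximum = None if credit_max is None else float(credit_max)
    return Credits(float(credit_min), maximum)


fixed = parse_credits("4")        # no range, so creditMax stays None
ranged = parse_credits("1", "4")  # credit range from 1 to 4
```

Failing loudly on a null minimum is one possible design choice; it would surface a change in the SIS data early instead of writing an incomplete record.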
This data will probably be expanded on later once we understand better what all of the data from SIS really means.
From what I understand after a brief look, L typically means lecture, T is test, and B is lab/recitation. I'm not sure if there are others, and SIS doesn't seem to provide any legend for this either.
This is a first step in centralizing the logic for processing class meeting information.
This function will be necessary for classes that don't appear in SIS's main class search.
See last commit description.
See last commit description.
Previously, the credit values would end up as the name of the next <span> tag in the HTML data (which was often "Grade Modifiers:"). This has been fixed by replacing find_next_sibling() with the next_sibling attribute.
The previous approach was a very experimental one that I tried for fun; after thinking about it, it's probably not great for data consistency over time, as changes in the names of the fields on SIS would cause a crash in the SIS scraper.
Renamed get_class_meetings to get_class_faculty_meetings and updated its return format to include both faculty and meeting details. Refactored the processing logic to associate faculty with meetings, assign unique meeting IDs, and provide a more structured output. Improved error handling and logging for missing data.
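The association described above could look roughly like the sketch below. The input shape, the function name build_faculty_meetings, and the "time", "name", and "primary" keys are assumptions made for illustration; only the idea of meeting IDs counting up from 1 and the allMeetings and primaryMeetings lists come from this PR.

```python
def build_faculty_meetings(raw_meetings):
    """Assign 1-based meeting IDs and link each instructor to those IDs.

    raw_meetings is a hypothetical input: a list of dicts with a "time"
    string and a "faculty" list of {"name": str, "primary": bool} dicts.
    """
    meetings = []
    faculty = {}
    for meeting_id, raw in enumerate(raw_meetings, start=1):
        meetings.append({"id": meeting_id, "time": raw["time"]})
        for person in raw.get("faculty", []):
            # Create the instructor entry the first time the name is seen.
            entry = faculty.setdefault(
                person["name"],
                {"name": person["name"], "allMeetings": [], "primaryMeetings": []},
            )
            entry["allMeetings"].append(meeting_id)
            if person.get("primary", False):
                entry["primaryMeetings"].append(meeting_id)
    return {"faculty": list(faculty.values()), "meetings": meetings}


example = build_faculty_meetings(
    [
        {
            "time": "MR 10:00-11:50",
            "faculty": [
                {"name": "Instructor A", "primary": True},
                {"name": "Instructor B", "primary": False},
            ],
        },
        {"time": "W 14:00-15:50", "faculty": [{"name": "Instructor B", "primary": True}]},
    ]
)
```

Storing integer IDs on the faculty side instead of copying meeting blocks keeps each meeting defined once, which is presumably the motivation for the ID scheme.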
In these cases, fatal makes more sense as the scraper is designed to not recover after these errors.
Replaces per-call TCPConnector creation with a single shared aiohttp.TCPConnector instance, passed to all relevant functions. This reduces resource usage and allows for better connection pooling across parallel tasks.
Refactored process_class_details to support fetching class details by CRN and term when SIS class entry is not available. The course data structure is now keyed by subject description during processing and converted to subject code before output. Added logic to detect and process hidden crosslisted CRNs not present in the main class search. Improved parallelization and data consistency throughout the scraping process.
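The final conversion from subject description to subject code might be sketched as follows. The function name, the description_to_code mapping, and the "subjectDescription" and "courses" keys are hypothetical; the actual output format of the scraper may differ.

```python
def rekey_by_subject_code(courses_by_description, description_to_code):
    """Return a copy of the course data keyed by subject code.

    The description is kept inside each value so it is not lost.
    """
    rekeyed = {}
    for description, courses in courses_by_description.items():
        code = description_to_code.get(description)
        if code is None:
            # Refuse to guess a code for a subject that was never mapped.
            raise KeyError(f"no subject code known for {description!r}")
        rekeyed[code] = {"subjectDescription": description, "courses": courses}
    return rekeyed


output = rekey_by_subject_code(
    {"Computer Science": {"1100": {"sections": []}}},
    {"Computer Science": "CSCI"},
)
```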
Enhanced error logging in sis_scraper.py by including full exception tracebacks in log messages instead of printing them separately. This provides more detailed context for debugging and consolidates error information in the logs.
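One standard-library way to get the traceback into the log record itself is the exc_info argument of the logging calls. The sketch below illustrates that mechanism; the logger name, run_step, and bad_step are hypothetical and do not reflect the actual code in sis_scraper.py.

```python
import io
import logging


def make_logger(stream):
    """Return a logger that writes plain records to the given stream."""
    logger = logging.getLogger("sis_scraper_example")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    return logger


def run_step(name, step, logger):
    """Run step(); on failure, log the message and traceback as one record."""
    try:
        step()
    except Exception:
        # exc_info=True appends the active exception's traceback to the record.
        logger.error("Step %s failed", name, exc_info=True)
        return False
    return True


def bad_step():
    raise ValueError("bad CRN")


stream = io.StringIO()
succeeded = run_step("class details", bad_step, make_logger(stream))
log_text = stream.getvalue()  # message followed by the full traceback
```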
I didn't know that a session takes ownership of the TCPConnector given to it by default and automatically closes it when the session ends. This caused every other running session to crash because the shared connector had been closed. I fixed this by adding the connector_owner=False kwarg to every session instantiation.
Renamed the 'semaphore_val' parameter to 'max_concurrent_sessions' for clarity and increased 'limit_per_host' from 5 to 20 to allow more simultaneous connections per host. Updated related docstrings and variable usage accordingly.
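To illustrate what a max_concurrent_sessions cap does, here is a sketch that uses only asyncio. The run_limited helper and the fake_fetch coroutine are hypothetical stand-ins; the real scraper opens aiohttp sessions where this sketch just sleeps.

```python
import asyncio


async def run_limited(jobs, max_concurrent_sessions):
    """Run zero-argument coroutine functions with a cap on concurrency.

    Returns the results plus the highest number of jobs seen running at once.
    """
    semaphore = asyncio.Semaphore(max_concurrent_sessions)
    active = 0
    peak = 0

    async def worker(job):
        nonlocal active, peak
        async with semaphore:  # waits here once the cap is reached
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak


async def fake_fetch():
    await asyncio.sleep(0.01)  # stand-in for a network request
    return "ok"


results, peak = asyncio.run(run_limited([fake_fetch] * 10, max_concurrent_sessions=3))
```

The semaphore bounds how many sessions exist at once, while limit_per_host on the connector bounds open connections per host, so the two settings are tuned separately.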
Enhanced the retry warning log to include the request URL, parameters, and exception details for better debugging of failed requests.
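A retry loop with that kind of warning could be sketched as below. It is synchronous for brevity, whereas the scraper itself is built on aiohttp; fetch_with_retry, the logger name, and the example URL are hypothetical.

```python
import logging

logger = logging.getLogger("sis_api_example")  # hypothetical logger name


def fetch_with_retry(fetch, url, params, max_attempts=3):
    """Call fetch(url, params), logging full context on each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url, params)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of retries, let the caller handle it
            logger.warning(
                "Attempt %d/%d failed for %s with params %s: %r",
                attempt, max_attempts, url, params, exc,
            )


class ListHandler(logging.Handler):
    """Collect formatted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


calls = []


def flaky_fetch(url, params):
    # Fails twice, then succeeds, to exercise the retry path.
    calls.append((url, params))
    if len(calls) < 3:
        raise ConnectionError("connection reset")
    return "ok"


handler = ListHandler()
logger.addHandler(handler)
result = fetch_with_retry(flaky_fetch, "https://sis.example.edu/search", {"term": "202409"})
```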
Raised max_concurrent_sessions to 25 and limit_per_host to 75 for improved parallelism. Added keepalive_timeout and force_close options to aiohttp.TCPConnector for better connection management.
Putting the subject description right underneath the subject code in the output JSON is much nicer to read. It was also the original plan.
Replaces individual codify functions and mapping logic with a new CodeMapper class that manages code-name mappings for subjects, attributes, restrictions, and instructors. Updates the main post-processing flow to use process_term and CodeMapper, improving maintainability, normalization, and consistency of code mapping and generation. Mapping files are now updated and saved after processing.
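The general shape of such a mapper might resemble the sketch below. Everything except the class name CodeMapper is an assumption: the method names, the JSON file layout, and especially the generated code format (a three-letter prefix plus a counter) are invented for illustration and are not the PR's actual scheme.

```python
import json
import tempfile
from pathlib import Path


class CodeMapper:
    """Hypothetical sketch: one name-to-code table per category."""

    def __init__(self, mappings=None):
        self.mappings = {k: dict(v) for k, v in (mappings or {}).items()}

    def get_code(self, category, name):
        """Return the existing code for name, or generate and record one."""
        table = self.mappings.setdefault(category, {})
        if name not in table:
            # Invented code format, e.g. "INS001" for the first instructor.
            table[name] = f"{category[:3].upper()}{len(table) + 1:03d}"
        return table[name]

    def save(self, path):
        text = json.dumps(self.mappings, indent=2, sort_keys=True)
        Path(path).write_text(text, encoding="utf-8")

    @classmethod
    def load(cls, path):
        path = Path(path)
        if not path.exists():
            return cls()
        return cls(json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "mappings.json"
    mapper = CodeMapper()
    first = mapper.get_code("instructor", "Example Instructor")
    mapper.save(mapping_path)
    # A later run reloads the file, so the same name keeps the same code.
    second = CodeMapper.load(mapping_path).get_code("instructor", "Example Instructor")
```

Persisting the tables between runs is what would keep codes stable over time, which appears to be the point of saving the mapping files after processing.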
What?
This is a really big pull request; probably bigger than it should be.
Closes #17, closes #45, closes #34, closes #29.
This pull request introduces:
allMeetings and primaryMeetings lists filled with integer IDs that correspond to meeting blocks.
id that counts up from 1.
Why?
In a nutshell, more granular data good, aggregating section data into courses bad because it omits lots of important information about each section.
See linked issues for more details.
How?
New functions such as get_class_details(), get_class_enrollment(), and get_class_faculty_meetings() were added to sis_api.py to facilitate scraping of hidden classes.
A major refactor to the code in sis_scraper.py was needed to support the new section-level scraping and especially the hidden class scraping.
Testing?
The code has been tested incrementally and has been shown to produce correct results several times in a row when run on all available terms at once. Other than that, trust me bro.
Anything else?
Have a nice day.