Modernize core application infrastructure #11
Conversation
Literally copy-pasted over from CARPI-Data. The plan is to set up scheduled scraping and automate the process of updating our database with the latest data.
The scraper router is very incomplete; only a dummy thread has been created so far. I plan to use this thread to automatically update the database periodically.
Replaced the threading approach with the higher-level APScheduler library, which is now also in requirements.txt. The scraper will now actually run as well, although it will error out because the "data" directory doesn't exist and the scraper doesn't check for that. It's currently configured to run automatically every Monday at 4:00 AM, New York time.
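For reference, here's a minimal sketch of how a weekly APScheduler job like this can be wired up; the scheduler class, function name, and job id below are illustrative assumptions, not the actual code in this branch:

```python
# Minimal sketch: run a scraping job every Monday at 4:00 AM New York time.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger


def run_sis_scraper() -> None:
    # Placeholder for the real scraping routine.
    print("Scraping SIS data...")


scheduler = BackgroundScheduler(timezone="America/New_York")
scheduler.add_job(
    run_sis_scraper,
    CronTrigger(day_of_week="mon", hour=4, minute=0),
    id="sis_scraper",
    replace_existing=True,
)
scheduler.start()
```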
This endpoint returns a 409 if the scraper is already running, whether because of the scheduled job or because the endpoint was called recently. It also currently waits for the scraper to terminate before returning a response, which may change later.
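A rough sketch of what such an endpoint could look like with a simple in-process flag; the route path, flag name, and `run_scraper` helper are illustrative assumptions:

```python
# Sketch: reject manual scraper runs with 409 Conflict while a run is in progress.
from fastapi import APIRouter, HTTPException

router = APIRouter(prefix="/scraper", tags=["scraper"])
scraper_running = False  # shared flag, also set by the scheduled job


async def run_scraper() -> None:
    # Placeholder for the real scraping routine.
    ...


@router.post("/run")
async def trigger_scraper():
    global scraper_running
    if scraper_running:
        raise HTTPException(status_code=409, detail="Scraper is already running")
    scraper_running = True
    try:
        await run_scraper()  # currently waits for the scraper to finish
    finally:
        scraper_running = False
    return {"status": "completed"}
```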
Didn't realize uvloop isn't compatible with Windows, but the app seems to work fine without it. So I added a comment saying to comment this line out before installing the required packages on Windows.
Also began making use of the app-wide __init__.py by delegating some global function and variable imports to the file.
Before, dependencies.py would search for a .env file starting from the current working directory. This has now been changed so that .env is looked up relative to dependencies.py itself.
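A minimal sketch of the pattern, assuming python-dotenv is in use (the exact loading code in dependencies.py may differ):

```python
# Resolve .env relative to this file rather than the current working directory.
from pathlib import Path

from dotenv import load_dotenv

ENV_PATH = Path(__file__).resolve().parent / ".env"
load_dotenv(dotenv_path=ENV_PATH)
```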
Eliminates the need to import and include API routers one by one in the main file.
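An illustrative sketch of what dynamic router discovery can look like, assuming a `routers` package where each module exposes a `router` attribute; the helper name here is hypothetical:

```python
# Scan app/routers and include every module that defines a `router` attribute.
import importlib
import pkgutil

from fastapi import FastAPI

from app import routers  # package containing one module per API router


def register_routers(app: FastAPI) -> None:
    for module_info in pkgutil.iter_modules(routers.__path__):
        module = importlib.import_module(f"{routers.__name__}.{module_info.name}")
        router = getattr(module, "router", None)
        if router is not None:
            app.include_router(router)
```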
Not sure if this makes any real difference, but I thought it couldn't hurt to include it.
Just to eliminate any confusion about utils.py being a global utils file.
The background job scheduler is now available as a global variable in the main app package. All scraping logic is now delegated to the sis_scraper.py file, including the boolean flag that ensures only one scrape runs at a time. The scraper also now integrates with FastAPI (technically Uvicorn) logging. The recurring scraper job is now registered in the app's main file.
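A hedged sketch of how the lock flag and Uvicorn logger reuse could look inside sis_scraper.py; the names below are assumptions:

```python
# Module-level lock flag plus reuse of Uvicorn's already-configured logger.
import logging

logger = logging.getLogger("uvicorn.error")  # piggyback on Uvicorn's log handlers
_scraper_running = False


async def scrape() -> bool:
    """Run one scrape; return False if a run was already in progress."""
    global _scraper_running
    if _scraper_running:
        logger.warning("Scrape requested while another run is in progress; skipping.")
        return False
    _scraper_running = True
    try:
        logger.info("Starting SIS scrape...")
        # ... actual scraping logic lives here ...
        return True
    finally:
        _scraper_running = False
```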
Before, it would create a "data" folder at the current working directory; this has now been changed to create one in the same directory as the scraper file.
Taken from the CARPI-Data repository, but modified so that it has no external dependencies. Code that is no longer relevant has also been removed entirely from this file.
Coming straight from the CARPI-Data repo.
With the switch to a new service for SIS on 09/17/2025, we decided to rewrite our existing SIS scraper with simpler logic and better maintainability in mind. The old scrapers still work; we just don't really like touching that code. This new scraper IS NOT COMPLETE yet. I discovered a working procedure via Postman for getting all course data, and am in the middle of automating it in Python. A current issue is that the GET request made in get_cookie() does not return a cookie, which is needed to access any course data. This is actively being looked into.
Small restructure of functions and an addition of clarifying docstrings. The issue presented in the last commit turned out to be a result of a gap in my understanding of cookies; a JSESSIONID cookie is assigned on the first call to any SIS page. I was inadvertently calling get_subjects() before attempting to read a cookie from get_cookie(), at which point the aiohttp session would already have a cookie and therefore would not receive another one from the server. In addition, I did not know that the aiohttp session automatically stores cookies and sends them in subsequent requests. Therefore, there is no need to explicitly handle cookie logic.
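A small self-contained sketch of the behavior described, with a placeholder URL standing in for the real SIS endpoints:

```python
# aiohttp's ClientSession keeps a cookie jar, so the JSESSIONID set by the first SIS
# response is automatically re-sent on every subsequent request in the same session.
import asyncio

import aiohttp

SIS_BASE_URL = "https://sis.example.edu"  # placeholder, not the real SIS host


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # First request: the server assigns JSESSIONID, stored in session.cookie_jar.
        async with session.get(f"{SIS_BASE_URL}/some/page") as resp:
            await resp.text()
        # Later requests automatically include the stored cookie; no manual handling needed.
        async with session.get(f"{SIS_BASE_URL}/courseSearch") as resp:
            await resp.text()
        print(len(session.cookie_jar), "cookie(s) stored")


asyncio.run(main())
```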
I'm basically following the Single Responsibility Principle as I write this new SIS scraper. The course_search() method, previously get_courses_in_subject(), now gives the caller more control over how results are sorted. There are a lot of other miscellaneous changes included with this commit, but they're all basically for the sake of SRP.
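A toy sketch of the sorting-control idea, trimmed down for illustration; the field names and function body are assumptions:

```python
# Let the caller decide how results are ordered instead of hard-coding a sort.
def course_search(raw_classes: list[dict], sort_key=None, reverse: bool = False) -> list[dict]:
    """Return class records, optionally sorted however the caller chooses."""
    results = list(raw_classes)
    if sort_key is not None:
        results.sort(key=sort_key, reverse=reverse)
    return results


# Example: sort sections by course number, then by section code.
sections = [
    {"subject": "CSCI", "courseNumber": "1200", "sequenceNumber": "02"},
    {"subject": "CSCI", "courseNumber": "1100", "sequenceNumber": "01"},
]
print(course_search(sections, sort_key=lambda c: (c["courseNumber"], c["sequenceNumber"])))
```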
My plan is to have this master function call all of the scraper functions and aggregate the data to create the final JSON that represents all course data for one term.
In the context of this scraper, a "class" refers to a section of a course, while a "course" refers to the overarching course that may have multiple classes. The data returned from SIS is keyed by classes, not courses. The get_course_data() function manipulates and aggregates this data such that the returned structure is keyed by courses instead, with classes as a sub-field of each course.
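A simplified sketch of the re-keying step that get_course_data() performs, with assumed field names:

```python
# Group SIS class (section) records under their parent course.
from collections import defaultdict


def get_course_data(class_records: list[dict]) -> dict[str, dict]:
    """Return a mapping keyed by course, with each course's sections as a sub-field."""
    courses: dict[str, dict] = defaultdict(lambda: {"classes": []})
    for record in class_records:
        key = f"{record['subject']}-{record['courseNumber']}"
        course = courses[key]
        course["title"] = record.get("courseTitle")
        course["classes"].append(record)
    return dict(courses)


sections = [
    {"subject": "CSCI", "courseNumber": "1100", "courseTitle": "Computer Science I", "sequenceNumber": "01"},
    {"subject": "CSCI", "courseNumber": "1100", "courseTitle": "Computer Science I", "sequenceNumber": "02"},
]
print(get_course_data(sections))
```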
This change is part of transitioning to a new scraping/processing approach in which raw data is stored in the JSON as an intermediate, and a big post-processing pipeline will be added later to convert names into codes.
In the time it took to make this commit, I discovered SO many more restriction types that it's getting a little ridiculous trying to match them all. Although at this point, there are hopefully no more possible restriction types. Only the future can tell.
This ensures that names can reliably be converted to codes in the post-processing stage after scraping. For example, there has been a case of a course referencing a prerequisite subject name that existed in a previous semester but not the current one. Thus, caching these code mappings that contain data from all semesters will prevent cases like these.
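A hedged sketch of what such a cross-semester cache could look like; the file path and function name are assumptions:

```python
# Keep a cumulative subject name -> code mapping on disk so that names from past
# semesters can still be resolved during post-processing.
import json
from pathlib import Path

CACHE_PATH = Path(__file__).parent / "data" / "subject_codes.json"


def update_subject_cache(new_mappings: dict[str, str]) -> dict[str, str]:
    """Merge this semester's name -> code pairs into the persistent cache."""
    cache = json.loads(CACHE_PATH.read_text(encoding="utf-8")) if CACHE_PATH.exists() else {}
    cache.update(new_mappings)
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```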
My Windows machine was running into errors trying to encode Unicode characters when writing files; it turns out open() without an explicit encoding falls back to the platform's locale encoding, which on Windows is typically a legacy code page like cp1252 rather than UTF-8 as on Mac.
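The fix boils down to passing an explicit encoding, for example:

```python
# Always pass encoding="utf-8" so output is identical across Windows and Mac/Linux.
import json

data = {"title": "Café, naïve, 数学"}  # sample data with non-ASCII characters

with open("example.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```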
The reason for this change is detailed in the comment added to the code. Basically, SIS data sucks and data for a single restriction can sometimes be split into multiple lines, causing each line to count as a separate restriction.
With the migration of the SIS scraper to its own repository, it is no longer needed in the API service. We plan to run the scraper separately from the API service, with a third database model repository to unify the schema structure used by both apps.
Pull Request Overview
This PR modernizes the core application infrastructure by introducing a centralized logger, dynamic router registration, and improved database lifecycle management. The changes prepare the codebase for better scalability and maintainability after removing the originally planned SIS scraper implementation.
Key Changes
- Implemented FastAPI lifespan context manager for database engine initialization and cleanup
- Added dynamic router scanning and registration to eliminate manual router inclusion
- Centralized logging configuration and made key dependencies available at package level
Reviewed Changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| app/main.py | Replaced manual router registration with dynamic scanning function and added lifespan management |
| app/dependencies.py | Converted database initialization to async lifespan pattern with proper cleanup |
| app/__init__.py | Exported core dependencies and created centralized logger instance |
| app/routers/course.py | Reorganized imports to use package-level exports |
| app/db_models/professor.py | Alphabetized SQLAlchemy type imports |
| README.md | Added blank line for formatting |
What?
This pull request introduces some modernizations to the core application infrastructure, including a centralized logger object, a function for integrating API routers in a scalable manner, and a FastAPI lifespan function for managing initialization and cleanup of the MySQL database engine.
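For context, a minimal sketch of the lifespan pattern described here, assuming an async SQLAlchemy engine; the connection URL, driver, and variable names are placeholders rather than the actual implementation:

```python
# Create the database engine on startup and dispose of it on shutdown.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sqlalchemy.ext.asyncio import create_async_engine

DATABASE_URL = "mysql+aiomysql://user:password@localhost/carpi"  # placeholder


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.engine = create_async_engine(DATABASE_URL)
    yield
    await app.state.engine.dispose()


app = FastAPI(lifespan=lifespan)
```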
Why?
This branch was originally created to add a new SIS scraper implementation to the API service, but a major architectural change was made midway through, and the scraper has since been migrated to a different repository in the organization. Thus, only these miscellaneous changes remain.
How?
Development of this branch began with adding a SIS scraper implementation in a "scrapers" directory of the app, a router for manual invocation of the scraper, and background scheduling logic to automatically run the scraper every Monday at 4:00 AM. Then all of that went bye-bye in the last few commits.
Testing?
The app and its endpoints ran fine throughout development of the scraper and still run fine without it. Not much to test, really.
Anything Else?
Have a nice day.