Modernize core application infrastructure #11
Conversation
Literally copy-pasted over from CARPI-Data. The plan is to set up scheduled scraping and automate the process of updating our database with the latest data.
The scraper router is very incomplete; only a dummy thread has been created so far. I plan to use this thread to automatically update the database periodically.
Replaced the threading approach with the higher-level APScheduler library, which is now also in requirements.txt. The scraper will now actually run as well, although it will error out because the "data" directory doesn't exist and the scraper doesn't check for that. It's currently configured to run automatically every Monday at 4:00 AM, New York time.
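For reference, here's a minimal sketch of how a weekly APScheduler job like this can be wired up; the scheduler class, function name, and job id below are illustrative assumptions, not the actual code in this branch:

```python
# Minimal sketch: run a scraping job every Monday at 4:00 AM New York time.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger


def run_sis_scraper() -> None:
    # Placeholder for the real scraping routine.
    print("Scraping SIS data...")


scheduler = BackgroundScheduler(timezone="America/New_York")
scheduler.add_job(
    run_sis_scraper,
    CronTrigger(day_of_week="mon", hour=4, minute=0),
    id="sis_scraper",
    replace_existing=True,
)
scheduler.start()
```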
This endpoint returns a 409 if the scraper is already running, whether because of the scheduled job or because the endpoint was called recently. It also currently waits for the scraper to terminate before returning a response, which may change later.
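A rough sketch of what such an endpoint could look like with a simple in-process flag; the route path, flag name, and `run_scraper` helper are illustrative assumptions:

```python
# Sketch: reject manual scraper runs with 409 Conflict while a run is in progress.
from fastapi import APIRouter, HTTPException

router = APIRouter(prefix="/scraper", tags=["scraper"])
scraper_running = False  # shared flag, also set by the scheduled job


async def run_scraper() -> None:
    # Placeholder for the real scraping routine.
    ...


@router.post("/run")
async def trigger_scraper():
    global scraper_running
    if scraper_running:
        raise HTTPException(status_code=409, detail="Scraper is already running")
    scraper_running = True
    try:
        await run_scraper()  # currently waits for the scraper to finish
    finally:
        scraper_running = False
    return {"status": "completed"}
```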
Didn't realize uvloop isn't compatible with Windows, but the app seems to work fine without it. So I added a comment saying to comment this line out before installing the required packages on Windows.
Also began making use of the app-wide __init__.py by delegating some global function and variable imports to the file.
Before, dependencies.py would search for a .env file starting from the current working directory. This has now been changed so that .env is looked up relative to dependencies.py itself.
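A minimal sketch of the pattern, assuming python-dotenv is in use (the exact loading code in dependencies.py may differ):

```python
# Resolve .env relative to this file rather than the current working directory.
from pathlib import Path

from dotenv import load_dotenv

ENV_PATH = Path(__file__).resolve().parent / ".env"
load_dotenv(dotenv_path=ENV_PATH)
```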
Eliminates the need to import and include API routers one by one in the main file.
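An illustrative sketch of what dynamic router discovery can look like, assuming a `routers` package where each module exposes a `router` attribute; the helper name here is hypothetical:

```python
# Scan app/routers and include every module that defines a `router` attribute.
import importlib
import pkgutil

from fastapi import FastAPI

from app import routers  # package containing one module per API router


def register_routers(app: FastAPI) -> None:
    for module_info in pkgutil.iter_modules(routers.__path__):
        module = importlib.import_module(f"{routers.__name__}.{module_info.name}")
        router = getattr(module, "router", None)
        if router is not None:
            app.include_router(router)
```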
Not sure if this makes any real difference, but I thought it couldn't hurt to include it.
Just to eliminate any confusion about utils.py being a global utils file.
The background job scheduler is now available as a global variable in the main app package. All scraping logic is now delegated to the sis_scraper.py file, including the boolean flag that ensures only one scrape runs at a time. The scraper also now integrates with FastAPI (technically Uvicorn) logging. The recurring scraper job is now registered in the app's main file.
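A hedged sketch of how the lock flag and Uvicorn logger reuse could look inside sis_scraper.py; the names below are assumptions:

```python
# Module-level lock flag plus reuse of Uvicorn's already-configured logger.
import logging

logger = logging.getLogger("uvicorn.error")  # piggyback on Uvicorn's log handlers
_scraper_running = False


async def scrape() -> bool:
    """Run one scrape; return False if a run was already in progress."""
    global _scraper_running
    if _scraper_running:
        logger.warning("Scrape requested while another run is in progress; skipping.")
        return False
    _scraper_running = True
    try:
        logger.info("Starting SIS scrape...")
        # ... actual scraping logic lives here ...
        return True
    finally:
        _scraper_running = False
```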
Before, it would create a "data" folder at the current working directory; this has now been changed to create one in the same directory as the scraper file.
Taken from the CARPI-Data repository, but modified so that it has no external dependencies. Code that is no longer relevant has also been removed entirely from this file.
Coming straight from the CARPI-Data repo.
With the switch to a new service for SIS on 09/17/2025, we decided to rewrite our existing SIS scraper with simpler logic and better maintainability in mind. The old scrapers still work; we just don't really like touching that code. This new scraper IS NOT COMPLETE yet. I discovered a working procedure via Postman for getting all course data, and am in the middle of automating it in Python. A current issue is that the GET request made in get_cookie() does not return a cookie, which is needed to access any course data. This is actively being looked into.
Small restructure of functions and an addition of clarifying docstrings. The issue presented in the last commit turned out to be a result of a gap in my understanding of cookies; a JSESSIONID cookie is assigned on the first call to any SIS page. I was inadvertently calling get_subjects() before attempting to read a cookie from get_cookie(), at which point the aiohttp session would already have a cookie and therefore would not receive another one from the server. In addition, I did not know that the aiohttp session automatically stores cookies and sends them in subsequent requests. Therefore, there is no need to explicitly handle cookie logic.
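A small self-contained sketch of the behavior described, with a placeholder URL standing in for the real SIS endpoints:

```python
# aiohttp's ClientSession keeps a cookie jar, so the JSESSIONID set by the first SIS
# response is automatically re-sent on every subsequent request in the same session.
import asyncio

import aiohttp

SIS_BASE_URL = "https://sis.example.edu"  # placeholder, not the real SIS host


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # First request: the server assigns JSESSIONID, stored in session.cookie_jar.
        async with session.get(f"{SIS_BASE_URL}/some/page") as resp:
            await resp.text()
        # Later requests automatically include the stored cookie; no manual handling needed.
        async with session.get(f"{SIS_BASE_URL}/courseSearch") as resp:
            await resp.text()
        print(len(session.cookie_jar), "cookie(s) stored")


asyncio.run(main())
```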
I'm basically following the Single Responsibility Principle as I write this new SIS scraper. The course_search() method, previously get_courses_in_subject(), now gives the caller more control over how results are sorted. There are a lot of other miscellaneous changes included with this commit, but they're all basically for the sake of SRP.
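A toy sketch of the sorting-control idea, trimmed down for illustration; the field names and function body are assumptions:

```python
# Let the caller decide how results are ordered instead of hard-coding a sort.
def course_search(raw_classes: list[dict], sort_key=None, reverse: bool = False) -> list[dict]:
    """Return class records, optionally sorted however the caller chooses."""
    results = list(raw_classes)
    if sort_key is not None:
        results.sort(key=sort_key, reverse=reverse)
    return results


# Example: sort sections by course number, then by section code.
sections = [
    {"subject": "CSCI", "courseNumber": "1200", "sequenceNumber": "02"},
    {"subject": "CSCI", "courseNumber": "1100", "sequenceNumber": "01"},
]
print(course_search(sections, sort_key=lambda c: (c["courseNumber"], c["sequenceNumber"])))
```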
My plan is to have this master function call all of the scraper functions and aggregate the data to create the final JSON that represents all course data for one term.
In the context of this scraper, a "class" refers to a section of a course, while a "course" refers to the overarching course that may have multiple classes. The data returned from SIS is keyed by classes, not courses. The get_course_data() function manipulates and aggregates this data such that the returned structure is keyed by courses instead, with classes as a sub-field of each course.
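A simplified sketch of the re-keying step that get_course_data() performs, with assumed field names:

```python
# Group SIS class (section) records under their parent course.
from collections import defaultdict


def get_course_data(class_records: list[dict]) -> dict[str, dict]:
    """Return a mapping keyed by course, with each course's sections as a sub-field."""
    courses: dict[str, dict] = defaultdict(lambda: {"classes": []})
    for record in class_records:
        key = f"{record['subject']}-{record['courseNumber']}"
        course = courses[key]
        course["title"] = record.get("courseTitle")
        course["classes"].append(record)
    return dict(courses)


sections = [
    {"subject": "CSCI", "courseNumber": "1100", "courseTitle": "Computer Science I", "sequenceNumber": "01"},
    {"subject": "CSCI", "courseNumber": "1100", "courseTitle": "Computer Science I", "sequenceNumber": "02"},
]
print(get_course_data(sections))
```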
This change is part of transitioning to a new scraping/processing approach in which raw data is stored in the JSON as an intermediate, and a big post-processing pipeline will be added later to convert names into codes.
In the time it took to make this commit, I discovered SO many more restriction types that it's getting a little ridiculous trying to match them all. Although at this point, there are hopefully no more possible restriction types. Only the future can tell.
This ensures that names can reliably be converted to codes in the post-processing stage after scraping. For example, there has been a case of a course referencing a prerequisite subject name that existed in a previous semester but not the current one. Thus, caching these code mappings that contain data from all semesters will prevent cases like these.
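A hedged sketch of what such a cross-semester cache could look like; the file path and function name are assumptions:

```python
# Keep a cumulative subject name -> code mapping on disk so that names from past
# semesters can still be resolved during post-processing.
import json
from pathlib import Path

CACHE_PATH = Path(__file__).parent / "data" / "subject_codes.json"


def update_subject_cache(new_mappings: dict[str, str]) -> dict[str, str]:
    """Merge this semester's name -> code pairs into the persistent cache."""
    cache = json.loads(CACHE_PATH.read_text(encoding="utf-8")) if CACHE_PATH.exists() else {}
    cache.update(new_mappings)
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```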
My Windows machine was running into errors trying to encode Unicode characters when writing files; it turns out open() without an explicit encoding falls back to the platform's locale encoding, which on Windows is typically a legacy code page like cp1252 rather than UTF-8 as on Mac.
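The fix boils down to passing an explicit encoding, for example:

```python
# Always pass encoding="utf-8" so output is identical across Windows and Mac/Linux.
import json

data = {"title": "Café, naïve, 数学"}  # sample data with non-ASCII characters

with open("example.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```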
The reason for this change is detailed in the comment added to the code. Basically, SIS data sucks and data for a single restriction can sometimes be split into multiple lines, causing each line to count as a separate restriction.
With the migration of the SIS scraper to its own repository, it is no longer needed in the API service. We plan to run the scraper separately from the API service, with a third database model repository to unify the schema structure used by both apps.
Pull Request Overview
This PR modernizes the core application infrastructure by introducing a centralized logger, dynamic router registration, and improved database lifecycle management. The changes prepare the codebase for better scalability and maintainability after removing the originally planned SIS scraper implementation.
Key Changes
- Implemented FastAPI lifespan context manager for database engine initialization and cleanup
- Added dynamic router scanning and registration to eliminate manual router inclusion
- Centralized logging configuration and made key dependencies available at package level
Reviewed Changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| app/main.py | Replaced manual router registration with dynamic scanning function and added lifespan management |
| app/dependencies.py | Converted database initialization to async lifespan pattern with proper cleanup |
| app/__init__.py | Exported core dependencies and created centralized logger instance |
| app/routers/course.py | Reorganized imports to use package-level exports |
| app/db_models/professor.py | Alphabetized SQLAlchemy type imports |
| README.md | Added blank line for formatting |
What?
This pull request introduces some modernizations to the core application infrastructure, including a centralized logger object, a function for integrating API routers in a scalable manner, and a FastAPI lifespan function for managing initialization and cleanup of the MySQL database engine.
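For context, a minimal sketch of the lifespan pattern described here, assuming an async SQLAlchemy engine; the connection URL, driver, and variable names are placeholders rather than the actual implementation:

```python
# Create the database engine on startup and dispose of it on shutdown.
from contextlib import asynccontextmanager

from fastapi import FastAPI
from sqlalchemy.ext.asyncio import create_async_engine

DATABASE_URL = "mysql+aiomysql://user:password@localhost/carpi"  # placeholder


@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.engine = create_async_engine(DATABASE_URL)
    yield
    await app.state.engine.dispose()


app = FastAPI(lifespan=lifespan)
```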
Why?
This branch was originally created to add a new SIS scraper implementation to the API service, but a major architectural change was made midway through, and the scraper has since been migrated to a different repository in the organization. Thus, only these miscellaneous changes remain.
How?
Development of this branch began with adding a SIS scraper implementation in a "scrapers" directory of the app, a router for manual invocation of the scraper, and background scheduling logic to automatically run the scraper every Monday at 4:00 AM. Then all of that went bye-bye in the last few commits.
Testing?
The app and its endpoints ran fine throughout development of the scraper and still run fine without it. Not much to test, really.
Anything Else?
Have a nice day.