This repository contains source code for a web scraper designed to fetch course data from Rensselaer Polytechnic Institute's new SIS, which replaced the old SIS on September 17th, 2025.
The scraper operates on endpoints within the student registration system, which is accessible through this link without needing authentication.
This scraper is designed to fetch course data from any range of years and academic terms, with Summer 1998 being the earliest available term.
A JSON file is produced for every academic term processed; see sample output JSON data format for format details.
The output data includes the following course information:
- Course Code (e.g. CSCI 1100)
- Name
- Description
- Corequisites
- Prerequisites (with AND, OR relationships) (currently WIP)
- Crosslists
- Attributes
- Restrictions (with restriction types)
- Credit Hours
- Section CRNs
- Section instructors (and RCSIDs)
- Seat data (capacity, registered, and open)
There are two distinct steps that take place within the SIS scraper:
The first step, scraping, involves fetching data from SIS endpoints and writing it to JSON after some basic formatting and sanitization. Code mappings are also generated and written to JSON during this step (e.g. subject code-to-name, restriction code-to-name, and more). The extent of these mappings may vary depending on the range of academic terms the scraper is run on; more terms generally result in more comprehensive mappings. The scraper data and mappings are written to separate directories.
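As a rough illustration of why more terms tend to give more comprehensive mappings, the sketch below merges per-term subject code-to-name dictionaries. The function name `merge_mappings` and the sample data are hypothetical and are not taken from the scraper's source.

```python
def merge_mappings(per_term_mappings):
    """Merge code-to-name mappings from several terms into one dictionary.

    Assumption: when the same code appears in more than one term, the
    later term wins. The real scraper may resolve conflicts differently.
    """
    merged = {}
    for mapping in per_term_mappings:
        merged.update(mapping)
    return merged


# Hypothetical subject mappings collected from two different terms.
fall_subjects = {"CSCI": "Computer Science", "ARCH": "Architecture"}
spring_subjects = {"CSCI": "Computer Science", "PHYS": "Physics"}

subject_map = merge_mappings([fall_subjects, spring_subjects])
print(sorted(subject_map))  # codes seen in either term
```

Neither term alone covers all three subjects, but the merged mapping does.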
The second, optional step, postprocessing, operates on the JSON data output by the scraper and codifies values such as subject names, restriction names, attribute names, and instructor names using the code mappings generated by the scraper. The scraper data is not modified in-place; instead, processed data is written to a new, separate directory.
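A minimal sketch of the codifying idea, assuming a code-to-name mapping is inverted and used to replace a display name with its code. The function `codify_subject` and the record layout are illustrative assumptions; the actual postprocessing logic may differ.

```python
import copy


def codify_subject(subject_record, code_to_name):
    """Return a copy of a subject record with its name replaced by a code.

    The input record is left untouched, mirroring the "not modified
    in-place" behavior described above. Names missing from the mapping
    are kept as-is (an assumption).
    """
    name_to_code = {name: code for code, name in code_to_name.items()}
    processed = copy.deepcopy(subject_record)
    name = processed["subject_name"]
    processed["subject_name"] = name_to_code.get(name, name)
    return processed


# Hypothetical mapping and scraped record.
subject_map = {"CSCI": "Computer Science", "ARCH": "Architecture"}
scraped = {"subject_name": "Computer Science", "courses": {}}

processed = codify_subject(scraped, subject_map)
print(processed["subject_name"], "|", scraped["subject_name"])
```

Working on a deep copy keeps the scraped data reusable, so the postprocess step can be re-run without scraping again.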
RCSIDs are also dynamically generated for instructors who do not have an RCSID in the scraped data (i.e. they are not RPI faculty). These generated RCSIDs are checked against existing ones in the instructor RCSID-to-name map generated by the scraper and follow the same naming convention, using numbers to ensure unique IDs. All instances of instructors with the same name and no RCSID are treated as one person. The generated RCSIDs are written to a JSON file separate from the one containing real RCSIDs.
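The sketch below shows one way such IDs could be generated. The naming convention used here (up to five letters of the last name, the first initial, and a numeric suffix on collision) is an assumption made for illustration, as are the function names; the scraper's actual rules are not specified in this document.

```python
import re


def generate_rcsid(name, taken):
    """Build a placeholder ID from a "Last, First" name.

    Assumed convention: up to five letters of the last name plus the
    first initial, with a number appended if the base ID is already taken.
    """
    last, _, first = name.partition(",")
    last = re.sub(r"[^a-z]", "", last.lower())
    first = re.sub(r"[^a-z]", "", first.lower())
    base = last[:5] + first[:1]
    candidate, suffix = base, 2
    while candidate in taken:
        candidate = f"{base}{suffix}"
        suffix += 1
    return candidate


def assign_generated_rcsids(names, real_ids):
    """Map each distinct name to one generated ID, avoiding real IDs."""
    generated = {}
    taken = set(real_ids)
    for name in names:
        if name in generated:
            continue  # same name without an RCSID is treated as one person
        new_id = generate_rcsid(name, taken)
        generated[name] = new_id
        taken.add(new_id)
    return generated


# Hypothetical real RCSID-to-name map and instructors lacking an RCSID.
real_ids = {"doej": "Doe, John", "janem": "Jane, Mary"}
names = ["Huffman, Stanton", "Doe, Jane", "Huffman, Stanton"]

generated = assign_generated_rcsids(names, real_ids)
print(generated)
```

Here "Doe, Jane" collides with the existing `doej`, so a number is appended, while the repeated "Huffman, Stanton" entry reuses a single generated ID.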
The postprocess step is tailored to the needs of Project CARPI, so the resulting data format may not be desirable for general use.
The scraper requires Python >= 3.11 to run. If you don't have an appropriate version of Python, download and install one from the official website.
Recommended: To avoid cluttering the global Python package space on your computer, you can optionally create a virtual environment in the project root and activate it in order to keep packages scoped to this project only.
Windows

    # Create a virtual environment in the current directory
    python.exe -m venv .venv
    # Enter/activate the environment
    .venv\Scripts\activate

MacOS/Unix

    # Create a virtual environment in the current directory
    python3 -m venv .venv
    # Enter/activate the environment
    source .venv/bin/activate

You can tell that you are in the virtual environment when a prefix appears in your command prompt.

    # Note the (.venv)
    (.venv) raymond@Macbook-Pro sis_scraper %

To exit/deactivate the virtual environment, simply run the deactivate command.

    (.venv) raymond@Macbook-Pro sis_scraper % deactivate
    raymond@Macbook-Pro sis_scraper %

From the project root, run the command below to install the required dependencies.

    # Install all packages listed in requirements.txt
    pip install -r requirements.txt

Create a new file in the sis_scraper directory named .env. An example.env file has been provided in the directory for reference; you may simply copy-paste the contents to use default values.
The file contains variables for configuring output directories and code mapping filenames. You may optionally edit them to your liking.
Run main.py using one of the commands below. Optional flags --scrape-only and --postprocess-only may be specified to disable a step of the scraper.

Windows

    # Navigate to the sis_scraper directory
    cd sis_scraper
    # Run the scraper from start_year to end_year
    python.exe main.py [--scrape-only | --postprocess-only] start_year end_year

MacOS/Unix

    # Navigate to the sis_scraper directory
    cd sis_scraper
    # Run the scraper from start_year to end_year
    python3 main.py [--scrape-only | --postprocess-only] start_year end_year

Once the scraper is running, logs will be displayed in the console as well as written to disk. The log directory by default is located in the directory of the main script.
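For readers curious how a command line of this shape can be handled, here is a minimal `argparse` sketch. It assumes the two flags are mutually exclusive, as the `[--scrape-only | --postprocess-only]` notation suggests; the `build_parser` function and help texts are illustrative and may not match main.py.

```python
import argparse


def build_parser():
    """Build a parser for: main.py [--scrape-only | --postprocess-only] start_year end_year."""
    parser = argparse.ArgumentParser(prog="main.py")
    # At most one of the two flags may be given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--scrape-only", action="store_true",
                       help="run the scraping step only")
    group.add_argument("--postprocess-only", action="store_true",
                       help="run the postprocessing step only")
    parser.add_argument("start_year", type=int, help="first year to process")
    parser.add_argument("end_year", type=int, help="last year to process")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--scrape-only", "1998", "2025"])
print(args.scrape_only, args.postprocess_only, args.start_year, args.end_year)
```

Passing both flags at once would make `argparse` exit with a usage error, which matches the either-or notation in the commands above.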
Running time of the scraper may vary heavily depending on the number of terms being scraped as well as external server or network factors that are not within the scraper's control.
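Logging to both the console and a file can be set up with Python's standard `logging` module. The sketch below is one possible arrangement; the logger name, log file name, and `configure_logging` function are hypothetical and are not taken from the scraper's source.

```python
import logging
import tempfile
from pathlib import Path


def configure_logging(log_dir):
    """Attach a console handler and a file handler to one logger."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("sis_scraper_example")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.StreamHandler(),  # console output
        logging.FileHandler(log_dir / "scraper.log", encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# A temporary directory stands in for the real log directory.
log_dir = Path(tempfile.mkdtemp()) / "logs"
logger = configure_logging(log_dir)
logger.info("Scraping term %s", "Fall 2025")

log_text = (log_dir / "scraper.log").read_text(encoding="utf-8")
```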
The output shown below does not accurately reflect real data; it is just meant to show the structure and format of the data. Note that the shown format is after the scraping step only (i.e. no postprocessing).

    {
      "CSCI": {
        "subject_name": "Computer Science",
        "courses": {
          "CSCI 1100": {
            "course_name": "Computer Science I",
            "course_detail": {
              "description": "An introduction to computer programming algorithm design and analysis.",
              "corequisite": [
                "Earth & Environmental Science 1100",
                "Physics 1101"
              ],
              "prerequisite": {},
              "crosslist": [
                "Architecture 5100",
                "Electrical & Comp. Sys. Engr. 4480"
              ],
              "attributes": [
                "Data Intensive I DI1",
                "Introductory Level Course FRSH"
              ],
              "restrictions": {
                "campus": [
                  "Troy (T)"
                ],
                "not_campus": [],
                "classification": [
                  "Freshman (FR)",
                  "Sophomore (SO)"
                ],
                "not_classification": [],
                "college": [
                  "School of Science (S)"
                ],
                "not_college": [],
                "degree": [
                  "Doctor of Philosophy (PHD)"
                ],
                "not_degree": [],
                "department": [
                  "Architecture (ARCH)"
                ],
                "not_department": [],
                "level": [
                  "Graduate"
                ],
                "not_level": [],
                "major": [
                  "Computer Science (CSCI)",
                  "Computer & Systems Engineering (CSYS)"
                ],
                "not_major": [],
                "minor": [
                  "Electronic Arts (EART)"
                ],
                "not_minor": [],
                "special_approval": [
                  "Instructor's Approval"
                ]
              },
              "credits": {
                "min": 4,
                "max": 0
              },
              "sections": [
                {
                  "CRN": "61891",
                  "instructor": [
                    "Doe, John (doej)",
                    "Jane, Mary (janem)",
                    "Huffman, Stanton (Unknown RCSID)"
                  ],
                  "capacity": 30,
                  "registered": 29,
                  "open": 1
                }
              ]
            }
          }
        }
      },
      "ENGR": {
        "subject_name": "Core Engineering",
        "courses": {}
      }
    }
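To show how the scraped format might be consumed, the sketch below walks a trimmed-down version of the sample and lists sections that still have open seats. It assumes the top level of each term file is a JSON object keyed by subject code; the helper `open_sections` is hypothetical, and the inline string stands in for a real output file.

```python
import json

# Trimmed-down stand-in for a scraped term file (not real data).
sample = """
{
  "CSCI": {
    "subject_name": "Computer Science",
    "courses": {
      "CSCI 1100": {
        "course_name": "Computer Science I",
        "course_detail": {
          "sections": [
            {"CRN": "61891", "capacity": 30, "registered": 29, "open": 1}
          ]
        }
      }
    }
  },
  "ENGR": {"subject_name": "Core Engineering", "courses": {}}
}
"""


def open_sections(term_data):
    """Yield (course_code, CRN, open_seats) for sections with open seats."""
    for subject in term_data.values():
        for code, course in subject["courses"].items():
            for section in course["course_detail"].get("sections", []):
                if section["open"] > 0:
                    yield code, section["CRN"], section["open"]


term_data = json.loads(sample)
results = list(open_sections(term_data))
print(results)
```

For a real file, `json.loads(sample)` would be replaced by `json.load` on the opened term file.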