A Modern Web Scraper for the RPI Student Information System (SIS)

This repository contains source code for a web scraper designed to fetch course data from Rensselaer Polytechnic Institute's new SIS, which replaced the old SIS on September 17th, 2025.

The scraper operates on endpoints within the student registration system, which is accessible through this link without needing authentication.

Data Collected

This scraper is designed to fetch course data from any range of years and academic terms, with Summer 1998 being the earliest available term.

A JSON file is produced for every academic term processed; see the Sample Output JSON Format section for details.

The output data includes the following course information:

Course Code (e.g. CSCI 1100)
Name
Description
Corequisites
~~Prerequisites (with AND, OR relationships)~~ (Currently WIP)
Crosslists
Attributes
Restrictions (with restriction types)
Credit Hours
Section CRNs
Section instructors (and RCSIDs)
Seat data (capacity, registered, and open)

About the SIS Scraper

There are two distinct steps that take place within the SIS scraper:

Scraping Step

This first step fetches data from SIS endpoints and writes it to JSON after some basic formatting and sanitization. Code mappings are also generated and written to JSON during this step (e.g. subject code-to-name, restriction code-to-name, and more). The extent of these mappings may vary depending on the range of academic terms the scraper is run on; more terms generally result in more comprehensive mappings. The scraper data and mappings are written to separate directories.
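
As a rough illustration of why more terms tend to give more comprehensive mappings, the sketch below merges per-term code-to-name pairs into a single map. The function name `merge_code_mappings` and the sample data are hypothetical and are not taken from the scraper's source; the real scraper may build its mappings differently.

```python
def merge_code_mappings(per_term_mappings):
    """Merge code-to-name dicts from several terms into one mapping.

    Hypothetical helper: a later term overwrites the name stored for a code
    that an earlier term already provided.
    """
    merged = {}
    for term_mapping in per_term_mappings:
        merged.update(term_mapping)
    return merged


# Made-up sample data: each dict stands for the subjects seen in one term.
fall = {"CSCI": "Computer Science", "ARCH": "Architecture"}
spring = {"CSCI": "Computer Science", "ENGR": "Core Engineering"}

subject_names = merge_code_mappings([fall, spring])
print(sorted(subject_names))
```

Each term contributes only the codes that appear in it, so a code offered in a single term is present in the merged map only if that term was scraped.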

Postprocess Step

This optional second step operates on the JSON data produced by the scraping step and codifies values such as subject names, restriction names, attribute names, and instructor names using the code mappings generated by the scraper. The scraper data is not modified in place; instead, processed data is written to a new, separate directory.
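
One way such codification could work is sketched below. It is an illustration only, not the project's actual postprocessing: the helper name `codify` is hypothetical, and the sketch assumes that scraped values either end with a parenthesized code (such as "Freshman (FR)") or can be looked up by name in a code-to-name map.

```python
import re

# Matches "Some Name (CODE)", with optional whitespace before the parenthesis.
_NAME_WITH_CODE = re.compile(r"^(?P<name>.*?)\s*\((?P<code>[^()]+)\)$")


def codify(value, code_to_name):
    """Return the code for a scraped value, or the value unchanged if unknown."""
    match = _NAME_WITH_CODE.match(value)
    if match:
        return match.group("code")
    # No trailing code: fall back to a reverse lookup in the mapping.
    name_to_code = {name: code for code, name in code_to_name.items()}
    return name_to_code.get(value, value)


print(codify("Freshman (FR)", {}))
print(codify("Graduate", {"GR": "Graduate"}))
```

Values that match neither form are returned unchanged, so nothing is dropped when a mapping is incomplete.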

RCSIDs are also generated dynamically for instructors who do not have an RCSID in the scraped data (i.e. they are not RPI faculty). These generated RCSIDs are checked against the existing ones in the instructor RCSID-to-name map produced by the scraper and follow the same naming convention, using numbers to keep IDs unique. All instructors who share a name and lack an RCSID are treated as one person. The generated RCSIDs are written to a JSON file separate from the one containing real RCSIDs.
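
The generation logic described above might look roughly like the sketch below. The naming convention used here (up to five letters of the last name, then the first initial, then a number on collision) is an assumption inferred from the sample IDs such as doej and janem; it is not confirmed by the source. The function names are hypothetical.

```python
def generate_rcsid(full_name, taken):
    """Generate a unique placeholder RCSID for a "Last, First" name.

    Assumed convention (not confirmed by the source): up to five letters of
    the last name plus the first initial, with a number added on collision.
    """
    last, _, first = full_name.partition(",")
    last_letters = "".join(ch for ch in last if ch.isalpha()).lower()
    first_letters = "".join(ch for ch in first if ch.isalpha()).lower()
    base = last_letters[:5] + first_letters[:1]
    candidate = base
    suffix = 2
    while candidate in taken:
        candidate = f"{base}{suffix}"
        suffix += 1
    taken.add(candidate)
    return candidate


def assign_generated_rcsids(names, existing_rcsids):
    """Return a generated RCSID-to-name map for instructors lacking an RCSID."""
    taken = set(existing_rcsids)
    by_name = {}
    for name in names:
        # Instructors with the same name are treated as one person.
        if name not in by_name:
            by_name[name] = generate_rcsid(name, taken)
    return {rcsid: name for name, rcsid in by_name.items()}


existing = {"doej": "Doe, John"}
names = ["Huffman, Stanton", "Doe, Jane", "Huffman, Stanton"]
print(assign_generated_rcsids(names, existing))
```

Checking candidates against both the real and the previously generated IDs keeps the two sets from colliding even though they are stored in separate files.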

The postprocess step is tailored to the needs of Project CARPI, so the resulting data format may not be desirable for general use.

Running the SIS Scraper

1. Installing Required Dependencies

The scraper requires Python >= 3.11 to run. If you don't have an appropriate version of Python, download and install one from the official website.

Recommended: To avoid cluttering the global Python package space on your computer, create a virtual environment in the project root and activate it so that packages stay scoped to this project only.

Windows

# Create a virtual environment in the current directory
python.exe -m venv .venv
# Enter/activate the environment
.venv\Scripts\activate

MacOS/Unix

# Create a virtual environment in the current directory
python3 -m venv .venv
# Enter/activate the environment
source .venv/bin/activate

When the virtual environment is active, a (.venv) prefix appears in your command prompt.

# Note the (.venv)
(.venv) raymond@Macbook-Pro sis_scraper %

To exit/deactivate the virtual environment, run the deactivate command.

(.venv) raymond@Macbook-Pro sis_scraper % deactivate
raymond@Macbook-Pro sis_scraper %

From the project root, run the command below to install the required dependencies.

# Install all packages listed in requirements.txt
pip install -r requirements.txt

2. Creating the .env File

Create a new file named .env in the sis_scraper directory. An example.env file is provided in the same directory for reference; to use the default values, copy its contents into .env.

The file contains variables for configuring output directories and code mapping filenames. You may optionally edit them to your liking.

3. Running the Script

Run main.py using one of the commands below. The optional flags --scrape-only and --postprocess-only are mutually exclusive; each runs only the named step and skips the other.

Windows

# Navigate to the sis_scraper directory
cd sis_scraper
# Run the scraper from start_year to end_year
python.exe main.py [--scrape-only | --postprocess-only] start_year end_year

MacOS/Unix

# Navigate to the sis_scraper directory
cd sis_scraper
# Run the scraper from start_year to end_year
python3 main.py [--scrape-only | --postprocess-only] start_year end_year
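
The command-line shape shown above can be reproduced with Python's standard argparse module, as in the sketch below. Whether main.py actually uses argparse is an assumption; the sketch only mirrors the documented usage line, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage line documented above."""
    parser = argparse.ArgumentParser(prog="main.py")
    # The two flags cannot be combined, matching [--scrape-only | --postprocess-only].
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--scrape-only", action="store_true")
    group.add_argument("--postprocess-only", action="store_true")
    parser.add_argument("start_year", type=int)
    parser.add_argument("end_year", type=int)
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--scrape-only", "2024", "2025"])
print(args.scrape_only, args.postprocess_only, args.start_year, args.end_year)
```

With a mutually exclusive group, passing both flags at once makes argparse exit with a usage error instead of silently choosing one.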

Once the scraper is running, logs are displayed in the console and written to disk. By default, the log directory is located in the same directory as the main script.

The scraper's running time can vary widely depending on the number of terms being scraped and on external server or network factors outside the scraper's control.

Sample Output JSON Format

The output shown below does not accurately reflect real data; it is just meant to show the structure and format of the data. Note that the shown format is after the scraping step only (i.e. no postprocessing).

{
    "CSCI": {
        "subject_name": "Computer Science",
        "courses": {
            "CSCI 1100": {
                "course_name": "Computer Science I",
                "course_detail": {
                    "description": "An introduction to computer programming algorithm design and analysis.",
                    "corequisite": [
                        "Earth & Environmental Science 1100",
                        "Physics 1101"
                    ],
                    "prerequisite": {},
                    "crosslist": [
                        "Architecture 5100",
                        "Electrical & Comp. Sys. Engr. 4480"
                    ],
                    "attributes": [
                        "Data Intensive I  DI1",
                        "Introductory Level Course  FRSH"
                    ],
                    "restrictions": {
                        "campus": [
                            "Troy (T)"
                        ],
                        "not_campus": [],
                        "classification": [
                            "Freshman (FR)",
                            "Sophomore (SO)"
                        ],
                        "not_classification": [],
                        "college": [
                            "School of Science (S)"
                        ],
                        "not_college": [],
                        "degree": [
                            "Doctor of Philosophy (PHD)"
                        ],
                        "not_degree": [],
                        "department": [
                            "Architecture (ARCH)"
                        ],
                        "not_department": [],
                        "level": [
                            "Graduate"
                        ],
                        "not_level": [],
                        "major": [
                            "Computer Science (CSCI)",
                            "Computer & Systems Engineering (CSYS)"
                        ],
                        "not_major": [],
                        "minor": [
                            "Electronic Arts (EART)"
                        ],
                        "not_minor": [],
                        "special_approval": [
                            "Instructor's Approval"
                        ]
                    },
                    "credits": {
                        "min": 4,
                        "max": 0
                    },
                    "sections": [
                        {
                            "CRN": "61891",
                            "instructor": [
                                "Doe, John (doej)",
                                "Jane, Mary (janem)",
                                "Huffman, Stanton (Unknown RCSID)"
                            ],
                            "capacity": 30,
                            "registered": 29,
                            "open": 1
                        }
                    ]
                }
            }
        }
    },
    "ENGR": {
        "subject_name": "Core Engineering",
        "courses": {}
    }
}
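
A minimal sketch of consuming this output is shown below. It assumes the top level of a term file is a JSON object keyed by subject code and uses a trimmed, made-up sample in place of a real file; the function name `open_seats_by_course` is hypothetical.

```python
import json

# Trimmed, made-up sample with the same nesting as the format shown above.
sample = """
{
    "CSCI": {
        "subject_name": "Computer Science",
        "courses": {
            "CSCI 1100": {
                "course_name": "Computer Science I",
                "course_detail": {
                    "sections": [
                        {"CRN": "61891", "capacity": 30, "registered": 29, "open": 1}
                    ]
                }
            }
        }
    },
    "ENGR": {"subject_name": "Core Engineering", "courses": {}}
}
"""


def open_seats_by_course(term_data):
    """Sum the open seats across all sections of every course."""
    totals = {}
    for subject in term_data.values():
        for code, course in subject.get("courses", {}).items():
            sections = course.get("course_detail", {}).get("sections", [])
            totals[code] = sum(section.get("open", 0) for section in sections)
    return totals


term_data = json.loads(sample)
print(open_seats_by_course(term_data))
```

To read a real term file, replace json.loads(sample) with json.load on an opened file from the scraper's output directory.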
