NBA Game Outcome Predictor 🏀

This project predicts the outcome of upcoming NBA games using sequential feature selection on a RidgeClassifier model that is trained on historical NBA game data.

⚙️ Overview:

Data Collection:
get_data.py scrapes NBA games from basketball-reference.com using Playwright. NOTE: Currently the 'data/standings' directory in this repository has games from the beginning of the 2017/2018 season to March 28, 2025.
Data Parsing:
parse_data.py converts raw HTML box scores into a pandas dataframe that is saved as a CSV file. Running this file gives you a CSV file titled "nba_games.csv" which contains game data from the last 7 seasons. NOTE: The repository already contains a "nba_games.csv" file which has data from the beginning of the 2017/2018 season to March 28, 2025.
Model Training:
train_model.py:
- Cleans the dataset (nba_games.csv)
- Adds rolling averages, feature selection, and target columns
- Trains a Ridge Classifier using SequentialFeatureSelector
- Saves:
  - sfs_model.pkl – the trained selector with model
  - predictors.pkl – the chosen feature columns
  - latest_full.csv – the final dataset that is used for prediction
- NOTE: latest_full.csv.zip is in this repository. Unzipping it gives you the final dataset that is used for prediction. To obtain a new version of this file that includes the latest NBA games, run get_data.py, then parse_data.py, and then train_model.py.
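
As a rough illustration of the training steps listed above, here is a minimal sketch on synthetic data. It is not the code in train_model.py: the column names (pts, ast, trb, won), the 10-game window, the number of selected features, and the use of TimeSeriesSplit are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for nba_games.csv (hypothetical columns and values).
rng = np.random.default_rng(0)
n_games = 120
games = pd.DataFrame({
    "team": np.tile(["IND", "NYK"], n_games // 2),
    "date": pd.date_range("2024-10-22", periods=n_games, freq="D"),
    "pts": rng.normal(112, 10, n_games),
    "ast": rng.normal(26, 4, n_games),
    "trb": rng.normal(44, 5, n_games),
    "won": rng.integers(0, 2, n_games),
})

stat_cols = ["pts", "ast", "trb"]

# Rolling averages over each team's previous 10 games.
# shift(1) keeps the current game out of its own average.
rolling = (
    games.groupby("team")[stat_cols]
    .transform(lambda s: s.shift(1).rolling(10).mean())
    .add_suffix("_10")
)

full = pd.concat([games, rolling], axis=1)

# Target (assumed definition): did the team win its NEXT game?
full["target"] = full.groupby("team")["won"].shift(-1)

predictors = stat_cols + list(rolling.columns)
train = full.dropna(subset=predictors + ["target"])

# Backward selection: start from all predictors and drop the least useful.
sfs = SequentialFeatureSelector(
    RidgeClassifier(alpha=1.0),
    n_features_to_select=3,          # assumed value for the example
    direction="backward",
    cv=TimeSeriesSplit(n_splits=3),  # folds respect chronological order
)
sfs.fit(train[predictors], train["target"].astype(int))

selected = [col for col, keep in zip(predictors, sfs.get_support()) if keep]
print(selected)
```

With random labels the selected columns carry no meaning; the point is only the shape of the pipeline: rolling features, a next-game target, then feature selection around a RidgeClassifier.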
Prediction:
predict.py:
- Loads saved model and predictors
- Predicts results for upcoming games
- Displays each team’s expected outcome and matchup
- NOTE: The repository already contains the saved predictors (predictors.pkl) and a saved model (sfs_model.pkl) that is trained on game data from the beginning of the 2017/2018 NBA season to March 28, 2025 using backward feature selection. You can therefore run predict.py immediately after cloning this repository and unzipping latest_full.csv.zip. Make sure the unzipped file is named latest_full.csv.
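
The load-and-predict flow described above might look roughly like the sketch below. It is an assumption-heavy illustration, not the contents of predict.py: the feature names are hypothetical, a toy model is pickled to a temporary directory instead of reading sfs_model.pkl, and the probability-like score is derived by passing the classifier's decision_function through a logistic function, because RidgeClassifier has no predict_proba. The README does not say how the project computes its probability column.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeClassifier

# Train a toy model on synthetic data (hypothetical feature names).
rng = np.random.default_rng(1)
predictors = ["pts_10", "ast_10"]
history = pd.DataFrame(rng.normal(size=(200, 2)), columns=predictors)
target = (history["pts_10"] + 0.5 * rng.normal(size=200) > 0).astype(int)
model = RidgeClassifier(alpha=1.0).fit(history, target)

# Save and reload the model and predictor list, mimicking the .pkl files.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    predictors_path = Path(tmp) / "predictors.pkl"
    model_path.write_bytes(pickle.dumps(model))
    predictors_path.write_bytes(pickle.dumps(predictors))

    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_predictors = pickle.loads(predictors_path.read_bytes())

# One row of features per team with an upcoming game (made-up values).
upcoming = pd.DataFrame(
    {"pts_10": [0.8, -0.6], "ast_10": [0.1, 0.3]},
    index=["IND", "NYK"],
)
X = upcoming[loaded_predictors]

prediction = loaded_model.predict(X)        # 1 = win, 0 = loss
margin = loaded_model.decision_function(X)  # signed distance from the boundary
score = 1.0 / (1.0 + np.exp(-margin))       # assumed squashing into (0, 1)

print(pd.DataFrame({"prediction": prediction, "score": score}, index=upcoming.index))
```

The squashed score is not a calibrated probability; it only preserves the ordering and sign of the classifier's margin.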

🧪 Example Output (of a prediction):

home team	opponent	prediction	probability (of the home team winning)	date
IND	NYK	1 (win)	0.54444	2025-05-29

The output shows a prediction for each of the 30 teams' next games, so the table has 15 rows if every team has an upcoming game in the current season. A prediction is displayed only for teams that have a next game scheduled, so during the playoffs only the matchups between teams that are still playing are shown.

Testing Accuracy: testing_accuracy.py:

Tests the model's accuracy on historical data
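
One common way to test accuracy on historical data is a walk-forward backtest: train on earlier seasons, predict the next one, and repeat. The sketch below shows that idea on synthetic data. Whether testing_accuracy.py works this way is an assumption, and the backtest helper, the season column, and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score


def backtest(data, model, predictors, start=2):
    """Train on all seasons before season i, predict season i, and repeat."""
    seasons = sorted(data["season"].unique())
    chunks = []
    for i in range(start, len(seasons)):
        train = data[data["season"] < seasons[i]]
        test = data[data["season"] == seasons[i]]
        model.fit(train[predictors], train["target"])
        preds = pd.Series(model.predict(test[predictors]), index=test.index)
        chunks.append(pd.DataFrame({"actual": test["target"], "prediction": preds}))
    return pd.concat(chunks)


# Synthetic data: four seasons of 100 games with one informative feature.
rng = np.random.default_rng(2)
n = 400
data = pd.DataFrame({
    "season": np.repeat([2022, 2023, 2024, 2025], n // 4),
    "pts_10": rng.normal(size=n),
    "ast_10": rng.normal(size=n),
})
data["target"] = (data["pts_10"] + 0.3 * rng.normal(size=n) > 0).astype(int)

results = backtest(data, RidgeClassifier(alpha=1.0), ["pts_10", "ast_10"])
accuracy = accuracy_score(results["actual"], results["prediction"])
print(f"walk-forward accuracy: {accuracy:.3f}")
```

Because each test season is predicted using only earlier seasons, the score is not inflated by information from the future.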

Installing Dependencies:
- pip install -r requirements.txt

⚙️ How it Works:

Run get_data.py to obtain the historical game data required to train the RidgeClassifier model. Currently the data/scores directory has games from the beginning of the 2017/2018 season to March 28, 2025. If you would like to update this directory with data from the latest games, rerun get_data.py.
Run parse_data.py to transform the data within the 'data' directory into a pandas dataframe that can be used to train the RidgeClassifier model. Currently nba_games.csv has games from the beginning of the 2017/2018 season to March 28, 2025. If you would like an updated version of this file (including data from the latest games), rerun parse_data.py.
Run train_model.py to create and train the model using the pandas dataframe created by parse_data.py. Currently the saved model (sfs_model.pkl) is trained on all games from the beginning of the 2017/2018 season to March 28, 2025. To train your model on the most recently played games as well, you have to rerun get_data.py followed by parse_data.py, and then run train_model.py.
(RECOMMENDED) Go to https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json, copy everything (Ctrl + A -> Ctrl + C on Windows, or Command + A -> Command + C on Mac), create a file called scheduleLeagueV2.json, paste what you copied, and store the file in the same directory as predict.py. This way, even if the API call fails, the NBA schedule is still available to determine each team's next game.
Run predict.py to obtain the predictions for each team's next game.
(Optional) Run testing_accuracy.py to test the accuracy of the model on historical data.
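
The recommended schedule step in the list above amounts to a "try the network, fall back to a local copy" pattern. A minimal sketch using only the standard library follows. The load_schedule function and its parameters are hypothetical, and predict.py may fetch or cache the schedule differently.

```python
import json
import urllib.request
from pathlib import Path

SCHEDULE_URL = "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json"


def load_schedule(url=SCHEDULE_URL, local_path="scheduleLeagueV2.json", timeout=10):
    """Return the schedule JSON from the URL, or from a local file if that fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError):
        # OSError covers network failures and timeouts (URLError is a subclass);
        # ValueError covers a response body that is not valid JSON.
        return json.loads(Path(local_path).read_text(encoding="utf-8"))
```

Calling load_schedule() with no arguments would try the NBA URL first and read scheduleLeagueV2.json from the current directory only if the request fails.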
