This project predicts the outcome of upcoming NBA games using sequential feature selection on a RidgeClassifier model that is trained on historical NBA game data.
- Data Collection:
  `get_data.py` scrapes NBA games from basketball-reference.com using Playwright. NOTE: The `data/standings` directory in this repository currently has games from the beginning of the 2017/2018 season to March 28, 2025.
- Data Parsing:
  `parse_data.py` converts raw HTML box scores into a pandas DataFrame that is saved as a CSV file. Running this file produces a CSV file named `nba_games.csv`, which contains game data from the last 7 seasons. NOTE: The repository already contains an `nba_games.csv` file with data from the beginning of the 2017/2018 season to March 28, 2025.
- Model Training:
  `train_model.py`:
  - Cleans the dataset (`nba_games.csv`)
  - Adds rolling averages, feature selection, and target columns
  - Trains a Ridge Classifier using `SequentialFeatureSelector`
  - Saves:
    - `sfs_model.pkl` – the trained selector with model
    - `predictors.pkl` – the chosen feature columns
    - `latest_full.csv` – the final dataset that is used for prediction
  - NOTE: `latest_full.csv.zip` is in this repository; unzipping it gives you the final dataset that is used for prediction. To obtain a new version of this file that contains data from the latest NBA games, run `get_data.py`, then `parse_data.py`, and then `train_model.py`.
- Prediction:
  `predict.py`:
  - Loads saved model and predictors
  - Predicts results for upcoming games
  - Displays each team's expected outcome and matchup
  - NOTE: The repository already contains a file of saved predictors (`predictors.pkl`) and a saved model trained on game data from the beginning of the 2017/2018 NBA season to March 28, 2025 using backward feature selection. You can therefore run `predict.py` immediately after cloning this repository and unzipping `latest_full.csv.zip`. Make sure the unzipped file is named `latest_full.csv`; rename it if it is not.
| home team | opponent | prediction | probability (of the home team winning) | date |
|---|---|---|---|---|
| IND | NYK | 1 (win) | 0.54444 | 2025-05-29 |
`predict.py` displays a prediction for each of the 30 teams' next games, so the table has 15 rows when every team still has a game to play in the current season. A prediction is shown only for teams that have a next game, so during the playoffs the table lists only the matchups between teams that are still playing.
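As a rough illustration of how a row like the one in the table could be produced, the sketch below fits a toy `RidgeClassifier` and formats its output. Everything in it is an assumption rather than the code in `predict.py`: the feature names (`pts_roll10`, `ast_roll10`), the helper `predict_matchups`, and the training data are hypothetical. In particular, `RidgeClassifier` has no `predict_proba`, and the source does not say how the probability column is derived; this sketch passes `decision_function` through a logistic sigmoid, which yields a score between 0 and 1 but not a calibrated probability.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeClassifier


def predict_matchups(model, predictors, upcoming):
    """Build a prediction table for upcoming games (hypothetical helper)."""
    features = upcoming[predictors]
    scores = model.decision_function(features)
    # Assumption: squash the ridge decision score into (0, 1) with a sigmoid.
    probability = 1.0 / (1.0 + np.exp(-scores))
    return pd.DataFrame({
        "home team": upcoming["team"].to_numpy(),
        "opponent": upcoming["opponent"].to_numpy(),
        "prediction": model.predict(features),
        "probability": probability.round(5),
        "date": upcoming["date"].to_numpy(),
    })


# Toy training data standing in for the real rolling-average features.
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "pts_roll10": rng.normal(110, 10, 200),
    "ast_roll10": rng.normal(25, 4, 200),
})
train["won"] = (train["pts_roll10"] > 110).astype(int)

predictors = ["pts_roll10", "ast_roll10"]
model = RidgeClassifier().fit(train[predictors], train["won"])

# One upcoming game with made-up feature values.
upcoming = pd.DataFrame({
    "team": ["IND"],
    "opponent": ["NYK"],
    "date": ["2025-05-29"],
    "pts_roll10": [118.0],
    "ast_roll10": [27.0],
})
table = predict_matchups(model, predictors, upcoming)
print(table.to_string(index=False))
```

With a binary target, a sigmoid score above 0.5 corresponds to a predicted win (`1`), which matches how the prediction and probability columns relate in the example row.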
- Testing Accuracy:
  `testing_accuracy.py`:
  - Tests the model's accuracy on historical data
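The training and accuracy-testing steps described above can be sketched roughly as follows. This is a minimal sketch on synthetic data, not the contents of `train_model.py` or `testing_accuracy.py`: the column names (`pts`, `ast`, `trb`, `won`), the 10-game window, the choice of the current game's result as the target, the number of selected features, and the 80/20 chronological split are all assumptions made for illustration. Only the use of rolling averages, a Ridge classifier, and backward `SequentialFeatureSelector` comes from the description.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for nba_games.csv (hypothetical columns, random values).
rng = np.random.default_rng(0)
n_games = 120
frames = [
    pd.DataFrame({
        "team": team,
        "date": pd.date_range("2024-10-01", periods=n_games, freq="2D"),
        "pts": rng.normal(110, 10, n_games),
        "ast": rng.normal(25, 4, n_games),
        "trb": rng.normal(44, 5, n_games),
        "won": rng.integers(0, 2, n_games),
    })
    for team in ["IND", "NYK"]
]
games = pd.concat(frames, ignore_index=True)
games = games.sort_values(["date", "team"]).reset_index(drop=True)

# Rolling averages per team, shifted by one game so that each row only
# sees statistics from games played before it.
stat_cols = ["pts", "ast", "trb"]
rolling = (
    games.groupby("team")[stat_cols]
    .transform(lambda s: s.shift(1).rolling(10).mean())
    .add_suffix("_roll10")
)
data = pd.concat([games, rolling], axis=1).dropna().reset_index(drop=True)

candidates = [f"{c}_roll10" for c in stat_cols]
X, y = data[candidates], data["won"]

# Chronological split: the earliest 80% of rows train, the rest test.
split = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Backward selection: start from all candidates and drop features one at a
# time, scoring each step with time-ordered cross-validation.
sfs = SequentialFeatureSelector(
    RidgeClassifier(alpha=1.0),
    n_features_to_select=2,
    direction="backward",
    cv=TimeSeriesSplit(n_splits=3),
)
sfs.fit(X_train, y_train)
predictors = list(X_train.columns[sfs.get_support()])

# Refit on the chosen columns and measure accuracy on the held-out games.
model = RidgeClassifier(alpha=1.0).fit(X_train[predictors], y_train)
accuracy = accuracy_score(y_test, model.predict(X_test[predictors]))
print("chosen predictors:", predictors)
print(f"holdout accuracy: {accuracy:.3f}")
```

Because the labels here are random, the reported accuracy is only a smoke test of the pipeline. Splitting by time rather than at random keeps later games out of the training set, which matters when the goal is to predict upcoming games.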
- Installing Dependencies:
  - `pip install -r requirements.txt`
- Run `get_data.py` to obtain the historical game data required to train the RidgeClassifier model. Currently `data/scores` has games from the beginning of the 2017/2018 season to March 28, 2025. To update this directory with data from the latest games, run `get_data.py`.
- Run `parse_data.py` to transform the data within the `data` directory into a pandas DataFrame that can be used to train the RidgeClassifier model. Currently `nba_games.csv` has games from the beginning of the 2017/2018 season to March 28, 2025. To get an updated version of this file that includes data from the latest games, run `parse_data.py`.
- Run `train_model.py` to create and train the model using the pandas DataFrame created by `parse_data.py`. Currently the saved model (`sfs_model.pkl`) is trained on all games from the beginning of the 2017/2018 season to March 28, 2025. To train your model on the most recently played games as well, rerun `get_data.py`, then `parse_data.py`, and then run `train_model.py`.
- (RECOMMENDED) Go to `https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json`, copy everything (Ctrl + A -> Ctrl + C on Windows, or Command + A -> Command + C on Mac), create a file called `scheduleLeagueV2.json`, paste what you copied, and store the file in the same directory as `predict.py`. This way, even if the API call fails, the NBA schedule is still available to determine each team's next game.
- Run `predict.py` to obtain the predictions for each team's next game.
- (Optional) Run `testing_accuracy.py` to test the accuracy of the model on historical data.
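The reason for saving `scheduleLeagueV2.json` next to `predict.py` can be illustrated with a small fallback loader. This is a sketch of the general pattern, not the code in `predict.py`: the function names `fetch_schedule` and `load_schedule`, the timeout, and the set of exceptions treated as a failed API call are assumptions.

```python
import json
import urllib.request
from pathlib import Path

SCHEDULE_URL = "https://cdn.nba.com/static/json/staticData/scheduleLeagueV2.json"


def fetch_schedule(url=SCHEDULE_URL, timeout=10):
    """Download the schedule JSON from the NBA CDN."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def load_schedule(local_path="scheduleLeagueV2.json", fetch=fetch_schedule):
    """Try the API first; fall back to the locally saved copy on failure."""
    try:
        return fetch()
    except (OSError, ValueError):
        # OSError covers network errors and timeouts (urllib's URLError is a
        # subclass); ValueError covers a response that is not valid JSON.
        return json.loads(Path(local_path).read_text(encoding="utf-8"))
```

Calling `load_schedule()` from the directory that holds the saved file returns the live schedule when the request succeeds and the local copy otherwise. If the request fails and no local file exists, the `FileNotFoundError` propagates, which is why creating the file beforehand is recommended.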