Using Machine Learning to Describe how Players Impact the Game in the MLB

This repository contains the elements necessary to describe how players in the MLB impact the game over the short term, which we colloquially refer to as their form. The corresponding paper, accepted at MIT SSAC '22, can be found here. Concretely, player form is described by a numerical vector derived from the sequence of in-game events in which the player participated. The steps described below will guide you through the process from start to finish: collecting the data, training the model, describing player form, and visualizing the produced form embeddings. We present a small comparison of form-based vs. statistic-based embedding visualizations below. A more complete set of batter visualizations can be found here and pitcher visualizations here (these more complete visualizations will also be created in step 5 below).

In general, we find that both form-based and statistic-based embeddings do a good job of differentiating the "good" players from the "bad" ones, but that the form-based embeddings are better suited for highlighting the manner in which the "good" players impact the game. For a succinct example of this, we look to the starting pitcher visualizations. Both the form-based and statistic-based embedding visualizations seem to generally induce a region of all-star players. When looking at breaking ball usage, however, we see no association in the statistic-based visualizations, but a clear grouping of pitchers who throw a large portion of breaking balls in the form-based visualizations.

Implementing Pipeline

Before proceeding, please install all packages listed in requirements.txt.

1. Fetching data
2. Preparing training data
3. Training player form models
4. Describing player form
5. Visualizing form embeddings
6. Clustering form embeddings
7. Predicting game winner

NOTE: While we provide the code to construct a local database and use it to create at-bat records to train and use the models below, the constructed database, at-bat records, form vectors, and form clusters can be found here. The at-bat records are grouped by season and compressed (i.e. 201*.tar.gz). To extract the data for 2015, for example, execute the tar -xzf 2015.tar.gz command. The database can be found in the mlb.db.tar.gz file and extracted using the tar -xzf mlb.db.tar.gz command.

To use the pre-constructed database, create a database/ directory at the root-level of the repo and place the decompressed mlb.db database in that directory. If using the pre-constructed database, step 1 below can be skipped.

To use the pre-constructed at-bat records, create a data/ab_seqs/ab_seqs_v1 directory at the root level of the repo and place the decompressed at-bat records grouped by season in that directory. That is, data/ab_seqs/ab_seqs_v1 should have five sub-directories - 2015, 2016, 2017, 2018, and 2019 - each containing at-bat records for that season. If using the pre-constructed at-bat records, step 2 below can be skipped.

To use the pre-computed player form vectors, create an out/forms directory at the root level of the repo. Then, download batter_form_v1.tar.gz and pitcher_form_v1.tar.gz from the link given above, place them in the out/forms directory, and extract them. After doing so, you should have two new directories: out/forms/batter_form_v1 and out/forms/pitcher_form_v1. If using pre-computed player form vectors, step 4 below can be skipped.

To use the pre-computed form clusters, create an out/form_cluster directory at the root level of the repo. Then, download the batter1_agglom.tar.gz and pitcher1_agglom.tar.gz files from the link provided above, place them in the out/form_cluster directory, and extract their contents. You should have two new directories: out/form_cluster/batter1_agglom and out/form_cluster/pitcher1_agglom. If using pre-computed form clusters, you can comment out the first phase of the cluster_player_forms_and_plot.sh script in step 6.
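The directory layout described in the notes above can be checked before skipping any steps. The following is a minimal sketch, not part of the repository: the helper name missing_artifacts and the EXPECTED table are hypothetical, and the paths are simply the ones listed in the notes, relative to the repository root.

```python
from pathlib import Path

# Paths named in the notes above, relative to the repository root.
SEASONS = ("2015", "2016", "2017", "2018", "2019")
EXPECTED = {
    "database (skip step 1)": ["database/mlb.db"],
    "at-bat records (skip step 2)": [
        f"data/ab_seqs/ab_seqs_v1/{season}" for season in SEASONS
    ],
    "form vectors (skip step 4)": [
        "out/forms/batter_form_v1",
        "out/forms/pitcher_form_v1",
    ],
    "form clusters (skip phase 1 of step 6)": [
        "out/form_cluster/batter1_agglom",
        "out/form_cluster/pitcher1_agglom",
    ],
}


def missing_artifacts(repo_root):
    """Return {artifact group: [missing relative paths]} for incomplete groups."""
    root = Path(repo_root)
    report = {}
    for group, rel_paths in EXPECTED.items():
        absent = [p for p in rel_paths if not (root / p).exists()]
        if absent:
            report[group] = absent
    return report


if __name__ == "__main__":
    # Run from the repository root; prints nothing if everything is in place.
    for group, paths in missing_artifacts(".").items():
        print(f"{group}: missing {', '.join(paths)}")
```

A group that does not appear in the returned report is complete, so the corresponding step can be skipped.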

1. Fetching Data

Estimated duration: 10-15 minutes

To fetch data, simply run the fetch_data.sh script (also given below). By default, this will collect pitch-by-pitch statcast data for 2015-2019, and seasonal statistics back to 1995. The script will create a database directory and create a database file at database/mlb.db. To change where the database is placed, change the DB_FP variable in the fetch_data.sh script to the desired location. If you change the location of the database, please make note of it as you will need it later.

The script utilizes two custom "worker" classes, a PyBaseball worker and a SQLWorker. The PyBaseball workers fetch data via the PyBaseball library and then pass it to a SQLWorker to populate a local database. Pitch-by-pitch data will be collected first, followed by the season-by-season stats. In total, there will be around 3.6M pitch-by-pitch records, 2k season-by-season pitching records, and 3.7k season-by-season batting records. The workers will periodically print their status, i.e. how many records have been processed and the current date range being processed. The full fetch should only take 10-15 minutes on a modern CPU with solid-state storage.
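The fetch-then-store arrangement described above is a producer/consumer pattern: several fetchers hand batches to a single database writer. The sketch below illustrates that pattern only and is not the repository's implementation. It assumes threads and a shared queue, and fetch_records is a hypothetical stand-in that fabricates rows instead of calling PyBaseball; the table and column names are likewise made up.

```python
import queue
import sqlite3
import threading


def fetch_records(date_range):
    """Hypothetical stand-in for a PyBaseball call; returns three fake rows."""
    start, end = date_range
    return [(start, end, i) for i in range(3)]


def fetch_worker(date_ranges, out_q):
    """Fetch each assigned date range and hand the rows to the SQL worker."""
    for date_range in date_ranges:
        out_q.put(fetch_records(date_range))
    out_q.put(None)  # sentinel: this fetch worker is finished


def sql_worker(db_fp, in_q, n_fetchers):
    """Single writer: drain the queue into SQLite until every fetcher is done."""
    conn = sqlite3.connect(db_fp)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pitches "
        "(range_start TEXT, range_end TEXT, seq INTEGER)"
    )
    finished = 0
    while finished < n_fetchers:
        rows = in_q.get()
        if rows is None:
            finished += 1
            continue
        conn.executemany("INSERT INTO pitches VALUES (?, ?, ?)", rows)
        conn.commit()
    conn.close()


def run(db_fp, date_ranges, n_fetchers=3):
    """Split the date ranges across fetchers and funnel results to one writer."""
    q = queue.Queue()
    chunks = [date_ranges[i::n_fetchers] for i in range(n_fetchers)]
    fetchers = [threading.Thread(target=fetch_worker, args=(c, q)) for c in chunks]
    writer = threading.Thread(target=sql_worker, args=(db_fp, q, n_fetchers))
    for t in fetchers + [writer]:
        t.start()
    for t in fetchers + [writer]:
        t.join()
```

Keeping all writes in one worker avoids contention on the SQLite file, which is one plausible reason to separate fetching from storing.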

fetch_data.sh:

#!/bin/bash

export DB_FP="$PWD/database/mlb.db"

# move to source dir
cd src

echo "******************************"
echo "* Fetching play-by-play data *"
echo "******************************"
python3 fetch_data.py --statcast T --pitching_by_season F --batting_by_season F \
                      --start_year 2015 --end_year 2019 \
                      --n_pybaseball_workers 3 --database_fp "$DB_FP"

echo "**********************************"
echo "* Fetching season-by-season data *"
echo "**********************************"
python3 fetch_data.py --statcast F --pitching_by_season T --batting_by_season T \
                      --start_year 1995 --end_year 2019 \
                      --n_pybaseball_workers 1 --database_fp "$DB_FP"

2. Preparing Training Data

Estimated duration: 65 minutes

The first step in creating the training data is to make a single record for each plate appearance in the newly constructed database. To do so, simply run the construct_at_bat_records.sh script (also given below). If you created the database in a location other than database/mlb.db (i.e. you changed DB_FP in the fetch_data.sh script), please update that for this script as well. By default, running the script will create a data/ directory in the repo, and the individual at-bat records will be placed in the data/ab_seqs/ab_seqs_v1/ directory, grouped by season. The output location can be adjusted by modifying the AB_OUT_DIR variable in the script.

On a system with a modern CPU and solid-state storage, it takes roughly 4.5 seconds to construct the at-bats for an individual game. The number of threads used to construct the records is controlled by the N_WORKERS variable in the construct_at_bat_records.sh script, which is set to 16 in the script given below. With 16 threads, the script takes roughly 1 hour to finish processing. Records for each season take up about 10.5 GB, so roughly 55 GB of free space is required for all five seasons.

Once the at-bat records are constructed, the script will aggregate at-bat records by player in chronological order, first for pitchers and then for batters. This should take about a minute or less. By default, the script will create a data/player_career_data directory in the repo and place the output within. If you wish to change the location of the output, simply change the CAREER_OUT_DIR variable in the script.
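The time and disk figures above can be sanity-checked with a little arithmetic. This sketch rests on two assumptions that are not stated in this README: games are spread evenly across threads, and only the 2,430 scheduled regular-season games per season (30 teams, 162 games each, two teams per game) are counted. The real game count differs slightly.

```python
def estimated_hours(n_games, seconds_per_game=4.5, n_workers=16):
    """Wall-clock estimate, assuming games are split evenly across workers."""
    return n_games * seconds_per_game / n_workers / 3600


def estimated_disk_gb(n_seasons=5, gb_per_season=10.5):
    """Total size of the at-bat records across all seasons."""
    return n_seasons * gb_per_season


# 30 teams x 162 games, two teams per game (assumption: regular season only).
REGULAR_SEASON_GAMES = 30 * 162 // 2

if __name__ == "__main__":
    n_games = 5 * REGULAR_SEASON_GAMES
    print(f"{n_games} games, about {estimated_hours(n_games):.2f} h with 16 workers")
    print(f"about {estimated_disk_gb():.1f} GB of at-bat records")
```

Under these assumptions the estimate lands just under one hour and 52.5 GB, in line with the figures quoted above; passing a different n_workers shows how the runtime scales.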
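As a rough illustration of the per-player aggregation described above, the sketch below groups at-bat records by player and orders each group chronologically. The record fields (pitcher_id, game_date, ab_number) are hypothetical and chosen for the example; the repository's actual record schema may differ.

```python
from collections import defaultdict


def build_careers(at_bats, player_key):
    """Group at-bat records by player, then sort each player's records by time."""
    careers = defaultdict(list)
    for ab in at_bats:
        careers[ab[player_key]].append(ab)
    for records in careers.values():
        # ISO-formatted dates sort chronologically as plain strings.
        records.sort(key=lambda ab: (ab["game_date"], ab["ab_number"]))
    return dict(careers)


if __name__ == "__main__":
    demo = [
        {"pitcher_id": 1, "game_date": "2015-04-02", "ab_number": 3},
        {"pitcher_id": 2, "game_date": "2015-04-01", "ab_number": 1},
        {"pitcher_id": 1, "game_date": "2015-04-01", "ab_number": 7},
    ]
    for player, records in build_careers(demo, "pitcher_id").items():
        print(player, [(r["game_date"], r["ab_number"]) for r in records])
```

Calling the same function with a batter key would produce the batter careers, mirroring the pitchers-then-batters order of the script.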

Finally, the script will create whole game records that will be used later on. That is, records describing individual games in terms of the starting batters, pitchers, location, score, and hits among others. By default, the records will be placed in the data/whole_game_records/by_season directory in the repo. This phase should only take around five minutes.

construct_at_bat_records.sh:

#!/bin/bash

export AB_OUT_DIR="$PWD/data/ab_seqs/ab_seqs_v1"
export CAREER_OUT_DIR="$PWD/data/player_career_data"
export WHOLE_GAME_OUT_DIR="$PWD/data/whole_game_records/by_season"

export DB_FP="$PWD/database/mlb.db"
export N_WORKERS=16

# move to source dir
cd src

echo "***************************"
echo "* Building at-bat records *"
echo "***************************"
python3 construct_at_bat_records.py --start_year 2015 --end_year 2019 --n_workers $N_WORKERS \
                                    --out "$AB_OUT_DIR" --db_fp "$DB_FP"

echo "********************************"
echo "* Building pitcher career data *"
echo "********************************"
python3 construct_player_career_records.py --player_type "pitcher" --db_fp "$DB_FP" \
                                           --outdir "$CAREER_OUT_DIR"

echo "*******************************"
echo "* Building batter career data *"
echo "*******************************"
python3 construct_player_career_records.py --player_type "batter" --db_fp "$DB_FP" \
                                           --outdir "$CAREER_OUT_DIR"

echo "*******************************"
echo "* Creating whole game records *"
echo "*******************************"
python3 create_whole_game_records.py --data "$AB_OUT_DIR" --out "$WHOLE_GAME_OUT_DIR"

3. Training Player Form Models

Estimated duration: 2.5 days (pitchers), 3 days (batters)

NOTE: Before proceeding with this step, please ensure you have a CUDA-capable GPU and CUDA installed on the machine you intend to run the models on. Additional information to this end can be found here. While you can technically train these models on a CPU, it would take an impractically long time to complete.

NOTE: GPUs with at least 32 GB of memory are required to train the batter and pitcher models referenced in our corresponding paper (defaults in training scripts). One GPU was used to train the batter model, while two GPUs were used to train the pitcher model.

We provide scripts to train both the batter and pitcher form models as presented in our paper. As their names suggest, batter_form_modeling.sh trains a batter form model and pitcher_form_modeling.sh trains a pitcher form model. The batter model will take a little more than 3 days to train (~80 hours), while the pitcher model will take about 2.5 days to train (~58 hours) using A6000 GPUs. We also provide the trained model weights in the pretrained_models directory in the repo.

batter_form_modeling.sh:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0

export AB_OUT_DIR="$PWD/data/ab_seqs/ab_seqs_v1"
export CAREER_OUT_DIR="$PWD/data/player_career_data"

cd src/

python3 run_player_form_modeling.py --player_type "batter" --epochs 370 --batch_size 256 \
                                    --min_view_step_size 1 --max_view_step_size 5 --view_size 15 \
                                    --form_ab_window_size 20 --min_form_ab_window_size 20 \
                                    --min_ab_to_be_included_in_dataset 40 \
                                    --max_seq_len 200 --max_view_len 125 \
                                    --distribution_based_player_sampling_prob 0.25 \
                                    --mask_override_prob 0.15 --n_warmup_iters 2000 \
                                    --n_data_workers 4 --gpus 0 --port 12345 \
                                    --n_layers 8 --n_attn 8 --n_proj_layers 2 --proj_dim 64 \
                                    --ab_data "$AB_OUT_DIR" --career_data "$CAREER_OUT_DIR"

pitcher_form_modeling.sh:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0,1

export AB_OUT_DIR="$PWD/data/ab_seqs/ab_seqs_v1"
export CAREER_OUT_DIR="$PWD/data/player_career_data"

cd src/

python3 run_player_form_modeling.py --player_type "pitcher" --epochs 175 --save_model_every 5 \
                                    --batch_size 48 \
                                    --min_view_step_size 1 --max_view_step_size 15 --view_size 60 \
                                    --form_ab_window_size 75 --min_form_ab_window_size 70 \
                                    --min_ab_to_be_included_in_dataset 100 \
                                    --max_seq_len 550 --max_view_len 420 \
                                    --distribution_based_player_sampling_prob 0.25 \
                                    --mask_override_prob 0.15 --n_warmup_iters 4000 \
                                    --n_data_workers 4 --gpus 0 1 --port 12345 \
                                    --n_layers 8 --n_attn 8 --n_proj_layers 2 --proj_dim 64 \
                                    --ab_data "$AB_OUT_DIR" --career_data "$CAREER_OUT_DIR"

4. Describing Player Form

Estimated duration: 25 minutes (pitchers), 60 minutes (batters)

The describe_player_forms.sh script (also given below) is provided for you to describe the form of players in the starting lineup for games from 2015-2019. Please remember to update the FORM_OUT_DIR, AB_OUT_DIR, CAREER_OUT_DIR, and WHOLE_GAME_OUT_DIR variables in the script if you have changed them in any of the previous scripts. If run as given, the script will describe batter form using the provided pretrained batter model. To describe pitcher form, change the --model_ckpt "$BATTER_MODEL_FP" script argument to --model_ckpt "$PITCHER_MODEL_FP". If you wish to use a different model, simply provide the filepath to the desired checkpoint instead.

By default, the script will place the output in the out/forms/[player_type]_form_v1 directory in the repo, where [player_type] is either batter or pitcher. Within that directory, the script will create one directory for each season being processed. Player form vectors will be placed in the appropriate season directory as NumPy (.npy) files named [GAME_PK]-[PLAYER_ID].npy.
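Given the season/[GAME_PK]-[PLAYER_ID].npy layout described above, the form vectors can be indexed by walking the output directory. This is a minimal sketch under stated assumptions: index_form_files is a hypothetical helper, and it assumes both IDs in the filename are integers.

```python
from pathlib import Path


def index_form_files(form_dir):
    """Map (season, game_pk, player_id) -> file path for every .npy file found."""
    index = {}
    for path in sorted(Path(form_dir).glob("*/*.npy")):
        # Filenames follow [GAME_PK]-[PLAYER_ID].npy; IDs assumed to be integers.
        game_pk, player_id = path.stem.split("-", 1)
        index[(path.parent.name, int(game_pk), int(player_id))] = path
    return index


if __name__ == "__main__":
    index = index_form_files("out/forms/batter_form_v1")
    print(f"found {len(index)} form vectors")
```

Each path in the index can then be read with numpy.load, which requires NumPy and is left out here to keep the sketch dependency-free.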

NOTE: If providing a different model checkpoint, the script expects the parameters defining the model to be in an args.txt file one level above the directory that contains the model checkpoint file. For example, if the path to the model checkpoint is model_time_id/models/model_ckpt.pt, the script will expect the model parameters to be found in model_time_id/args.txt.

When using the script as provided (--n_workers -1), the script will use 10 threads to build the input data for batters and 3 threads for pitchers. With these parameters, it will take ~12 minutes to process one season of batters and ~5 minutes for a season of pitchers. If you wish to use a different number of threads, change --n_workers to the desired value.
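The path relationship in the example above (model_time_id/models/model_ckpt.pt pairs with model_time_id/args.txt) can be expressed in a couple of lines. The helper name args_path_for is hypothetical and only restates that example; it does not read or validate the file.

```python
from pathlib import Path


def args_path_for(ckpt_fp):
    """Return the args.txt path expected for a given checkpoint path."""
    # <run_dir>/models/<checkpoint>.pt  ->  <run_dir>/args.txt
    return Path(ckpt_fp).parent.parent / "args.txt"


if __name__ == "__main__":
    print(args_path_for("model_time_id/models/model_ckpt.pt"))
```

Checking that this path exists before launching the script is a cheap way to catch a misplaced checkpoint.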

describe_player_forms.sh:

#!/bin/bash

export FORM_OUT_DIR="$PWD/out/forms"
export AB_OUT_DIR="$PWD/data/ab_seqs/ab_seqs_v1"
export CAREER_OUT_DIR="$PWD/data/player_career_data"
export WHOLE_GAME_OUT_DIR="$PWD/data/whole_game_records"

export BATTER_MODEL_FP="$PWD/pretrained_models/batter_form_model/models/model_370e.pt"
export PITCHER_MODEL_FP="$PWD/pretrained_models/pitcher_form_model/models/model_175e.pt"

# move to source dir
cd src

echo "***************************"
echo "* Describing player forms *"
echo "***************************"

python3 describe_player_forms.py --ab_data "$AB_OUT_DIR" --career_data "$CAREER_OUT_DIR" \
                                 --whole_game_record_dir "$WHOLE_GAME_OUT_DIR" \
                                 --model_ckpt "$BATTER_MODEL_FP" --out "$FORM_OUT_DIR" \
                                 --start_year 2015 --end_year 2019 \
                                 --n_workers -1 --out_dir_tmplt "{}_form_v1"

5. Visualizing Form Embeddings

Estimated duration: <5 minutes

We provide the visualize_form_embeddings.sh script (also given below) to visualize the embeddings that were created in step 4. Before actually creating the plots, the script will compute the statistics that will accompany the visualizations (WAR, batting average, ERA, salary, etc). The script will create a bin/ directory in the given --form_rep_dir and save intermediate data there so it does not need to be computed every time. The initial construction of this data should only take a minute or two, and the actual plotting should also only take a minute or two.

t-SNE is used to project the form embeddings from 64-dimensional vectors to 2-D vectors. t-SNE is a stochastic (random) process, so the results will not be the same each time the points are projected. To ensure reproducible plots, the t-SNE projections are saved in the bin/ folder. Delete the saved projections to create new ones. A small example of the embedding visualizations for batters and starting pitchers is given below. A more complete set of batter visualizations can be found here and pitcher visualizations here.
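The compute-once, reuse-afterwards behaviour described above can be sketched as a small cache around a stochastic projection. This is an illustration only: stochastic_projection is a stand-in that returns random 2-D points rather than running t-SNE, and the cache filename and JSON format are assumptions, not the repository's actual files.

```python
import json
import random
from pathlib import Path


def stochastic_projection(vectors):
    """Stand-in for t-SNE: returns different 2-D points on every call."""
    return [[random.random(), random.random()] for _ in vectors]


def load_or_project(vectors, cache_fp, project=stochastic_projection):
    """Reuse a saved projection if one exists; otherwise compute and save it."""
    cache_fp = Path(cache_fp)
    if cache_fp.exists():
        return json.loads(cache_fp.read_text())
    points = project(vectors)
    cache_fp.parent.mkdir(parents=True, exist_ok=True)
    cache_fp.write_text(json.dumps(points))
    return points
```

Repeated calls with the same cache path return identical points, and deleting the cached file forces a fresh projection on the next call.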

visualize_form_embeddings.sh:

#!/bin/bash

export FORM_DIR="$PWD/out/forms/batter_form_v1"
export WHOLE_GAME_DIR="$PWD/data/whole_game_records"

export DB_FP="$PWD/database/mlb.db"

# move to source dir
cd src

echo "*******************************"
echo "* Visualizing form embeddings *"
echo "*******************************"

python3 visualize_form_embeddings.py --form_rep_dir "$FORM_DIR" \
                                     --whole_game_records_dir "$WHOLE_GAME_DIR" \
                                     --db_fp "$DB_FP" --n_workers 12 --stats_mode "F"

6. Clustering Form Embeddings

Estimated duration: Varies based on system

NOTE: A machine with a very large amount of RAM is required to perform the clustering described in this section. For example, a machine with 512 GB of RAM was used to obtain the results presented in the paper.

We provide the cluster_player_forms_and_plot.sh script which, as the name suggests, clusters the form vectors to obtain discrete form IDs and then plots the discrete form ID of select players at game-start time for games from 2015-2019. The script has two phases: 1) cluster player forms and 2) inspect form clusters. Phase 1 requires a significant amount of RAM; a personal computer will not be able to handle the data. Phase 2 should take less than one minute.

By default, the script expects form records to be present in out/forms/batter_form_v1 in the repository, as defined by the FORM_DIR variable. This is the default output location for the script presented above in step 4. The default output location for pitcher form is commented out in line 4 of the script. If you change the FORM_DIR variable, please be sure to change the CLUSTER_OUT_DIR variable to an appropriate, corresponding location. The WHOLE_GAME_RECORD_DIR variable should correspond to the WHOLE_GAME_OUT_DIR variable defined in construct_at_bat_records.sh in step 2.

The script is configured to cluster batters using 75, 50, and 25 clusters and pitchers using 32, 16, and 8 clusters. By default, the script will process the form IDs when 75 clusters were used. This can be changed by modifying the CLUSTER_TO_PLOT value.
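To make the choice of CLUSTER_TO_PLOT concrete, the sketch below builds the mapping path for a given player type and cluster count. It assumes that every cluster count follows the cluster_map_k75.json naming used in the script and that pitcher mappings live under pitcher1_agglom in the same way batter mappings live under batter1_agglom; cluster_map_path is a hypothetical helper.

```python
from pathlib import Path

# Cluster counts the script is configured to produce, per the text above.
CLUSTER_SIZES = {"batter": (75, 50, 25), "pitcher": (32, 16, 8)}


def cluster_map_path(repo_root, player_type, k):
    """Build the mapping file path for a player type and number of clusters."""
    if k not in CLUSTER_SIZES[player_type]:
        raise ValueError(f"{k} clusters is not configured for {player_type}s")
    return (
        Path(repo_root)
        / "out"
        / "form_cluster"
        / f"{player_type}1_agglom"
        / "mappings"
        / f"cluster_map_k{k}.json"  # naming assumed from cluster_map_k75.json
    )


if __name__ == "__main__":
    print(cluster_map_path(".", "batter", 50))
```

The printed path is the value one would assign to CLUSTER_TO_PLOT to plot a different clustering.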

cluster_player_forms_and_plot.sh:

#!/bin/bash

export FORM_DIR="$PWD/out/forms/batter_form_v1"
export CLUSTER_OUT_DIR="$PWD/out/form_cluster/batter1_agglom"
export WHOLE_GAME_RECORD_DIR="$PWD/data/whole_game_records/by_season"

export CLUSTER_TO_PLOT="$CLUSTER_OUT_DIR/mappings/cluster_map_k75.json"
export FIG_OUT_DIR="$CLUSTER_OUT_DIR/eval"

# move to source dir
cd src

echo "***************************"
echo "* Clustering player forms *"
echo "***************************"
python3 cluster_player_forms.py --data "$FORM_DIR" --out "$CLUSTER_OUT_DIR"

echo "**************************"
echo "* Plotting form clusters *"
echo "**************************"
python3 inspect_form_clusters.py --data "$CLUSTER_TO_PLOT" --out "$FIG_OUT_DIR" \
                                 --whole_game_records_dir "$WHOLE_GAME_RECORD_DIR"

7. Predicting Game Winner

Estimated duration: Varies based on system and models

Although not necessarily the primary intention of this work, we explored how the form embeddings can be leveraged towards predicting the winner of a regular season game in the MLB and provide the victor_predictor_parm_search.sh script to that end. As configured, the script will look in the default locations for at-bat records (AB_SEQ_DIR), whole-game records (WHOLE_GAME_RECORD_DIR), and player form embeddings ([BATTER|PITCHER]_FORM_DIR).

The first phase of the script creates the train, test, and eval splits for the games that will be used to train and evaluate the models. The obtained splits will be placed in the directory defined by the SPLITS_DIR variable. Next, the script formats the different inputs that will be used to train the models - team meta statistics, player performance statistics, and player form embeddings. Intermediate data will be placed in the locations defined by FORM_VECTOR_DIR and TEAM_META_VECTOR_DIR, but the final numerical vectors will be placed in the directory defined by the SPLITS_DIR variable.
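As an illustration of what creating train, test, and eval splits over games involves, the sketch below shuffles game IDs with a fixed seed and cuts them into three disjoint groups. The 80/10/10 fractions, the seed, and random shuffling are all assumptions made for the example; this README does not state how create_game_splits.py actually divides the games.

```python
import random


def split_games(game_pks, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle game IDs reproducibly and cut them into train/test/eval lists."""
    ids = sorted(game_pks)
    random.Random(seed).shuffle(ids)  # fixed seed -> same split on every run
    n_train = int(len(ids) * fractions[0])
    n_test = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "eval": ids[n_train + n_test:],  # remainder goes to eval
    }


if __name__ == "__main__":
    splits = split_games(range(1000))
    print({name: len(ids) for name, ids in splits.items()})
```

Splitting at the game level keeps every vector derived from one game in the same split, so the same game never appears in both training and evaluation data.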

As configured, the script will perform a basic parameter search on random forest models using the data we found most suitable for this task. Performance statistics describing the models, and the trained models themselves, will be placed in the out/basic_parm_search directory. To change the type of model architecture, set exactly one of the --do_rf, --do_logreg, or --do_svm flags to T and the others to F (only one model architecture can be explored at a time).

victor_predictor_parm_search.sh:

#!/bin/bash

export DB_FP="$PWD/database/mlb.db"

# Inputs
export AB_SEQ_DIR="$PWD/data/ab_seqs/ab_seqs_v1"
export WHOLE_GAME_RECORD_DIR="$PWD/data/whole_game_records/by_season"
export BATTER_FORM_DIR="$PWD/out/forms/batter_form_v1"
export PITCHER_FORM_DIR="$PWD/out/forms/pitcher_form_v1"

# Outputs
export SPLITS_DIR="$PWD/data/whole_game_splits"
export FORM_VECTOR_DIR="$PWD/data/game_form_vectors_v1"
export TEAM_META_VECTOR_DIR="$PWD/data/game_meta_vectors_v1"
export PARM_SEARCH_OUT_DIR="$PWD/out/basic_parm_search"

# move to source dir
cd src

echo "************************"
echo "* Creating Game Splits *"
echo "************************"
python3 create_game_splits.py --db_fp "$DB_FP" --whole_game_record_dir "$WHOLE_GAME_RECORD_DIR" \
                              --ab_data "$AB_SEQ_DIR" --out "$SPLITS_DIR"

echo "******************************"
echo "* Creating Team Meta Vectors *"
echo "******************************"
python3 construct_game_meta_vectors.py --db_fp "$DB_FP" --out "$TEAM_META_VECTOR_DIR" \
                                       --whole_game_record_dir "$WHOLE_GAME_RECORD_DIR"

echo "******************************"
echo "* Creating Team Stat Vectors *"
echo "******************************"
python3 construct_game_stat_vectors.py --whole_game_record_dir "$WHOLE_GAME_RECORD_DIR" \
                                       --splits_basedir "$SPLITS_DIR"

echo "******************************"
echo "* Creating Game Form Vectors *"
echo "******************************"
python3 create_game_reps_from_form.py --whole_game_record_dir "$WHOLE_GAME_RECORD_DIR" \
                                      --batter_form_dir "$BATTER_FORM_DIR" \
                                      --pitcher_form_dir "$PITCHER_FORM_DIR" \
                                      --out "$FORM_VECTOR_DIR" \
                                      --do_pca True --n_pca 5

echo "********************************"
echo "* Performing Basic Parm Search *"
echo "********************************"
python3 basic_parm_search.py --use_stats T --use_form T --use_meta T \
                             --do_rf T --do_logreg F --do_svm F  \
                             --whole_game_record_dir "$WHOLE_GAME_RECORD_DIR" \
                             --splits_basedir "$SPLITS_DIR" \
                             --form_dir "$FORM_VECTOR_DIR" --form_subdir "pca-5" \
                             --game_meta_fp "$TEAM_META_VECTOR_DIR/game_meta_vectors.json" \
                             --out "$PARM_SEARCH_OUT_DIR" --force_new_data T

How can I cite this work?

Please cite the SSAC '22 paper, given below.

@inproceedings{heaton2022using,
  title={Using Machine Learning to Describe how Players Impact the Game in the MLB},
  author={Heaton, Connor and Mitra, Prasenjit},
  booktitle={The 16th Annual MIT Sloan Sports Analytics Conference},
  year={2022}
}
