In a nutshell, this project aims for three things:
- Acquire data from the Transfermarkt website using the transfermarkt-scraper.
- Build a clean, public football (soccer) dataset using the data from 1.
- Automate 1 and 2 to keep these assets up to date and publicly available on some well-known data catalogs.
Check out this dataset also on: ✅ Kaggle | ✅ data.world
| ![]() |
|---|
| High level data model for transfermarkt-datasets |
This is a DVC repository, therefore all files for the current revision can be pulled from remote storage with the `dvc pull` command. All project data assets are kept inside the `data` folder.
- `data/raw`: contains raw data per season as acquired with transfermarkt-scraper (check acquire)
- `data/prep`: contains the prepared datasets as produced by the `transfermarkt_datasets` module (check prepare)
ℹ️ Read access to the DVC remote storage for the project is required to successfully run `dvc pull`. Contributors should feel free to grant themselves access by adding their AWS IAM user ARN to this whitelist. Have a look at this PR for an example.
In the scope of this project, "acquiring" is the process of collecting "raw data" as it is produced by transfermarkt-scraper. Acquired data lives in the `data/raw` folder and can be created or updated for a particular season using the `1_acquire.py` script.
```
$ python 1_acquire.py local --asset all --season 2021
```

This dependency is the reason why transfermarkt-scraper exists as a sub-module in this project. `1_acquire.py` is a helper script that runs the scraper with a set of parameters and collects its output in `data/raw`.
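Conceptually, a wrapper like this assembles one scraper invocation per asset and season, then collects the output under `data/raw`. The sketch below is purely illustrative: the spider name, settings, and flags are assumptions, not the scraper's real CLI.

```python
def build_scraper_command(asset, season, out_dir="data/raw"):
    """Sketch of how an acquire helper might assemble a scraper run.

    Every name in this command is hypothetical; a real wrapper would pass
    the result to subprocess.run(cmd, check=True).
    """
    cmd = [
        "scrapy", "crawl", asset,            # hypothetical spider per asset
        "-s", f"SEASON={season}",            # hypothetical season setting
        "-o", f"{out_dir}/{season}/{asset}.json",  # collect output in data/raw
    ]
    return cmd

print(build_scraper_command("games", 2021))
```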
In the scope of this project, "preparing" is the process of transforming raw data to create a high quality dataset that can be conveniently consumed by analysts of all kinds. The `transfermarkt_datasets` module contains the preparation logic, which can be executed using the `2_prepare.py` script.
Configuration is defined in the `config.yml` file. The assets section references classes in `transfermarkt_datasets/assets`, which define the logic for building and validating the different assets.
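An asset class referenced from the config might look roughly like the following. This is an illustrative sketch of the build-and-validate pattern, not the module's actual API; the class names, method names, and fields are all assumptions.

```python
class Asset:
    """Illustrative base class: each asset knows how to build clean rows
    from raw records and validate the result. Names are hypothetical."""

    name = "base"

    def build(self, raw_records):
        """Transform raw scraper records into prepared rows."""
        raise NotImplementedError

    def validate(self, rows):
        """Run a basic quality check on the built rows."""
        return len(rows) > 0


class GamesAsset(Asset):
    name = "games"

    def build(self, raw_records):
        # keep only records with an id, exposing hypothetical fields
        return [
            {"game_id": r["id"], "season": r["season"]}
            for r in raw_records
            if "id" in r
        ]
```

Under this sketch, a prepare step would iterate over the assets declared in the config, calling `build` and then `validate` on each.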
`transfermarkt_datasets` provides a Python API that can be used to work with the module from Python rather than using the script. This is particularly convenient for working with the datasets from a notebook.
```python
# import the module
from transfermarkt_datasets.transfermarkt_datasets import TransfermarktDatasets

# instantiate the datasets handler
td = TransfermarktDatasets(source_path="data/raw")

# build the datasets from raw data
td.build_datasets()

# inspect the results
td.asset_names  # ["games", "players"...]
td.assets["games"].prep_df  # get the built asset in a dataframe
td.assets["games"].get_stacked_data()  # get the raw data in a dataframe
```

For more examples on using `transfermarkt_datasets`, check out the sample notebooks.
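A typical next step in a notebook is a quick aggregation over a built asset. The sketch below uses plain dicts standing in for the rows of a prepared asset (the field names are hypothetical), so it runs without the project installed:

```python
from collections import Counter

# toy rows standing in for a prepared "games" asset (hypothetical fields)
games = [
    {"game_id": 1, "season": 2020},
    {"game_id": 2, "season": 2021},
    {"game_id": 3, "season": 2021},
]

# count games per season, the kind of sanity check an analyst might run
games_per_season = Counter(g["season"] for g in games)
print(games_per_season)  # Counter({2021: 2, 2020: 1})
```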
Define all the necessary infrastructure for the project in the cloud with Terraform.
Contributions to transfermarkt-datasets are most welcome. If you want to contribute new fields or assets to this dataset, instructions are quite simple:
- Fork the repo (make sure to initialize sub-modules as well with `git submodule update --init --recursive`)
- Set up a new conda environment with `conda env create -f environment.yml`
- Pull the raw data by either running `dvc pull` (requesting access is needed) or using the `1_acquire.py` script (no access request is needed)
- Start modifying assets or creating a new one in `transfermarkt_datasets/assets`. You can use `2_prepare.py` to run and test your changes.
- If it's all looking good, create a pull request with your changes 🚀
