In a nutshell, this project aims for three things:
- Acquire data from the Transfermarkt website using the transfermarkt-scraper.
- Build a clean, public football (soccer) dataset using the data from 1.
- Automate 1 and 2 to keep these assets up to date and publicly available on some well-known data catalogs.
Check out this dataset also on: ✅ Kaggle | ✅ data.world
| ![]() |
|---|
| High level data model for transfermarkt-datasets |
This is a DVC repository, therefore all files for the current revision can be pulled from remote storage with the `dvc pull` command. All project data assets are kept inside the `data` folder.
- `data/raw`: contains raw data per season as acquired with transfermarkt-scraper (check acquire)
- `data/prep`: contains the prepared datasets as produced by the `transfermarkt_datasets` module (check prepare)
ℹ️ Read access to the DVC remote storage for the project is required to successfully run `dvc pull`. Contributors should feel free to grant themselves access by adding their AWS IAM user ARN to this whitelist. Have a look at this PR for an example.
In the scope of this project, "acquiring" is the process of collecting "raw data" as it is produced by transfermarkt-scraper. Acquired data lives in the `data/raw` folder and can be created or updated for a particular season using the `1_acquire.py` script.
```
$ python 1_acquire.py local --asset all --season 2021
```

This dependency is the reason why transfermarkt-scraper exists as a sub-module in this project. `1_acquire.py` is a helper script that runs the scraper with a set of parameters and collects its output in `data/raw`.
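Conceptually, a wrapper like this assembles one scraper invocation per asset and season, then collects the output under `data/raw`. The sketch below is purely illustrative: the spider name, settings, and flags are assumptions, not the scraper's real CLI.

```python
def build_scraper_command(asset, season, out_dir="data/raw"):
    """Sketch of how an acquire helper might assemble a scraper run.

    Every name in this command is hypothetical; a real wrapper would pass
    the result to subprocess.run(cmd, check=True).
    """
    cmd = [
        "scrapy", "crawl", asset,            # hypothetical spider per asset
        "-s", f"SEASON={season}",            # hypothetical season setting
        "-o", f"{out_dir}/{season}/{asset}.json",  # collect output in data/raw
    ]
    return cmd

print(build_scraper_command("games", 2021))
```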
In the scope of this project, "preparing" is the process of transforming raw data to create a high quality dataset that can be conveniently consumed by analysts of all kinds. The `transfermarkt_datasets` module contains the preparation logic, which can be executed using the `2_prepare.py` script.
Configuration is defined in the `config.yml` file. The assets section references classes in `transfermarkt_datasets/assets`, which define the logic for building and validating the different assets.
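An asset class referenced from the config might look roughly like the following. This is an illustrative sketch of the build-and-validate pattern, not the module's actual API; the class names, method names, and fields are all assumptions.

```python
class Asset:
    """Illustrative base class: each asset knows how to build clean rows
    from raw records and validate the result. Names are hypothetical."""

    name = "base"

    def build(self, raw_records):
        """Transform raw scraper records into prepared rows."""
        raise NotImplementedError

    def validate(self, rows):
        """Run a basic quality check on the built rows."""
        return len(rows) > 0


class GamesAsset(Asset):
    name = "games"

    def build(self, raw_records):
        # keep only records with an id, exposing hypothetical fields
        return [
            {"game_id": r["id"], "season": r["season"]}
            for r in raw_records
            if "id" in r
        ]
```

Under this sketch, a prepare step would iterate over the assets declared in the config, calling `build` and then `validate` on each.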
`transfermarkt_datasets` provides a Python API that can be used to work with the module from Python rather than using the script. This is particularly convenient for working with the datasets from a notebook.
```python
# import the module
from transfermarkt_datasets.transfermarkt_datasets import TransfermarktDatasets

# instantiate the datasets handler
td = TransfermarktDatasets(source_path="data/raw")

# build the datasets from raw data
td.build_datasets()

# inspect the results
td.asset_names  # ["games", "players"...]
td.assets["games"].prep_df  # get the built asset in a dataframe
td.assets["games"].get_stacked_data()  # get the raw data in a dataframe
```

For more examples on using `transfermarkt_datasets`, check out the sample notebooks.
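A typical next step in a notebook is a quick aggregation over a built asset. The sketch below uses plain dicts standing in for the rows of a prepared asset (the field names are hypothetical), so it runs without the project installed:

```python
from collections import Counter

# toy rows standing in for a prepared "games" asset (hypothetical fields)
games = [
    {"game_id": 1, "season": 2020},
    {"game_id": 2, "season": 2021},
    {"game_id": 3, "season": 2021},
]

# count games per season, the kind of sanity check an analyst might run
games_per_season = Counter(g["season"] for g in games)
print(games_per_season)  # Counter({2021: 2, 2020: 1})
```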
Define all the necessary infrastructure for the project in the cloud with Terraform.
Contributions to transfermarkt-datasets are most welcome. If you want to contribute new fields or assets to this dataset, instructions are quite simple:
- Fork the repo (make sure to initialize sub-modules as well with `git submodule update --init --recursive`)
- Set up a new conda environment with `conda env create -f environment.yml`
- Pull the raw data by either running `dvc pull` (requesting access is needed) or using the `1_acquire.py` script (no access request is needed)
- Start modifying assets or creating a new one in `transfermarkt_datasets/assets`. You can use `2_prepare.py` to run and test your changes.
- If it's all looking good, create a pull request with your changes 🚀
