This is a repository of the code produced during the project "Migration Forecast EU" by the Bertelsmann Foundation. The goal of the project was to use data from Google Trends to predict the number of registrations of EU nationals in Germany. The repository consists of two components:
- A Python backend for data ingestion and transformation, particularly for Google Trends data obtained via the official API
- A selection of Jupyter notebooks for both exploratory analysis and training/evaluation of various ML regression models
Requirements:
- A valid Python installation (3.9 or newer)
- Access to the official Google Trends API with a valid developer key (available to research institutions on request from Google)
Setup and usage:
- Clone the repository and open the main folder in a terminal (Windows users may use Git Bash or WSL)
- Create a new conda environment (https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-environments) or venv (https://docs.python.org/3/library/venv.html) and activate it
- Install the required packages: `pip install -r requirements.txt`
- Create a `.env` file with your Google API key: `echo "GOOGLE_DEVELOPER_KEY=insert_your_developer_key_here" > .env`
- Run `python get_data.py` to obtain the data from the Google Trends API (`python get_data.py -h` for more options)
- Run `python process_data.py` and `python process_registrations.py` to process and transform the raw data for both Google Trends and the official registration statistics. The results are stored in the folder `data/processed`. (Processed data as of mid-2021 are already included in the repo.)
- You can now play around with the Jupyter notebooks in the `notebooks` folder.
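The scripts presumably load the developer key from the `.env` file into the environment (e.g. via a dotenv-style loader). As an illustration of that pattern, here is a minimal, dependency-free sketch; `load_dotenv_minimal` is a hypothetical helper for this README, not part of the repo's actual code:

```python
import os


def load_dotenv_minimal(path: str = ".env") -> None:
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Blank lines and '#' comments are skipped; existing environment
    variables are not overwritten.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))


if __name__ == "__main__":
    if os.path.exists(".env"):
        load_dotenv_minimal()
        print(os.environ.get("GOOGLE_DEVELOPER_KEY"))
```

In practice a library such as python-dotenv does the same job with more edge-case handling.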
Folder structure:
- `/`: the main folder contains all the Python executables for loading and processing the data
- `/data`: all the data (raw and processed) as well as config files with metadata
- `/data/config`: config files describing the metadata, used for processing
- `/data/keywords`: Excel files containing the keywords for which the Google Trends API is queried
- `/data/processed`: contains the processed data after the processing scripts have been run
- `/data/raw`: raw data
- `/data/raw/eurostat`: macroeconomic indicators for EU countries, obtained from EUROSTAT (Excel)
- `/data/raw/registrations`: monthly registrations of EU nationals in Germany, obtained from DESTATIS (Excel)
- `/data/raw/trends`: raw data from the Google Trends API, generated by running `get_data.py`
- `/modules`: various Python modules containing utilities
- `/modules/eumf_custom_models.py`: custom ML models (right now only a linear dummy model)
- `/modules/eumf_data.py`: utility functions for loading and transforming data
- `/modules/eumf_eval.py`: utility functions for evaluating ML performance
- `/modules/eumf_google_trends.py`: higher-level functions for generating the Google Trends API queries from the keywords and storing the data
- `/notebooks`: Jupyter notebooks for analysis, ML training, and evaluation
- `/notebooks/analysis`: descriptive and diagnostic analysis
- `/notebooks/experiments`: experiments and analysis with various forecasting/regression algorithms (probably the most important folder)
- `/notebooks/presentations`: plots generated for workshop presentations
- `/notebooks/prototypes`: playground for trying out things
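Since the notebooks live one level below the repo root, they need `modules` to be importable. A common pattern is to prepend the repo root to `sys.path` in the first notebook cell; this is a sketch of that pattern, not necessarily the project's actual approach:

```python
import os
import sys

# Assumes the notebook runs inside notebooks/<subfolder> or notebooks/,
# so the repo root is one (or two) levels up from the working directory.
repo_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)

# Afterwards the utility modules listed above can be imported, e.g.:
# from modules import eumf_data, eumf_eval
```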
Limitations and possible extensions:
- The forecast algorithms are currently trained only for academic analysis; there is no code yet for deploying them in production (e.g. in a dashboard)
- It would be nice to use the DESTATIS API instead of Excel files (same for EUROSTAT)
- In the branch `dev_db`, there are alternative scripts to store the Google Trends data in a Postgres database instead of CSV files. This has been tested rudimentarily but not yet integrated into the main branch and the notebooks.
- Right now, access to the official Google Trends API is needed, which is only granted to certain institutions, especially in research. It would be nice if the unofficial, publicly accessible pytrends package (https://pypi.org/project/pytrends/) could also be used as a backend
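A pytrends-based backend could look roughly like the sketch below. `batch_keywords` and `fetch_trends` are hypothetical helpers, and the keywords, geo, and timeframe are illustrative; note that Google Trends (and thus pytrends) accepts at most five terms per comparison payload:

```python
from typing import Iterator, List


def batch_keywords(keywords: List[str], size: int = 5) -> Iterator[List[str]]:
    """Yield keyword batches; Google Trends allows at most 5 terms per query."""
    for i in range(0, len(keywords), size):
        yield keywords[i:i + size]


def fetch_trends(keywords: List[str], geo: str = "PL",
                 timeframe: str = "2015-01-01 2021-06-30"):
    """Fetch interest-over-time for each keyword batch (requires network)."""
    from pytrends.request import TrendReq  # pip install pytrends

    pytrends = TrendReq(hl="en-US", tz=0)
    frames = []
    for batch in batch_keywords(keywords):
        pytrends.build_payload(batch, timeframe=timeframe, geo=geo)
        frames.append(pytrends.interest_over_time())
    return frames


if __name__ == "__main__":
    # Illustrative keywords; the project's real lists live in data/keywords/*.xlsx
    print(list(batch_keywords(["arbeit", "wohnung", "jobs germany"], size=2)))
```

One caveat: pytrends returns relative search interest normalized per request, so batches queried separately are not directly comparable without a shared anchor term, which the official API handles differently.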
Happy forecasting!