
3b. Experiment 2

Katherine Hoffmann Pham edited this page Sep 13, 2020 · 3 revisions

Data

This experiment used the following data sources:

| Data type | Source | Key variables |
| --- | --- | --- |
| Conflict | ACLED API | Incidents, fatalities |
| Prices | FSNAU | Commodity prices |
| Environment and health | EW-EA dashboard | Rainfall, NDVI, river levels, cholera cases |
| Historical displacement | UNHCR PRMN | Data on arrivals' previous, current, and future regions |
| Distances | Manually imputed | Geographic distances |

Below is a brief overview of the data. Yellow represents high values, while blue represents low values. White represents missing values.


Tools/algorithms

As a "sanity check", we benchmarked our performance against four naive baseline models:

  • Last observation carried forward (LOCF): assumes that arrivals will not change over the three-month forecast horizon
  • Expanding mean: an "average" that takes into account all prior data (i.e., it is not restricted to recent months)
  • Exponential weighted mean: a 12-month "average" that gives more weight to recent periods
  • Historical mean: a rolling average of the last 12 months' arrivals, shifted forward by three months
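The four baselines can be sketched in pandas; the series below is a hypothetical monthly arrivals count for a single region (the values are illustrative, not actual PRMN data):

```python
import pandas as pd

# Hypothetical monthly arrivals for one region (illustrative values only).
arrivals = pd.Series(
    [100, 120, 90, 150, 130, 110, 160, 140, 170, 155, 165, 180, 175],
    index=pd.period_range("2018-01", periods=13, freq="M"),
)

horizon = 3  # forecast three months ahead

# LOCF: the forecast for month t is the value observed at t - 3.
locf = arrivals.shift(horizon)

# Expanding mean: average over all data available up to t - 3.
expanding_mean = arrivals.expanding().mean().shift(horizon)

# Exponential weighted mean over a ~12-month span, lagged three months.
ewm_mean = arrivals.ewm(span=12).mean().shift(horizon)

# Historical mean: rolling 12-month average, shifted forward three months.
historical_mean = arrivals.rolling(window=12, min_periods=1).mean().shift(horizon)
```

In each case the three-month shift ensures the baseline only uses information that would actually be available at forecast time.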

Our primary models include a range of standard regression algorithms, namely:

  • Ridge and lasso regressions
  • Multi-layer perceptrons
  • XGBoost and Adaboost
  • Decision trees
  • Random forests
  • Support vector machines
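The scikit-learn side of this model zoo can be sketched as a dictionary of estimators. Default hyperparameters are shown (the tuned settings are not specified in this page), and XGBoost is noted in a comment because it lives in the separate `xgboost` package:

```python
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
# from xgboost import XGBRegressor  # XGBoost is a separate package

# Default hyperparameters; the experiment's tuned values are not shown here.
models = {
    "ridge": Ridge(),
    "lasso": Lasso(),
    "mlp": MLPRegressor(),
    "adaboost": AdaBoostRegressor(),
    "decision_tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(),
    "svm": SVR(),
}
```

Each estimator exposes the same `fit`/`predict` interface, which makes it straightforward to loop over the dictionary and compare models on identical train/test data.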

We also test a simple Long Short-Term Memory (LSTM) neural network.

Finally, we experimented with alternative tools, such as Eureqa and H2O.ai, but have currently decided to proceed with open-source scikit-learn models for reproducibility and intelligibility.

Technical problem setup

Data pre-processing

The data range was restricted to 2011-04 through the present, to allow for a three-month lag from the start of the arrivals data in 2011-01.

Handling of missing data

For each variable with missing values, a binary indicator was created to flag whether the observation was missing.
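This missingness flagging can be sketched in pandas; the column names and values below are illustrative, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with gaps (column names are illustrative).
df = pd.DataFrame({
    "rainfall": [10.0, np.nan, 5.0],
    "ndvi": [0.3, 0.4, np.nan],
})

# One binary flag per variable marking whether the value was missing.
for col in ["rainfall", "ndvi"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)
```

The flags let a model learn from the pattern of missingness itself, rather than silently dropping or imputing rows.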

Generation of new variables

We also generated the following variables:

  • A dummy indicator for each region
  • A dummy indicator for each month of the year
  • A continuous variable counting months since January 2010
  • A 12-month historical average of all time-varying independent variables for the focal region, lagged by three months
  • All time-varying independent variables for the focal region, lagged by 3, 4, 5, 6, 9, and 12 months
  • All time-varying independent variables for all other regions, lagged by 3, 4, 5, 6, 9, and 12 months
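The feature generation above can be sketched with pandas group-wise operations. The panel below is a toy example (two hypothetical regions, illustrative incident counts), and only a subset of the lags is shown:

```python
import pandas as pd

# Toy panel: one row per (region, month); values are illustrative.
df = pd.DataFrame({
    "region": ["Bay"] * 6 + ["Gedo"] * 6,
    "month": list(pd.period_range("2018-01", periods=6, freq="M")) * 2,
    "incidents": [5, 7, 6, 9, 8, 10, 2, 3, 1, 4, 2, 5],
})
df = df.sort_values(["region", "month"]).reset_index(drop=True)

# Dummy indicators for each region (month dummies work the same way).
df = pd.concat([df, pd.get_dummies(df["region"], prefix="region")], axis=1)

# Lagged copies of each time-varying variable, computed per region.
for lag in [3, 4, 5]:  # the experiment also used 6-, 9-, and 12-month lags
    df[f"incidents_lag{lag}"] = df.groupby("region")["incidents"].shift(lag)

# 12-month historical mean, lagged three months.
df["incidents_histmean"] = (
    df.groupby("region")["incidents"]
    .transform(lambda s: s.rolling(12, min_periods=1).mean().shift(3))
)
```

Grouping by region before shifting is the important detail: it prevents one region's history from leaking into another region's lagged features.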

Dependent variable

The dependent variable is the number of arrivals recorded by the UNHCR PRMN per region and month. This data is manually collected by a network of enumerators and observers on the ground.

In contrast to Experiment #1, we pooled data across all regions and time periods to fit a single displacement model. On the one hand, this means that our model might not fit each region as well. However, it also means that we are able to learn a more general model of displacement, and to use a larger dataset when training the algorithms.

Independent variables

We use a broad set of independent variables. As described above, these include:

  • Region
  • Month
  • Time since January 2010
  • The number of conflict incidents
  • The number of fatalities
  • The prices of wheat flour, water drums, local goats, red sorghum, petrol, charcoal, and firewood
  • The daily wage
  • The Shilling to USD conversion rate
  • The number of deaths from cholera
  • The number of cases of cholera, malaria, and measles
  • The number of hospital admissions for acute malnutrition
  • Rainfall
  • Vegetation cover (NDVI)
  • The river levels recorded at the Baardheere, Belet Weyne, Buuale, Bulo Burto, Doolow, Jowhar, and Luuq stations
  • For each pair of regions,
    • Whether they share a border
    • The estimated driving distance between their centroids in hours and kilometers
    • The minimum estimated direct (as the crow flies) kilometer distance between any two points in the region

Train/test split

We train the model using data up until August 2018. The test data extends from September 2018 to present.
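The date-based split can be sketched as a simple cutoff on the month column (the data frame here is a toy stand-in for the pooled dataset):

```python
import pandas as pd

# Toy stand-in for the pooled region-month data frame.
df = pd.DataFrame({
    "month": pd.period_range("2018-01", periods=14, freq="M"),
    "arrivals": range(14),
})

# Train on data up to and including August 2018; test on everything after.
cutoff = pd.Period("2018-08", freq="M")
train = df[df["month"] <= cutoff]
test = df[df["month"] > cutoff]
```

A chronological cutoff, rather than a random split, is what makes the evaluation an honest forecast: the test set contains only months the model never saw during training.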

Cross-validation

We implement 10-fold time series cross-validation. Unlike standard cross-validation approaches, time series cross-validation uses an expanding window from the starting period of the dataset, so that the volume of data grows with each fold.
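Scikit-learn implements this expanding-window scheme as `TimeSeriesSplit`; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(22).reshape(-1, 1)  # 22 toy observations in time order
y = np.arange(22)

tscv = TimeSeriesSplit(n_splits=10)
fold_sizes = []
for train_idx, val_idx in tscv.split(X):
    # The training window expands each fold; validation always lies ahead of it.
    assert train_idx.max() < val_idx.min()
    fold_sizes.append(len(train_idx))
```

Because every validation fold sits strictly after its training window, no fold ever evaluates a model on data from its own past.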

Model Performance

Performance metrics

We report multiple performance metrics, including:

  • RMSE: The root mean squared error (in raw counts of arrivals)
  • MAE: The mean absolute error (in raw counts of arrivals)
  • MAPE: The mean absolute percentage error of predictions, relative to the true value
  • R-squared: Essentially, the fraction of variation in the data that is explained by the algorithm
  • PCC: A self-defined metric consisting of the ratio of predicted to true values; an indicator of how much the algorithm is on average over- or under-predicting
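These metrics can be sketched in NumPy. The PCC implementation below follows the definition given above (the mean ratio of predicted to true values) and is an assumption about the exact formula used:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the reported metrics for arrays of true and predicted arrivals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    mae = np.mean(np.abs(y_pred - y_true))
    mape = np.mean(np.abs((y_pred - y_true) / y_true))
    r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # Assumed PCC formula: mean ratio of predicted to true values
    # (> 1 means over-prediction on average, < 1 means under-prediction).
    pcc = np.mean(y_pred / y_true)
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2, "pcc": pcc}
```

Note that MAPE and the assumed PCC both divide by the true value, so months with zero recorded arrivals would need special handling (e.g. exclusion) in practice.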

Model Development (sensitivity analysis)

Limitations