
3b. Experiment 2

Katherine Hoffmann Pham edited this page Sep 13, 2020 · 3 revisions

Data

This experiment used the following data sources:

| Data type | Source | Key variables |
| --- | --- | --- |
| Conflict | ACLED API | Incidents, fatalities |
| Prices | FSNAU | Commodity prices |
| Environment and health | EW-EA dashboard | Rainfall, NDVI, river levels, cholera cases |
| Historical displacement | UNHCR PRMN | Data on arrivals' previous, current, and future regions |
| Distances | Manually imputed | Geographic distances |

Below is a brief overview of the data. Yellow represents high values, while blue represents low values. White represents missing values.


Tools/algorithms

As a "sanity check", we benchmarked our performance against four naive baseline models:

  • Last observation carried forward (LOCF): assumes that arrivals will not change over the three-month forecast horizon
  • Expanding mean: an "average" that takes into account all prior data (i.e., it is not restricted to recent months)
  • Exponential weighted mean: a 12-month "average" that gives more weight to recent periods
  • Historical mean: a rolling average of the last 12 months' arrivals, shifted forward by three months
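The four baselines can be sketched in pandas; the series below is a hypothetical monthly arrivals count for a single region (the values are illustrative, not actual PRMN data):

```python
import pandas as pd

# Hypothetical monthly arrivals for one region (illustrative values only).
arrivals = pd.Series(
    [100, 120, 90, 150, 130, 110, 160, 140, 170, 155, 165, 180, 175],
    index=pd.period_range("2018-01", periods=13, freq="M"),
)

horizon = 3  # forecast three months ahead

# LOCF: the forecast for month t is the value observed at t - 3.
locf = arrivals.shift(horizon)

# Expanding mean: average over all data available up to t - 3.
expanding_mean = arrivals.expanding().mean().shift(horizon)

# Exponential weighted mean over a ~12-month span, lagged three months.
ewm_mean = arrivals.ewm(span=12).mean().shift(horizon)

# Historical mean: rolling 12-month average, shifted forward three months.
historical_mean = arrivals.rolling(window=12, min_periods=1).mean().shift(horizon)
```

In each case the three-month shift ensures the baseline only uses information that would actually be available at forecast time.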

Our primary models include a range of standard regression algorithms, namely:

  • Ridge and lasso regressions
  • Multi-layer perceptrons
  • XGBoost and Adaboost
  • Decision trees
  • Random forests
  • Support vector machines
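The scikit-learn side of this model zoo can be sketched as a dictionary of estimators. Default hyperparameters are shown (the tuned settings are not specified in this page), and XGBoost is noted in a comment because it lives in the separate `xgboost` package:

```python
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
# from xgboost import XGBRegressor  # XGBoost is a separate package

# Default hyperparameters; the experiment's tuned values are not shown here.
models = {
    "ridge": Ridge(),
    "lasso": Lasso(),
    "mlp": MLPRegressor(),
    "adaboost": AdaBoostRegressor(),
    "decision_tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(),
    "svm": SVR(),
}
```

Each estimator exposes the same `fit`/`predict` interface, which makes it straightforward to loop over the dictionary and compare models on identical train/test data.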

We also test a simple Long Short-Term Memory (LSTM) neural network.

Finally, we experimented with alternative tools, such as Eureqa and H2O.ai, but have currently decided to proceed with open-source scikit-learn models for reproducibility and intelligibility.

Technical problem setup

Data pre-processing

The data range was restricted to 2011-04 through the present, to allow for a three-month lag from the start of the arrivals data in 2011-01.

Handling of missing data

For each variable with missing values, a binary indicator was created to flag whether the observation was missing.
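This missingness flagging can be sketched in pandas; the column names and values below are illustrative, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with gaps (column names are illustrative).
df = pd.DataFrame({
    "rainfall": [10.0, np.nan, 5.0],
    "ndvi": [0.3, 0.4, np.nan],
})

# One binary flag per variable marking whether the value was missing.
for col in ["rainfall", "ndvi"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)
```

The flags let a model learn from the pattern of missingness itself, rather than silently dropping or imputing rows.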

Generation of new variables

We also generated the following variables:

  • A dummy indicator for each region
  • A dummy indicator for each month of the year
  • A continuous variable counting months since January 2010
  • A 12-month historical average of all time-varying independent variables for the focal region, lagged by three months
  • All time-varying independent variables for the focal region, lagged by 3, 4, 5, 6, 9, and 12 months
  • All time-varying independent variables for all other regions, lagged by 3, 4, 5, 6, 9, and 12 months
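The feature generation above can be sketched with pandas group-wise operations. The panel below is a toy example (two hypothetical regions, illustrative incident counts), and only a subset of the lags is shown:

```python
import pandas as pd

# Toy panel: one row per (region, month); values are illustrative.
df = pd.DataFrame({
    "region": ["Bay"] * 6 + ["Gedo"] * 6,
    "month": list(pd.period_range("2018-01", periods=6, freq="M")) * 2,
    "incidents": [5, 7, 6, 9, 8, 10, 2, 3, 1, 4, 2, 5],
})
df = df.sort_values(["region", "month"]).reset_index(drop=True)

# Dummy indicators for each region (month dummies work the same way).
df = pd.concat([df, pd.get_dummies(df["region"], prefix="region")], axis=1)

# Lagged copies of each time-varying variable, computed per region.
for lag in [3, 4, 5]:  # the experiment also used 6-, 9-, and 12-month lags
    df[f"incidents_lag{lag}"] = df.groupby("region")["incidents"].shift(lag)

# 12-month historical mean, lagged three months.
df["incidents_histmean"] = (
    df.groupby("region")["incidents"]
    .transform(lambda s: s.rolling(12, min_periods=1).mean().shift(3))
)
```

Grouping by region before shifting is the important detail: it prevents one region's history from leaking into another region's lagged features.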

Dependent variable

The dependent variable is the number of arrivals recorded by the UNHCR PRMN per region and month. This data is manually collected by a network of enumerators and observers on the ground.

In contrast to Experiment #1, we pooled data across all regions and time periods to fit a single displacement model. On the one hand, this means that our model might not fit each region as well. However, it also means that we are able to learn a more general model of displacement, and to use a larger dataset when training the algorithms.

Independent variables

We use a broad set of independent variables. As described above, these include:

  • Region
  • Month
  • Time since January 2010
  • The number of conflict incidents
  • The number of fatalities
  • The prices of wheat flour, water drums, local goats, red sorghum, petrol, charcoal, and firewood
  • The daily wage
  • The Shilling to USD conversion rate
  • The number of deaths from cholera
  • The number of cases of cholera, malaria, and measles
  • The number of hospital admissions for acute malnutrition
  • Rainfall
  • Vegetation cover (NDVI)
  • The river levels recorded at the Baardheere, Belet Weyne, Buuale, Bulo Burto, Doolow, Jowhar, and Luuq stations
  • For each pair of regions,
    • Whether they share a border
    • The estimated driving distance between their centroids in hours and kilometers
    • The minimum estimated direct (as the crow flies) kilometer distance between any two points in the region

Train/test split

We train the model using data up until August 2018. The test data extends from September 2018 to present.
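The date-based split can be sketched as a simple cutoff on the month column (the data frame here is a toy stand-in for the pooled dataset):

```python
import pandas as pd

# Toy stand-in for the pooled region-month data frame.
df = pd.DataFrame({
    "month": pd.period_range("2018-01", periods=14, freq="M"),
    "arrivals": range(14),
})

# Train on data up to and including August 2018; test on everything after.
cutoff = pd.Period("2018-08", freq="M")
train = df[df["month"] <= cutoff]
test = df[df["month"] > cutoff]
```

A chronological cutoff, rather than a random split, is what makes the evaluation an honest forecast: the test set contains only months the model never saw during training.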

Cross-validation

We implement 10-fold time series cross-validation. Unlike standard cross-validation approaches, time series cross-validation uses an expanding window from the starting period of the dataset, so that the volume of data grows with each fold.
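Scikit-learn implements this expanding-window scheme as `TimeSeriesSplit`; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(22).reshape(-1, 1)  # 22 toy observations in time order
y = np.arange(22)

tscv = TimeSeriesSplit(n_splits=10)
fold_sizes = []
for train_idx, val_idx in tscv.split(X):
    # The training window expands each fold; validation always lies ahead of it.
    assert train_idx.max() < val_idx.min()
    fold_sizes.append(len(train_idx))
```

Because every validation fold sits strictly after its training window, no fold ever evaluates a model on data from its own past.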

Model Performance

Performance metrics

We report multiple performance metrics, including:

  • RMSE: The root mean squared error (in raw counts of arrivals)
  • MAE: The mean absolute error (in raw counts of arrivals)
  • MAPE: The mean absolute percentage error of predictions, relative to the true value
  • R-squared: Essentially, the fraction of variation in the data that is explained by the algorithm
  • PCC: A self-defined metric consisting of the ratio of predicted to true values; an indicator of how much the algorithm is on average over- or under-predicting
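These metrics can be sketched in NumPy. The PCC implementation below follows the definition given above (the mean ratio of predicted to true values) and is an assumption about the exact formula used:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the reported metrics for arrays of true and predicted arrivals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    mae = np.mean(np.abs(y_pred - y_true))
    mape = np.mean(np.abs((y_pred - y_true) / y_true))
    r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # Assumed PCC formula: mean ratio of predicted to true values
    # (> 1 means over-prediction on average, < 1 means under-prediction).
    pcc = np.mean(y_pred / y_true)
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2, "pcc": pcc}
```

Note that MAPE and the assumed PCC both divide by the true value, so months with zero recorded arrivals would need special handling (e.g. exclusion) in practice.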

Model Development (sensitivity analysis)

Limitations