3b. Experiment 2
This experiment used the following data sources:
Data type | Source | Key variables |
---|---|---|
Conflict | ACLED API | Incidents, fatalities |
Prices | FSNAU | Commodity prices |
Environment and health | EW-EA dashboard | Rainfall, NDVI, river levels, cholera cases |
Historical displacement | UNHCR PRMN | Arrivals' previous, current, and future regions
Distances | Manually imputed | Geographic distances between regions
Below is a brief overview of the data. Yellow represents high values, while blue represents low values. White represents missing values.
As a "sanity check", we benchmarked our performance against four naive baseline models:
- Last observation carried forward (LOCF): this simply assumes that arrivals will not change over the course of three months
- Expanding mean: this is an "average" which takes into account all prior data (i.e., it is not restricted to recent months)
- Exponential weighted mean: this is a 12-month "average" which gives more weight to recent periods
- Historical mean: this model is simply a rolling average of the last 12 months' arrivals, shifted forward by three months
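The four baselines can be sketched in a few lines of pandas. The arrivals series and its values here are hypothetical; the three-month shift reflects the forecast horizon described above:

```python
import pandas as pd

# Hypothetical monthly arrivals series for one region (illustrative values only)
arrivals = pd.Series(
    [120, 150, 90, 200, 170, 160, 140, 180, 210, 190, 175, 165, 155],
    index=pd.period_range("2018-01", periods=13, freq="M"),
)

horizon = 3  # three-month forecast horizon

# LOCF: assume arrivals stay at their last observed value
locf = arrivals.shift(horizon)

# Expanding mean: average over all prior data, shifted by the horizon
expanding_mean = arrivals.expanding().mean().shift(horizon)

# Exponentially weighted mean over a 12-month span, shifted by the horizon
ewm = arrivals.ewm(span=12).mean().shift(horizon)

# Historical mean: rolling 12-month average, shifted forward three months
historical_mean = arrivals.rolling(window=12, min_periods=1).mean().shift(horizon)
```

Each baseline is shifted by the horizon so that the "prediction" for a given month uses only information available three months earlier.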
Our primary models include a range of standard regression algorithms, namely:
- Ridge and lasso regressions
- Multi-layer perceptrons
- XGBoost and Adaboost
- Decision trees
- Random forests
- Support vector machines
We also test a simple Long Short-Term Memory (LSTM) neural network.
Finally, we experiment with alternative tools -- such as Eureqa and H2O.ai -- but have decided to proceed with open-source scikit-learn models for reproducibility and intelligibility.
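As a rough sketch of this model set, instantiated with scikit-learn defaults (the hyperparameters actually used are not specified here, so the defaults below are an assumption):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# XGBoost ships separately (the `xgboost` package) but exposes a
# scikit-learn-compatible XGBRegressor with the same fit/predict API.
models = {
    "ridge": Ridge(),
    "lasso": Lasso(),
    "mlp": MLPRegressor(max_iter=500),
    "adaboost": AdaBoostRegressor(),
    "tree": DecisionTreeRegressor(),
    "forest": RandomForestRegressor(),
    "svm": SVR(),
}
```

Keeping all estimators behind the common scikit-learn fit/predict interface makes it straightforward to loop the same training and evaluation code over every model.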
The data range was restricted to 2011-04 through the present (to allow for a three-month lag from the start of the arrivals data in 2011-01).
Binary indicator variables were created to flag whether each variable was missing.
We also generated the following variables:
- A dummy indicator for each region
- A dummy indicator for each month of the year
- A continuous variable counting months since January 2010
- A 12-month historical average of all time-varying independent variables for the focal region, lagged by three months
- All time-varying independent variables for the focal region, lagged by 3, 4, 5, 6, 9, and 12 months
- All time-varying independent variables for all other regions, lagged by 3, 4, 5, 6, 9, and 12 months
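The feature generation above can be sketched with pandas on a long-format panel (one row per region-month). The region names, the `incidents` variable, and all values below are hypothetical:

```python
import pandas as pd

# Hypothetical panel: one row per region-month (illustrative values only)
df = pd.DataFrame({
    "region": ["Bay"] * 6 + ["Gedo"] * 6,
    "month": pd.PeriodIndex(list(pd.period_range("2017-01", periods=6, freq="M")) * 2, freq="M"),
    "incidents": [5, 8, 3, 9, 7, 4, 2, 6, 1, 3, 5, 2],
})

# Dummy indicators for region and calendar month
df = pd.concat([df, pd.get_dummies(df["region"], prefix="region")], axis=1)
df["month_of_year"] = df["month"].dt.month
df = pd.concat([df, pd.get_dummies(df["month_of_year"], prefix="month")], axis=1)

# Continuous time trend: months elapsed since January 2010
df["months_since_2010"] = (df["month"].dt.year - 2010) * 12 + (df["month"].dt.month - 1)

# Lagged copies of a time-varying variable, computed within each region
for lag in [3, 4, 5, 6, 9, 12]:
    df[f"incidents_lag{lag}"] = df.groupby("region")["incidents"].shift(lag)

# 12-month historical mean of the variable, lagged by three months
df["incidents_hist12"] = (
    df.groupby("region")["incidents"]
      .transform(lambda s: s.rolling(12, min_periods=1).mean().shift(3))
)
```

Grouping by region before shifting ensures that lagged values never leak across regions' series.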
The dependent variable is the number of arrivals recorded by the UNHCR PRMN per region and month. This data is manually collected by a network of enumerators and observers on the ground.
In contrast to Experiment #1, we pooled data across all regions and time periods to fit a single displacement model. On the one hand, this means that our model might not fit each region as well. However, it also means that we are able to learn a more general model of displacement, and to use a larger dataset when training the algorithms.
We use a broad set of independent variables. As described above, these include:
- Region
- Month
- Time since January 2010
- The number of conflict incidents
- The number of fatalities
- The prices of wheat flour, water drums, local goats, red sorghum, petrol, charcoal, and firewood
- The daily wage
- The Shilling to USD conversion rate
- The number of deaths from cholera
- The number of cases of cholera, malaria, and measles
- The number of hospital admissions for acute malnutrition
- Rainfall
- Vegetation cover (NDVI)
- The river levels recorded at the Baardheere, Belet Weyne, Buuale, Bulo Burto, Doolow, Jowhar, and Luuq stations
- For each pair of regions,
- Whether they share a border
- The estimated driving distance between their centroids in hours and kilometers
- The minimum estimated direct ("as the crow flies") distance in kilometers between any two points in the respective regions
We train the model using data up until August 2018. The test data extends from September 2018 to present.
We implement 10-fold time series cross-validation. Unlike standard cross-validation approaches, time series cross-validation uses an expanding window from the starting period of the dataset, so that the volume of data grows with each fold.
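This expanding-window scheme corresponds to scikit-learn's `TimeSeriesSplit`. A minimal sketch with a stand-in ordered dataset:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for 110 chronologically ordered monthly rows
X = np.arange(110).reshape(-1, 1)

# 10 folds: each fold trains on all data up to a cut-off and validates
# on the following block, so the training window grows with every fold
tscv = TimeSeriesSplit(n_splits=10)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # training indices always precede validation indices: no future leakage
    assert train_idx.max() < test_idx.min()
```

Unlike shuffled k-fold, no fold ever validates on data that precedes its training window, which is what makes the scheme appropriate for forecasting.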
We report multiple performance metrics, including:
- RMSE: The root mean squared error (in raw counts of arrivals)
- MAE: The mean absolute error (in raw counts of arrivals)
- MAPE: The mean absolute percentage error of predictions, relative to the true values
- R-squared: Essentially, the fraction of variation in the data that is explained by the algorithm
- PCC: A self-defined metric: the ratio of predicted to true values, indicating whether the algorithm is, on average, over- or under-predicting
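These metrics can be computed as follows. The arrival counts are hypothetical, and the PCC here is read as a ratio of predicted to true totals, which is an assumption about the self-defined metric described above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([100.0, 200.0, 50.0, 400.0])  # hypothetical true arrivals
y_pred = np.array([110.0, 180.0, 60.0, 380.0])  # hypothetical predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # in raw counts
mae = mean_absolute_error(y_true, y_pred)            # in raw counts
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percent
r2 = r2_score(y_true, y_pred)

# PCC as a ratio of predicted to true totals (assumed reading):
# > 1 suggests over-prediction on average, < 1 under-prediction
pcc = y_pred.sum() / y_true.sum()
```

Note that MAPE is undefined when the true value is zero, so months with zero recorded arrivals need special handling in practice.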