3. Methodology
If we were to write a simple “recipe” with all the “ingredients” required to undertake this predictive analytics project, regardless of the experiment (see the sections below), it would be the following, mirroring a traditional data science process:
- Data (Input Data Section)
- Initial Data Exploration (Input Data Section)
- Modelling Applications
- Open-source scripting knowledge
- Server and strong computing power
- Technical Capacity (a team)
We visualize the results of this process in a map and in graphs of the models, which together we call the Jetson engine. The engine portrays the historical predictions up to April 2019.
Building on the Input Data section, this section covers the problem setup and the data pre-processing, including the handling of missing values and the generation of new variables, as well as some initial exploration that is developed further in the Experiment #2 visualizations.
Modelling application(s): this is a flexible component in Jetson, and it varies depending on data protection requirements. We have tested open-source applications, meaning we built upon existing research on predictive analytics with time-series analysis (TSA) in R and Python. We have also tested, or requested demos of, off-the-shelf options, including licensed software and other commercial applications for modelling purposes.
From a rapid comparison of applications for this type of work, these are the five essential tech specs a modelling application needs in order to support TSA predictive analytics work:
- The application needs to support integration with tabular applications (e.g. Excel/Google Sheets) and ideally should have a Python/JS/R API;
- It has to be able to conduct multivariate time-series forecasting, defining time lags and windows of time to perform machine learning (ML) on the past and project future values. The machine should be able to see the dependency between the dependent variables (x) and the target variable (y), but also the inter-dependency between the dependent variables (x1, x2, x3…). This is one of the main concepts of dynamic modelling; see, for example, this paper on data-driven dynamic modeling for prediction with time series analysis. (A minimal sketch of this lag-based setup follows this list.)
- It needs to be able to run predictions both locally (for data security/protection purposes or small-scale testing) and in the cloud;
- It contains fairly good interpretability elements to understand the machine's calculations [and avoid the A.I. black-box problem]; and
- Ideally, it has a ‘feedback loop’ element: with new data inputs, the machine incorporates the new data points into future predictions, or at least makes anomaly detection easier.

For this reason, each experiment used a different combination of applications to obtain at least four of the five required functionalities. If you know of an application that is open-source software (e.g. Python/R), or your team is building one that contains all of these tech specs together, please let us know at [email protected]. We explored at least 12 modelling applications, both off-the-shelf and open-source, and the majority of them featured only 3 or 4 of the 5 tech specs needed.
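To make the lag-and-window idea in the second point concrete, below is a minimal Python sketch of building lagged features for multivariate forecasting. The column names (arrivals, river_level, goat_price), the number of lags and the values are illustrative assumptions, not Jetson's actual configuration.

```python
import pandas as pd

def make_lagged_features(df: pd.DataFrame, lags: int = 3) -> pd.DataFrame:
    """Create `lags` shifted copies of every column so a regression model can
    learn how past predictor values relate to the future target."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        for lag in range(1, lags + 1):
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    return out

# Hypothetical monthly series: arrivals (target) plus two illustrative predictors.
data = pd.DataFrame(
    {
        "arrivals": [120, 135, 150, 160, 190, 210, 205, 230],
        "river_level": [2.1, 2.3, 2.0, 1.8, 1.7, 1.5, 1.6, 1.4],
        "goat_price": [30, 31, 33, 36, 40, 42, 41, 45],
    },
    index=pd.date_range("2017-01-01", periods=8, freq="MS"),
)

lagged = make_lagged_features(data, lags=2).dropna()   # drop rows without full lag history
X = lagged                                # past values of all variables (x1, x2, ... and past y)
y = data.loc[lagged.index, "arrivals"]    # target value to forecast
print(X.head())
```

Any regression or ML model can then be fitted on X and y; the inter-dependencies between predictors are captured because every variable's lags are available to the model at once.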
To know more about the modeling applications tested and used, please refer to sections 3a. Experiment #1 and 3b. Experiment #2.
Open-source scripting knowledge: for experiments #1 and #2 we needed to build parsers and transformation scripts to automatically collect some of the data sources, as well as to convert some regression functions into predictions (experiment #1 only, given the output of the modelling application we used). These parsers pull data from certain websites and collate it, originally into a repository and now directly into the data visualizations, pushing it to the public website. We also built an additional application (R-shiny) with performance-metric visualizations (e.g. heatmaps, graphs) for selecting the best-performing models. Finally, the data visualization elements: 1) the dashboard is also based on R-shiny and 2) the map is based on JavaScript.
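As an illustration of what such a parser can look like, here is a minimal Python sketch of the general pattern: download a public table, reshape it into a tidy monthly format, and write it out for the downstream visualizations. The URL, the expected "region"/"date" columns and the output path are placeholders; the actual Jetson parsers, sources and transformations differ.

```python
from io import StringIO

import pandas as pd
import requests

SOURCE_URL = "https://example.org/market-prices.csv"   # placeholder endpoint
OUTPUT_PATH = "data/market_prices_monthly.csv"         # consumed by the dashboard/map

def pull_and_transform(url: str) -> pd.DataFrame:
    """Download a raw table and reshape it into a tidy monthly format."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    raw = pd.read_csv(StringIO(resp.text))
    # Assumed raw columns: "region", "date" and one or more numeric value columns.
    tidy = (
        raw.rename(columns=str.lower)
           .assign(date=lambda d: pd.to_datetime(d["date"]))
           .groupby(["region", pd.Grouper(key="date", freq="MS")])
           .mean(numeric_only=True)
           .reset_index()
    )
    return tidy

if __name__ == "__main__":
    pull_and_transform(SOURCE_URL).to_csv(OUTPUT_PATH, index=False)
```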
Server and strong computing power: depending on their features, some modelling applications integrate with cloud-based instances or can connect to virtual machines, which supports running models more efficiently. However, if the data is sensitive, it is recommended to run locally or on-premises. For this reason, we needed computers with a minimum of a 4-core processor and sufficient RAM; any computer designed for rendering, design or video games is enough. To publish the applications (e.g. the website and its domain, the shiny app, the automation), it is recommended to have a server or a virtual machine that can host all the elements for public consumption.
Technical capacity (a team): last but not least, it is imperative to have skilled staff (e.g. computer scientists, data scientists, information systems engineers, artificial intelligence engineers) to maintain the project. To create the two experiments, we needed a technical team working on both the website and the applications. Our technical team is composed of: a) a UX/UI designer, b) 2 data scientists and c) 1 artificial intelligence engineer. For maintenance, it is recommended that one or two people with good knowledge of data science and artificial intelligence maintain the systems regularly.
Each experiment had its own training methodology, given that they used different modelling applications. This is specified in sections 3a. Experiment 1 and 3b. Experiment 2.
Depending on the experiment, different train/test split approaches were taken. For example:
- For experiment #1: the train/test split was done automatically by the modelling application, assigning a random 50% of the data to the testing set. By default, the application randomly shuffles the data and then splits it into training and validation sets based on the total size of the dataset (a rough open-source equivalent is sketched after this list).
- For experiment #2: the training set was created with data up to June 2018.
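A rough open-source equivalent of the automatic 50%/50% shuffled split used in experiment #1, sketched with scikit-learn. The toy data and random seed are arbitrary; the actual split was performed inside the modelling application.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table standing in for one region's lagged feature set.
df = pd.DataFrame({"x1": range(24), "x2": range(24, 48), "arrivals": range(48, 72)})

train_df, test_df = train_test_split(
    df,
    test_size=0.5,      # 50% training / 50% testing
    shuffle=True,       # random shuffle before splitting, as the software does by default
    random_state=42,
)
print(len(train_df), len(test_df))  # 12 12
```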
We also created a holdout from the data available up to the date of each experiment, for example:
- For experiment #1: we had 2 holdouts, one with data from July 2017 onwards (e.g. April 2018) and one with data from October 2017 onwards.
- For experiment #2: we had 1 holdout, with data from July 2018 onwards (to date); a date-based split is sketched below.
For experiment #1: for cross-validation purposes, we utilized a repeated random sub-sampling technique. One of the features of the off-the-shelf software used was the ability to randomly select points for the training and testing sets. Therefore, we created 20 training/testing sets for each region: 10 with data known to the machine up to June 2017 and 10 with data up to September 2017.
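A minimal sketch of that repeated random sub-sampling scheme for one region, assuming 10 random 50/50 splits per knowledge cutoff. The toy data, split proportion and seed are assumptions; the original splits were generated inside the off-the-shelf software.

```python
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Toy monthly series for one region.
monthly = pd.DataFrame(
    {"arrivals": range(36)},
    index=pd.date_range("2015-01-01", periods=36, freq="MS"),
)

splits = []
for cutoff in ["2017-06-30", "2017-09-30"]:
    known = monthly.loc[:cutoff]                                # data "known to the machine"
    splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
    for train_idx, test_idx in splitter.split(known):
        splits.append((known.iloc[train_idx], known.iloc[test_idx]))

print(len(splits))  # 20 training/testing sets for this region
```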
For experiment #2:
Again, depending on the experiment, different metrics were used. For example, for experiment #1 we encountered a complication with the licence of the modelling software, so the first performance metrics used for training/testing, embedded in the software, were not the same ones used for the evaluation set, where we no longer used the software and moved to a more open-source solution (a Python transformation script and R-based calculations).
The following performance metrics were embedded in the application and reported with the results of the test/training modelling process (a sketch of computing them with open-source tools follows the list):
- R-squared: also known as goodness of fit, represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
- Mean Absolute Error (MAE): to measure the average magnitude of the errors in a set of forecasts, without considering their direction.
- Maximum Error: the single largest error among the residuals, i.e. the maximum difference between the point estimate (prediction) and the actual value (actual arrivals).
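These three metrics are straightforward to reproduce with open-source tools; below is a minimal scikit-learn sketch with made-up predictions and actual arrivals.

```python
import numpy as np
from sklearn.metrics import max_error, mean_absolute_error, r2_score

actual = np.array([120, 135, 150, 160, 190, 210])      # made-up actual arrivals
predicted = np.array([118, 140, 149, 170, 185, 215])   # made-up machine predictions

print("R-squared:", r2_score(actual, predicted))
print("MAE:      ", mean_absolute_error(actual, predicted))
print("Max error:", max_error(actual, predicted))
```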
As a lesson learned, we decided to switch to open-source modelling, even though some of the off-the-shelf solutions were very good in terms of accuracy and computational speed. For experiment #1, we then switched the metrics for the evaluation set to the following (sketched after the list):
- Akaike Information Criterion (AIC): AIC is an estimator of the relative quality of statistical models for a given set of data.
- Bayesian Information Criterion (BIC): an additional model-selection metric. Like AIC, it is computed from the model's likelihood on the training data, but it penalizes the number of parameters more heavily; for both criteria, the model with the lower score is preferred.
- Simple Pearson correlation: we also applied a simple complete.obs correlation between the machine predictions and the actual arrivals; this was done both for the last 20 months of data available and for the 7 years of data.
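A minimal sketch of computing these evaluation metrics with open-source tools (statsmodels and pandas). The predictors, coefficients and missing months below are made up for illustration; only the metrics themselves follow the description above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical predictors and actual monthly arrivals (7 years = 84 months).
X = pd.DataFrame({
    "river_level": rng.normal(2.0, 0.3, 84),
    "goat_price": rng.normal(35.0, 5.0, 84),
})
actual = 50 * X["river_level"] + 2 * X["goat_price"] + rng.normal(0, 10, 84)

# AIC / BIC of a candidate regression model.
model = sm.OLS(actual, sm.add_constant(X)).fit()
print("AIC:", model.aic, "BIC:", model.bic)

# Pearson correlation between machine predictions and actual arrivals on
# complete observations only (equivalent to R's use = "complete.obs").
predicted = pd.Series(model.predict(sm.add_constant(X)))
predicted.iloc[::12] = np.nan            # pretend some months are missing
print("Pearson r:", actual.corr(predicted))  # pandas drops NaN pairs
```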
We utilized 3 metrics to evaluate the performance of the evaluation set:
- Percentage of [machine] Correct Classification: also known as PCC, a measure of the proportion of correctly “predicted” months versus the actual arrivals (a small sketch of this calculation follows the list).
- Multivariate Imputation by Chained Equations (MICE) algorithm: used to include the new data (evaluation set) via an imputation technique and assess the fit, with a random seed of 50% of the data.
- Occurrence of top “influencers”: how many times the weights of certain predictors (the strongest dependent variables) appeared recurrently. We highlighted the most repeated variables in the output model function by weight or occurrence.
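As a small illustration of the PCC calculation: defining a “correctly predicted” month requires a tolerance, which is not fixed above, so the ±10% band in this sketch is purely an assumption.

```python
import numpy as np

actual = np.array([120, 135, 150, 160, 190, 210, 205, 230])      # made-up actual arrivals
predicted = np.array([118, 150, 149, 170, 185, 260, 200, 228])   # made-up predictions

tolerance = 0.10  # assumed ±10% band around the actual arrivals
correct = np.abs(predicted - actual) <= tolerance * actual
pcc = correct.mean() * 100
print(f"PCC: {pcc:.1f}% of months predicted 'correctly'")
```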
To visualize all these performance metrics in a more user-friendly way, we created an R-shiny application called modelselector.R.
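The actual model selector is an R-shiny application; purely as an illustration of the kind of view it provides, here is a rough matplotlib sketch of a metrics heatmap with made-up model names and scores.

```python
import matplotlib.pyplot as plt
import numpy as np

models = ["model_A", "model_B", "model_C"]
metrics = ["R-squared", "MAE", "Max error"]
scores = np.array([[0.81, 14.2, 55.0],
                   [0.76, 16.8, 61.3],
                   [0.88, 11.9, 48.7]])   # made-up values

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="viridis")
ax.set_xticks(range(len(metrics)))
ax.set_xticklabels(metrics)
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
for i in range(len(models)):
    for j in range(len(metrics)):
        ax.text(j, i, f"{scores[i, j]:.1f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=ax)
plt.show()
```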
As described in the section above, we conducted a sensitivity analysis for Experiment #1 to see how prevalent the predictors (dependent variables) were.