Skip to content

QBLD Quantile Regression for Binary Longitudinal Data

Ayush Agarwal edited this page Dec 29, 2020 · 8 revisions

qbld: R package

This R package is now on CRAN, and its preferred URL is https://CRAN.R-project.org/package=qbld.

You may find a developmental Github repository here.

Background

Panel data have been attractive for understanding behaviour and dynamics, the modelling complexities involved have moved attention away from the data’s unique capacities. Modeling features such as a binary outcome variable or a quantile analysis, which are relatively straightforward to implement with cross-sectional data, are challenging and computationally burdensome for panel data. However, these features are important as they allow for the modelling of probabilities and lead to a richer view of how the covariates influence the outcome variable. For limited dependent variables, the concern is modelling the latent utility differential in the quantile framework, since the response variable takes limited values and does not yield continuous quantiles.

This project follows Rahman and Vossmeyer (2019) as its motivating literature(hereon referred to as master literature), and contributes to the three literatures by extending the various methodologies to a hierarchical Bayesian quantile regression model for binary longitudinal data (QBLD) and proposing a Markov chain Monte Carlo (MCMC) algorithm to estimate the model. The model handles both common (fixed) and individual-specific (random) parameters (commonly referred to as mixed effects in statistics). The algorithm implements a blocking procedure that is computationally efficient and the distributions involved allow for straightforward calculations of covariate effects.

Related work

To the best of my knowledge, several quantile regression packages already exist namely, quantreg, BayesQR, MCMCpack, bqror, rqPen. However, these packages are either based on frequentist frameworks or incapable of handling mixed effects BLD. BayesQR can handle a BLD framework, albeit using a Metropolis-within-Gibbs sampler and can only account for random effects. The proposed package aims to complement the existing packages and widen the scope of quantile regression framework to heterogenous binary longitudinal data. Similarly brq package implements the idea of Bayesian adaptive Lasso quantile regression.

In a recent Bayesian paper, Luo, Lian, and Tian (2012) develop a hierarchical model to estimate the parameters of conditional quantile functions with random effects. The authors do so by adopting an asymmetric Laplace (AL) distribution for the residual errors and suitable prior distributions for the parameters. Sampler for AL distribution is implemented in the ald package.

However, directly using the AL distribution does not yield tractable conditional densities for all of the parameters and hence a combination of Metropolis-Hastings (MH) and Gibbs sampling is required for model estimation. The use of the MH algorithm may require tuning at each quantile. To overcome this limitation, I will implement a full Gibbs sampling algorithm that utilizes the normal-exponential mixture representation of the AL distribution. The other significant point of difference is that the model handles both common (fixed) and individual-specific (random) parameters.

Details of your coding project

Asymmetric Laplace distribution

The current implementation of the random number generator function rALD in the ald package uses the three parameter Asymmetric Laplace Distribution defined in Koenker and Machado (1999). I will implement a more efficient method by utilizing the normal-exponential mixture representation of the AL distribution, presented in Kozumi and Kobayashi (2011). This method only requires the skewness parameter p as the input and will make use of rnorm and rexp functions.

The Gibbs Sampler

Following the master literature, I will implement a blocked Gibbs sampler for the QBLD framework which can handle heterogeneity in both common (fixed) and individual-specific(random) parameters. The Blocking procedure is essential for efficiency and reducing computational load. This blocked approach also significantly improves the mixing properties of the Markov chain. The success of these blocking techniques can be found in J.S. Liu (1994), Chib and Carlin (1999), and Chib and Jeliazkov (2006). The functions to sample from the conditional distributions will be set up generically for a wide scope usage outside the model.

To improve the speed of the routine, the Markov Chain Monte Carlo (MCMC) part of the algorithm will be programmed in Rcpp and be called from within the function in R.

Handling Categorical Variables

It is imperative that categorical variables be handled correctly, because if the algorithm does not employ any checks and treats the categorical variables as normal integers the sampler will inadvertently lead to wrong solutions. Since, this sampler works for the QBLD setting, categorical variables are highly common in these datasets. The design matrix for the model will be needed to be transformed and augmented properly depending on the number of factor variables. This should be done under the hood as the user must be able to specify the formula generically as in for instance, lm(y~x) can handle both continuous and factor regressors without user interference. I plan to take this into account and employ a factor variable checker, one of which could be made using as.factor function so that the algorithm gets to the right design matrix. This function can be implemented generically for independent usage in multiple data settings including QBLD.

Blocked vs Non-Blocked Sampling

Blocking technique is preferred over the Non-blocking counterpart because is provides with a better autocorrelation structure and relatively faster mixing. However, the blocking procedure employs inverting large matrices which may be too computationally burdensome. I plan to also give the blocking vs non-blocking samplers as an additional choice for the user.

However, Since the non-blocked version involves inverting only diagonal matrices which is computationally a lot easier, I plan to implement a method for the algorithm to decide when is it computationally feasible to switch to a non-blocked sampler despite giving slightly more autocorrelated samples without the user having to decide or wait longer for the results. One generic way to bypass this problem is to switch beyond a specified dimension of the covariate vector.

A second method involves the use of alternative algorithms to sample from the multivariate normal which do not require such inverse calculations. I plan to try and implement one of the following two algorithms, (with the effort to find more efficient methods)

  • Fast sampling, Anirban Bhattacharya (2016)
  • Perturbation-Optimization, François Orieu (2013)

These algorithms are specific to a class of normals having desired structure, and avoid the inversion of matrices and can significantly improve efficiency of the sampler. Since, sampling from a multivariate normal is an essential central step in this Gibbs sampler, this will speed up the algorithm significantly. I will also try to implement the samplers generically for usage outside the present setting.

Sampling from Univariate/Multivariate Truncated Normal

The Gibbs sampler requires to sample from a Truncated Multivariate Normal distribution multiple times across runs and thus it is essential to look at its implementation. In the master literature, algorithm samples $$z_{i} \sim TMVN_{B_{i}}(X_{i}\beta + w_{i}\theta,\Omega_{i})$$ using a series of conditional posteriors which are univariate truncated normal distributions, where the symbols hold their appropriate meaning. This algorithm was proposed in Geweke(1991).

Geweke's method is computationally heavy and suffers from slow mixing issues, therefore I will implement other, more efficient methods to sample from a Truncated Multivariate Normal. A few of the algorithms are mentioned :-

  • Exact Hamiltonian Monte Carlo, Pakman (2013), this has been implemented in the tmg package
  • Minimax Tilting, Botev (2016), this has been implemented in the TruncatedNormal package
  • Fast Simulation of Hyperplane-Truncated, Cong (2017), no known implementation exists, I will try to implement this efficiently and for generic use.

Inefficiency Factor

Final summary table will also include calculations for inefficiency factor (IF) by implementing the Batch Means method employed in Greenberg (2012). I will also try to implement and estimate the probit model for binary longitudinal data (PBLD) using the algorithm presented in Koop, Poirier, and Tobias (2007) and Greenberg (2012) and identical priors for relevant parameters. The results for the QBLD and PBLD models will help us check the conformity with the already existing models while also highlight the extra information we get with the lower and higher quantiles.

Convergence and Standard Errors

I will also make the output object for Gibbs sampler compatible with mcmcse package. Using the mcse.q function Standard Error will be calculated and presented to the user in summary format.

The mcmcse package will also further help in testing for convergence in the implemented sampler and will be essential especially while using Non-Blocking technique. The package will also provide confidence regions and Effective Sample Size for further analysis using the inbuilt functions.

Expected impact

The paper on, "Estimation and Applications of Quantile Regression for Binary Longitudinal Data" by Rahman, and Vossmeyer (2019) develops a framework for quantile regression in binary longitudinal data settings. A novel Markov chain Monte Carlo (MCMC) method is designed to fit the model and its computational efficiency is demonstrated. Panel data have been attractive to researchers for understanding behaviour and dynamics. The paper already has 6 citations in less than 6 months, demonstrating an interest in the field. An open source code in R is not available yet, however, a rudimentary implementation in MATLAB has been achieved. Availability of an open source tool will further help researchers take advantage of the theoretical development.

Researchers, and students in fields related to Quantile Regression, Econometrics, and Policy making that use R will be benefited by the new tools generated with this project. The set of tools to be created will facilitate deeper insights into panel data in a binary response variable environment, that otherwise will require extensive analyses in distinct software platforms. These tools will help in accounting for heterogeneity. Heterogeneity implies that response may differ in terms of certain unmeasured variables that affect the probability of the outcome. Users will find in this package an important set of tools that will help them to make informed decisions.

CRAN release

The qbld package will be submitted to the authors for a peer review. After its hopefully smooth acceptance, I will submit it to CRAN under GPL(>2). A developmental version of the package will be hosted on GitHub and tested for future releases.

Mentors

Students, please contact mentors below after completing at least one of the tests below.

Evaluating Mentor: Dootika Vats ([email protected]) is the author and maintainer of R package mcmcse and a contributor on R package stableGR. She was a GSoC student participant in 2015 for this same package and an expert in MCMC output analysis.

Co-Mentor - Adam Maidman ([email protected]), Microsoft, Seattle, is the author and contributor to the CRAN hosted package rqPen and an expert in quantile regression.

Students

Ayush Agarwal Github

Clone this wiki locally