Statistical and Algorithmic Methods for High-Dimensional and Highly-Correlated Data
Allen, Genevera I
Doctor of Philosophy
Technological advances have led to a proliferation of high-dimensional and highly correlated data. Such data pose enormous challenges for statistical analysis, pushing the limits of distributed optimization, predictive modeling, and statistical inference. We propose new methods, motivated by biomedical applications, for predictive modeling and variable selection in this challenging setting. First, we build predictive models for multi-subject neuroimaging data. This is an ultra-high-dimensional problem in which each subject contributes a matrix of covariates (brain locations by time points) that is highly correlated in both space and time; few methods currently exist to fit supervised models directly to such tensor data. We propose a novel modeling and algorithmic strategy, Local Aggregate Modeling, for applying generalized linear models (GLMs) to this massive tensor data; it not only achieves better prediction accuracy and interpretability but can also be fit in a distributed manner. Second, we propose a novel method, Algorithmic Regularization Paths, for variable selection with high-dimensional and highly correlated data. Existing penalized regression methods such as the Lasso solve a relaxation of the best-subset selection problem that can be computed in polynomial time; however, the Lasso can correctly recover the true sparsity pattern only if the design matrix satisfies the so-called Irrepresentability Condition or related conditions, which are easily violated when the data are highly correlated. Our method achieves better variable selection performance and faster computation in ultra-high-dimensional and high-correlation settings where the Lasso and many other standard methods fail.
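To make the Irrepresentability Condition concrete, the following is a minimal sketch, not taken from the thesis, that evaluates the population form of the condition for a given covariance matrix. It assumes the standard statement of the condition: with C11 the covariance among the truly active variables and C21 the covariance between inactive and active variables, every entry of |C21 C11^{-1} sign(beta)| must be strictly below 1. The function name `irrepresentable_stat` and the two toy covariance matrices are illustrative assumptions.

```python
import numpy as np


def irrepresentable_stat(cov, support, signs):
    """Return max |C21 C11^{-1} sign(beta)| over the inactive variables.

    cov     : (p, p) population covariance of the predictors
    support : indices of the truly active variables
    signs   : signs (+1 / -1) of the active coefficients, in the same order

    Values below 1 mean the condition holds; values of 1 or more mean it fails.
    This helper is hypothetical and only illustrates the definition.
    """
    cov = np.asarray(cov, dtype=float)
    support = np.asarray(support)
    inactive = np.setdiff1d(np.arange(cov.shape[0]), support)
    if inactive.size == 0:
        return 0.0
    c11 = cov[np.ix_(support, support)]   # active-by-active block
    c21 = cov[np.ix_(inactive, support)]  # inactive-by-active block
    w = np.linalg.solve(c11, np.asarray(signs, dtype=float))
    return float(np.max(np.abs(c21 @ w)))


# Uncorrelated predictors: the condition holds trivially (statistic is 0).
cov_indep = np.eye(3)
stat_indep = irrepresentable_stat(cov_indep, support=[0, 1], signs=[1, 1])

# Toy correlated design: x3 = (2/3) x1 + (2/3) x2 + (1/3) noise, with x1, x2
# independent and unit variance. Only x1 and x2 are truly active.
cov_corr = np.array([
    [1.0, 0.0, 2 / 3],
    [0.0, 1.0, 2 / 3],
    [2 / 3, 2 / 3, 1.0],
])
stat_corr = irrepresentable_stat(cov_corr, support=[0, 1], signs=[1, 1])
# stat_corr equals 4/3, which exceeds 1, so the condition is violated.
```

In the second design the inactive variable is correlated with both active variables strongly enough that the statistic exceeds 1, which is the kind of situation the abstract refers to when it says the condition is easily violated by highly correlated data.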
Multi-subject neuroimaging, two-way smoothing, tensor covariates, variable selection, high-dimensional and highly-correlated data.