Statistical Approaches for Large-Scale and Complex Omics Data
Li, Meng; Morris, Jeffrey S.
Doctor of Philosophy
In this thesis, we propose several novel statistical approaches for analyzing large-scale and complex omics data. The thesis consists of three projects. The first project aims to characterize gene-level relationships between DNA methylation and gene expression. We introduce a sequential penalized regression approach to identify methylation-expression quantitative trait loci (methyl-eQTLs), a term we have coined to represent, for each gene and tissue type, a sparse set of CpG loci that best explains gene expression, together with weights indicating the direction and strength of association. These loci and weights can be used to construct gene-level methylation summaries that are maximally correlated with gene expression, for use in integrative models. Using TCGA and MD Anderson colorectal cohorts to build and validate our models, we demonstrate that our strategy explains substantially more expression variability than commonly used integrative methods. In the second project, we propose a unified Bayesian framework for quantile regression on functional responses (FQR). Our approach represents the functional coefficients with basis functions to borrow strength from nearby locations, and places a global-local shrinkage prior on the basis coefficients to achieve adaptive regularization. We develop a scalable Gibbs sampler to implement the approach. Simulation studies show that our method outperforms competing methods. We apply the method to a mass spectrometry dataset and identify proteomic biomarkers of pancreatic cancer that were entirely missed by mean-regression-based approaches. The third project is a theoretical investigation of the FQR problem that extends the second project. We propose an interpolation-based estimator that can be strongly approximated by a sequence of Gaussian processes. From this approximation we derive the convergence rate of the estimator and construct simultaneous confidence bands for the functional coefficient. The strong approximation results also provide a theoretical foundation for developing alternative approaches, which are shown to have better finite-sample performance in simulation studies.
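As background for the quantile regression projects, the following minimal sketch shows the standard check (pinball) loss that quantile regression minimizes in general. It is not the Bayesian FQR method of the thesis: the function names and the toy data are assumptions made purely for illustration, and the grid search over observed values stands in for a real optimizer.

```python
def check_loss(u, tau):
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def empirical_quantile(ys, tau):
    """Return the observed value that minimizes the total check loss at level tau.

    Hypothetical helper for illustration: searching over the data points is
    enough because a minimizer of the check loss always exists among them.
    """
    return min(sorted(ys), key=lambda q: sum(check_loss(y - q, tau) for y in ys))


# Toy data (invented for this sketch) with one large value in the upper tail.
ys = [2.0, 7.0, 1.0, 9.0, 4.0, 30.0, 5.0]

median_fit = empirical_quantile(ys, 0.5)  # the sample median
upper_fit = empirical_quantile(ys, 0.9)   # an upper-tail quantile
mean_fit = sum(ys) / len(ys)              # what a mean-based fit targets

print(median_fit, upper_fit, mean_fit)
```

In this toy sample the median and the 0.9 quantile differ sharply, while the mean collapses the sample to a single value in between. This is one intuition for why effects confined to the tails of a distribution can be invisible to mean regression.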