Dimension reduction methods with applications to high dimensional data with a censored response
Nguyen, Tuan S.
Doctor of Philosophy
Dimension reduction methods have come to the forefront of many applications where the number of covariates, p, far exceed the sample size, N. For example, in survival analysis studies using microarray gene expression data, 10--30K expressions per patient are collected, but only a few hundred patients are available for the study. The focus of this work is on linear dimension reduction methods. Attention is given to the dimension reduction method of Random Projection (RP), in which the original p-dimensional data matrix X is projected onto a k-dimensional subspace using a random matrix Gamma. The motivation of RP is the Johnson-Lindenstrauss (JL) Lemma, which states that a set of N points in p-dimensional Euclidean space can be projected onto a k ≥ 24lnN3e2-2e 3 dimensional Euclidean space such that the pairwise distances between the points are preserved within a factor 1 +/- epsilon. In this work, the JL Lemma is revisited when the random matrix Gamma is defined as standard Gaussian and Achlioptas-typed. An improvement on the lower bound for k is provided by working directly with the distributions of the random distances rather than resorting to the moment generating function technique used in the literature. An improvement on the lower bound for k is also provided when using pairwise L2 distances in the space of the original points and pairwise L 1 distances in the space of the projected points. Another popular dimension reduction method is Partial Least Squares. In this work, a variant of Partial Least Squares is proposed, denoted by Rank-based Modified Partial Least Squares (RMPLS). The weight vectors of RMPLS can be seen to be the solution to an optimization problem. The method is insensitive to outlying values of both the response and the covariates, and takes into account the censoring information in the construction of its weight vectors. Results from simulation and real datasets under the Cox and Accelerated Failure Time (AFT) models indicate that RMPLS outperforms other leading methods for various measures when outliers are present in the response, and is comparable to other methods in the absence of outliers in the response.
Statistics; Theoretical mathematics