Tumor tissue samples comprise a mixture of cancerous and surrounding normal cells. Investigating cellular heterogeneity in tumors is crucial to genomic analyses associated with cancer prognosis and treatment decisions, where the contamination of non-cancerous cells may substantially affect gene expression profiling in clinically derived malignant tumor samples. For this purpose, we first computationally purify tumor profiles, and then develop new statistical modeling techniques to incorporate tumor purity estimates for genetic correlation and prediction of clinical outcome in cancer research. In this thesis, we propose novel approaches to analyzing and modeling cellular heterogeneity problems using genomic data from three perspectives. First, we develop a computation tool, DeMixT, which applies a deconvolution algorithm to explicitly account for at most three cellular components associated with cancer. Compared with the experimental approach to isolate single cells, in silico dissection of tumor samples is faster and cheaper, but computational tools previously developed have limited ability to estimate cellular proportions and tumor-specific expression profiles, when neither is given with prior information. Our model al- lows inclusion of the infiltrating immune cells as a component as well as the tumor cells and stromal cells. We assume a linear mixture of gene expression profiles for
each component satisfying a log2-normal distribution and propose an iterated conditional modes algorithm to estimate parameters. We also involve a novel two-stage estimation procedure for the three-component deconvolution. Our method is computationally feasible and yields accurate estimates through simulations and real data analyses. The estimated cellular proportions and purified expression profiles can pro- vide deeper insight for cancer biomarker studies. Second, we propose a novel edge regression model for undirected graphs, which incorporates subject-level covariates to estimate the conditional dependencies. Current work for constructing graphical models for multivariate data does not take into account the subject specific information, which can bias the conditional independence structure in heterogeneous data. Especially for tumor samples with inherent contamination from normal cells, ignoring the cellular heterogeneity and modeling the population-level genomic graphs may inhibit the discovery of the true tumor graph, which would be attenuated towards the normal graph. Our model allows undirected networks to vary with the exogenous covariates and is able to borrow strength from different related graphs for estimating more robust covariate-specific graphs. Bayesian shrinkage algorithms are presented to efficiently estimate and induce sparsity for generating subject-level graphs. We demonstrate the good performance of our method through simulation studies and apply our method to cytokine measurements from blood plasma samples from hepatocellular carcinoma (HCC) patients and normal controls. Third, we build a model with respect to logistic regression that includes tumor purity as a scaling factor to improve model robustness for the purpose of both estimation and prediction. Penalized logistic regression is used to identify variables (genes) and predict clinical status with binary outcomes that are associated with cancers in high-dimensional genomic data. We aim to reduce the uncertainty introduced by cellular heterogeneity through incorporating the measure of tumor purity to quantify the power of data for each sample. We provide strategies of choosing scaling parameters. Our model is finally shown to work well through a set of simulation studies. We believe that the statistical modeling, technical pipelines and computational results included in our work will serve as a first guide for the development of statistical methods accounting for cellular heterogeneity in cancer research.