Statistical Methods for Elucidating Tumor Heterogeneity and Evolution from Single-cell DNA Sequencing Data
Doctor of Philosophy
Intra-tumor heterogeneity, caused by a combination of mutation and selection, poses significant challenges to the diagnosis and clinical therapy of cancer. Resolving this heterogeneity to identify the tumor cell populations (clones) and delineate their evolutionary history is of critical importance in improving cancer diagnosis and therapy. Doing so requires reconstructing the clonal genotypes and the evolutionary history of the tumor cells. These tasks are challenging because genomic data are most often collected at a single snapshot during the evolution of the tumor's constituent cells. Consequently, computational methods that infer the tumor phylogeny and tumor subpopulations from sequence data are the approach of choice. Recently emerged single-cell DNA sequencing (SCS) technologies promise to resolve intra-tumor heterogeneity at single-cell resolution. However, inherent technical errors in SCS datasets, including false-positive (FP) errors, false-negative (FN) errors due to allelic dropout, cell doublets, and coverage non-uniformity, significantly complicate these tasks. In this thesis, we first develop a likelihood-based approach for inferring tumor trees from imperfect SCS genotype data with potentially missing entries, under a finite-sites model of evolution. Our model of evolution introduces a continuous-time Markov chain that accounts for the effects of different events in tumor evolution, including point mutations, loss of heterozygosity, deletions, and recurrent mutations at genomic sites. Our method probabilistically accounts for FP and FN errors and for missing entries in SCS datasets. Using a heuristic search algorithm, our method finds a maximum-likelihood solution for the phylogenetic tree that best describes the evolutionary history of the tumor cells in the SCS dataset. In doing so, our method also estimates the error rates associated with the datasets.
Another contribution of this method is to infer the order of the mutations on the branches of the inferred tumor phylogeny, using a maximum-likelihood-based dynamic programming algorithm. The performance of our method on synthetic datasets and on experimental datasets from two colorectal cancer patients, where it traces evolutionary lineages in primary and metastatic tumors, suggests that employing a finite-sites model leads to improved inference of tumor phylogenies. Second, we develop a non-parametric Bayesian method that simultaneously reconstructs the clonal populations as clusters of single cells, the mutations associated with each clone, and the genealogical relationships between the clonal populations. It employs a tree-structured Chinese restaurant process as a prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-sites model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and we model cell doublets by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental datasets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-sites model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support for the inferences it makes.
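As a rough illustration of what probabilistically accounting for FP and FN errors and missing entries can look like, the sketch below computes the log-likelihood of one cell's observed genotype vector given a hypothesized true genotype. It is a minimal sketch under stated assumptions, not the thesis's actual model: genotypes are assumed binary (0 = mutation absent, 1 = present), missing entries are assumed to contribute no information, and the function name `genotype_log_likelihood` and its parameters are hypothetical.

```python
import math


def genotype_log_likelihood(observed, true, fp_rate, fn_rate):
    """Log-likelihood of an observed genotype vector given the true one.

    Assumptions (illustrative only): binary genotypes, independent sites,
    and missing observations encoded as None.
    """
    total = 0.0
    for obs, tru in zip(observed, true):
        if obs is None:
            # A missing entry is skipped, i.e. it contributes a factor of 1.
            continue
        if tru == 0:
            # True absence: observing 1 is a false positive.
            p = fp_rate if obs == 1 else 1.0 - fp_rate
        else:
            # True presence: observing 0 is a false negative.
            p = fn_rate if obs == 0 else 1.0 - fn_rate
        total += math.log(p)
    return total


# Hypothetical example: four sites, one of them missing.
example = genotype_log_likelihood(
    observed=[1, 0, None, 0],
    true=[1, 1, 0, 0],
    fp_rate=0.1,
    fn_rate=0.2,
)
```

In a tree search of the kind described above, a term like this would be summed over cells and combined with the probability of the true genotypes under the evolutionary model; treating the error rates as free parameters is what allows them to be estimated alongside the tree.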
Statistical Learning; Probabilistic Graphical Model; Single-cell Sequencing; Tumor Phylogeny; Intra-tumor Heterogeneity