Inference of parsimonious species phylogenies from multi-locus data
Than, Cuong V.
Nakhleh, Luay K.
Doctor of Philosophy
The main focus of this dissertation is the inference of species phylogenies, i.e., the evolutionary histories of species. Species phylogenies allow us to gain insights into the mechanisms of evolution and to hypothesize past evolutionary events. They also find applications in medicine, for example, in understanding antibiotic resistance in bacteria. The reconstruction of species phylogenies is, therefore, of both biological and practical importance. In the traditional method for inferring species trees from genetic data, we sequence a single locus in species genomes, reconstruct a gene tree, and report it as the species tree. Biologists have long acknowledged that a gene tree can differ from the species tree, which implies that this traditional method might infer the wrong species tree. Moreover, reticulate events such as horizontal gene transfer and hybridization make the evolution of species no longer tree-like. The availability of multi-locus data provides us with excellent opportunities to resolve these long-standing problems. In this dissertation, we present parsimony-based algorithms for reconciling species/gene tree incongruence that is assumed to be due solely to lineage sorting. We also describe a unified framework for detecting hybridization despite lineage sorting. To address the first problem of species/gene tree incongruence caused by lineage sorting, we present three algorithms. In Chapter 3, we present an algorithm based on an integer linear programming (ILP) formulation to infer the species tree's topology and divergence times from multiple gene trees. In Chapter 4, we describe two methods that infer the species tree by minimizing deep coalescences (MDC), a criterion introduced by Maddison in 1997. The first method is also based on an ILP formulation, but it eliminates the phase of the Chapter 3 algorithm that enumerates candidate species trees. The second method further eliminates the dependence on external ILP solvers by employing dynamic programming. We ran these methods on both biological and simulated data, and the experimental results demonstrate their high accuracy and speed in species tree inference, which makes them suitable for analyzing multi-locus data. The second problem this dissertation deals with is the detection of reticulation (e.g., horizontal gene transfer, hybridization) despite lineage sorting. The phylogeny-based approach compares the evolutionary histories of different genomic regions and tests them for incongruence that would indicate hybridization. However, since incongruence between species trees and gene trees can also be due to lineage sorting, phylogeny-based hybridization methods might overestimate the amount of hybridization. We present in this dissertation a framework that can handle both hybridization and lineage sorting simultaneously. In this framework, we extend the MDC criterion to phylogenetic networks and use it to propose a heuristic for detecting hybridization despite lineage sorting. Empirical results on a simulated data set and a yeast data set show promising performance and suggest several directions for future research.
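To make the MDC criterion mentioned above concrete, the following is a minimal sketch of how a single candidate species tree could be scored against a set of gene trees by counting extra lineages. It is not the ILP or dynamic programming algorithm of the dissertation, which search over species trees; it only evaluates the cost of one given tree. The sketch assumes rooted trees written as nested Python tuples with string leaf names, one sampled allele per species, and gene-tree leaves labeled by species name. All function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nodes_with_parent(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; the root's parent is None."""
    cluster = leaf_set(tree)
    yield cluster, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from nodes_with_parent(child, cluster)


def lineages_leaving(species_cluster, gene_tree):
    """Count gene lineages at the top of the species branch above species_cluster.

    Under parsimony, lineages coalesce as early as the gene tree allows, so this
    is the number of maximal gene-tree clades contained in species_cluster.
    """
    count = 0
    for cluster, parent in nodes_with_parent(gene_tree):
        inside = cluster <= species_cluster
        parent_inside = parent is not None and parent <= species_cluster
        if inside and not parent_inside:
            count += 1
    return count


def extra_lineages(species_tree, gene_tree):
    """Sum (lineages - 1) over every species-tree branch except the root's."""
    total = 0
    for cluster, parent in nodes_with_parent(species_tree):
        if parent is None:
            continue  # there is no branch above the root
        total += lineages_leaving(cluster, gene_tree) - 1
    return total


def mdc_cost(species_tree, gene_trees):
    """Total number of extra lineages of a species tree over all gene trees."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)


if __name__ == "__main__":
    species = (("A", "B"), "C")
    genes = [(("A", "B"), "C"), ("A", ("B", "C"))]
    print(mdc_cost(species, genes))
```

In this toy input the first gene tree matches the species tree and adds no extra lineages, while the second forces two lineages to persist through the branch above A and B. An MDC search would prefer the candidate species tree with the smallest such total.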
Biology; Bioinformatics; Computer science; Applied sciences