Gene Tree Distributions under Duplication, Loss and Deep Coalescence
Master of Science
Gene duplication and loss are two evolutionary processes that occur across all three domains of life. Because of these processes, different loci across a set of related genomes can have different gene trees. Inferring the phylogeny of the genomes from data sets of such gene trees is a central task in phylogenomics. Furthermore, when the evolutionary history of the genomes includes divergence events that are close in time, as with closely related organisms or rapid radiations, deep coalescence of gene copies may occur alongside duplication and loss, further complicating gene/genome relationships. In this work, we develop probabilistic models of gene evolution that incorporate duplication and loss and account for deep coalescence. We formulate the models in terms of Markov chains and provide algorithms for computing gene tree distributions in two cases: gene trees with branch lengths and gene trees without. We illustrate the use of our work on simulated and biological data by assessing the accuracy of species tree inferences (topology and branch lengths) under our models and contrasting them with inferences that account for deep coalescence alone. Notably, our models sidestep the issue of hidden paralogy by "integrating out" the possible orthology assignments of gene copies. Our work enables new statistical phylogenomic analyses, particularly when hidden paralogy and deep coalescence could be at play.