Since Darwin proposed that all species on Earth evolved from a common ancestor, evolution has played a central role in understanding biology. While the evolutionary histories of individual genes are represented as trees, the evolutionary history of a genome may not be adequately captured by a tree, because some evolutionary events, such as horizontal gene transfer (HGT), do not fit within the branches of a tree. In such cases, phylogenetic networks are more appropriate models of evolutionary history.
In this dissertation, we present computational algorithms for reconstructing phylogenetic networks from different types of data. Under the assumption that each species has a single copy of each gene, and that HGT and speciation are the only events in the course of evolution, gene sequences can be sampled one copy per species for HGT detection. Given alignments of these sequences, we propose systematic methods that estimate the significance of detected HGT events under the maximum parsimony (MP) and maximum likelihood (ML) criteria. The estimated significance addresses the tendency of both optimization criteria to overestimate the number of HGT events during the search for phylogenetic networks, and helps the search identify networks with the ``right'' number of HGT edges. We study the performance of these methods on both synthetic and biological data sets. While the studies show very promising results in identifying HGT edges, they also highlight the issues that remain challenging for each criterion.
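To illustrate why an optimization criterion alone tends to favor extra HGT edges, the following minimal sketch computes a parsimony score for a tree and for a network. It assumes one common definition of network parsimony, in which each site is scored on the best tree displayed by the network. The nested-tuple tree encoding, the function names, and the toy alignment are hypothetical choices made for this illustration and are not taken from the dissertation.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one site on a rooted binary tree.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to its character at this site.
    Returns (candidate state set at the root, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one additional change on this node's edges.
    return left_set | right_set, left_cost + right_cost + 1


def site_states(alignment, i):
    """Column i of the alignment as a leaf -> character mapping."""
    return {taxon: seq[i] for taxon, seq in alignment.items()}


def tree_parsimony(tree, alignment):
    """Sum of per-site Fitch costs on a single tree."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch_site(tree, site_states(alignment, i))[1]
               for i in range(n_sites))


def network_parsimony(displayed_trees, alignment):
    """Each site is scored on the best tree displayed by the network."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        min(fitch_site(t, site_states(alignment, i))[1] for t in displayed_trees)
        for i in range(n_sites)
    )


# Toy data (hypothetical): a species tree and one alternative tree that
# a single HGT edge would add to the set of displayed trees.
species_tree = (("A", "B"), ("C", "D"))
hgt_tree = (("A", "C"), ("B", "D"))
alignment = {"A": "AAC", "B": "AAG", "C": "CAC", "D": "CAG"}

print(tree_parsimony(species_tree, alignment))                   # 3
print(network_parsimony([species_tree], alignment))              # 3
print(network_parsimony([species_tree, hgt_tree], alignment))    # 2
```

Because adding an HGT edge can only enlarge the set of displayed trees, the score under this definition never gets worse as edges are added. A separate significance estimate is therefore needed to decide when to stop adding edges.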
We also develop algorithms that estimate the number of HGT events and reconstruct phylogenetic networks from a collection of trees by using pairwise Subtree-Prune-Regraft (SPR) operations. In general, the methods quickly produce good estimates of the minimum number of HGT events required to reconcile a set of trees. Further, we identify conditions under which the methods do not work well, in order to guide the development of new methods in this area.
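For readers unfamiliar with the SPR operation, the sketch below performs a single SPR move on a rooted binary tree with uniquely labeled leaves. It shows only the elementary move, not the estimation algorithms developed in the dissertation. The nested-tuple encoding and the function names are assumptions made for this example.

```python
def prune(tree, subtree):
    """Remove `subtree` from `tree` and suppress the resulting unary node.

    Assumes `subtree` is a proper subtree of `tree` (not the whole tree).
    """
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == subtree:
        return right   # the parent of `subtree` is replaced by the sibling
    if right == subtree:
        return left
    return (prune(left, subtree), prune(right, subtree))


def regraft(tree, target, subtree):
    """Attach `subtree` on the edge directly above `target` in `tree`."""
    if tree == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    return (regraft(tree[0], target, subtree),
            regraft(tree[1], target, subtree))


def spr(tree, subtree, target):
    """One SPR move: prune `subtree`, then regraft it above `target`."""
    return regraft(prune(tree, subtree), target, subtree)


# Moving leaf B next to leaf C, as a single HGT event might explain.
tree = (("A", "B"), ("C", "D"))
print(spr(tree, "B", "C"))   # ('A', (('C', 'B'), 'D'))
```

The minimum number of such moves needed to turn one tree into another gives a lower bound on the number of HGT events separating the two trees, which is what makes the operation useful for reconciling a collection of gene trees.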
Finally, we relax the assumptions about gene evolution to allow for gene duplication and loss. Under these relaxed assumptions, we analyze gene family trees of proteobacterial strains using a parsimony-based approach to detect evolutionary events. We also discuss the current limitations of parsimony-based approaches in biological data analysis and propose a way to obtain significant estimates.
The evolutionary history of species is complex and shaped by various evolutionary events. Because HGT contributes substantially to this complexity, accurately identifying HGT will help untangle evolutionary histories and answer important biological questions. Because our algorithms identify significant HGT events in the data and reconstruct accurate phylogenetic networks from them, they can be used to address questions arising in large-scale biological data analyses.