Improving the interpretation of metabolic pathfinding results with clustering and compound hubs
Kim, Sarah Michelle
Kavraki, Lydia E
Master of Science
Knowledge of metabolic networks across species can help address many challenges in biotechnology, including metabolic engineering. Large-scale annotated metabolic databases, such as KEGG and MetaCyc, provide a wealth of information to researchers designing novel biosynthetic pathways. However, many metabolic pathfinding tools that assist in identifying possible solution pathways fail to facilitate the interpretation of these pathway results. This work begins to address this problem by examining the performance of standard clustering algorithms on results produced by a popular metabolic pathfinding algorithm and by suggesting the use of compound “hubs” for examining the produced results. To address the first point, we assessed the ability of standard clustering methods to group pathways as an expert would. Three standard clustering methods (hierarchical, k-means, and k-medoids) along with three pairwise distance measures (Levenshtein, Jaccard, and n-gram) were used to group lysine, isoleucine, and 3-hydroxypropanoic acid (3-HP) biosynthesis pathways produced by a recent metabolic pathfinding algorithm. The quality of the resulting clusters was quantitatively evaluated against expected pathway groupings taken from the literature. Hierarchical clustering and Levenshtein distance appeared to best match external pathway labels across the three biosynthesis pathways, but the results suggest that grouping pathways with more complex underlying topologies may require more tailored clustering methods. In summary, the clustering of pathways proved much more nuanced than expected, owing to the various intricacies of computed paths and the several ways of getting between two compounds while conserving the same number of atoms. To address the second point, we investigate the use of “hub” compounds. Hub compounds were selected by metabolic experts from among compounds with a large number of incoming reactions (high in-degree).
An analysis of our results shows that hub compounds are common in the pathfinding results but that they alone cannot be used to cluster pathways. Our observations give rise to a newly proposed method that computes pathways between input and output compounds by using a precomputed lookup table of pathways between the most well-connected compound hubs in the metabolic network. The ultimate goal of precomputing the lookup table is to reduce the search space while still obtaining most, if not all, of the pathway results found by the original search algorithm. We provide evidence that this is a promising direction for future research and can yield results that are more easily interpreted and refined by users.
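As a rough illustration of the three pairwise distance measures named above, the following minimal sketch treats each pathway as an ordered list of compound identifiers. That representation, the function names, the choice of bigrams (n = 2), the Jaccard-style comparison of n-gram sets, and the placeholder identifiers are all assumptions made for this example; the thesis may define its pathway encoding and distances differently.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard_distance(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """One minus the ratio of shared items to all items; order is ignored."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def ngrams(seq: Sequence[Hashable], n: int = 2) -> Set[Tuple[Hashable, ...]]:
    """Set of contiguous length-n windows of the sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def ngram_distance(a: Sequence[Hashable], b: Sequence[Hashable], n: int = 2) -> float:
    """Jaccard distance computed over n-gram sets (an assumed definition)."""
    return jaccard_distance(ngrams(a, n), ngrams(b, n))


# Hypothetical pathways; the letters stand in for compound identifiers.
p1 = ["A", "B", "C", "D"]
p2 = ["A", "B", "E", "D"]
p3 = ["A", "F", "G", "D"]
```

A matrix of such distances over all pathway pairs could then be passed to a standard hierarchical or k-medoids clustering routine. The sequence-based measures are sensitive to compound order, while plain Jaccard distance is not, which is one reason the measures can disagree on how pathways should be grouped.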
metabolic pathfinding; clustering; compound hubs