Computation, Visualization, and Applications of Convex Clustering
Allen, Genevera I
Doctor of Philosophy
Clustering is a ubiquitous tool for exploratory data analysis across the sciences, with the general aim of identifying groups of similar objects. Recent work has recast the clustering problem within the framework of convex optimization, addressing many shortcomings of traditional methods in interpretability, stability, and parameter selection. The method of Convex Clustering has proven to be a canonical example of such an approach, and its extensions and applications will be the focus of this work. We begin by considering the application of Convex Clustering in the novel setting of region detection for high-throughput genomic data. We illustrate the versatility of Convex Clustering by developing an extension, Spatial Convex Clustering (SpaCC), specifically tailored to multivariate, spatially correlated genomics data. We demonstrate that SpaCC achieves state-of-the-art performance on the well-studied problem of Copy Number Segmentation, and show it to be similarly successful in the novel setting of DNA Methylation region detection. Next, we address several shortcomings of Convex Clustering, including slow computation and a lack of familiar visualizations relative to its traditional counterparts. To do so, we introduce algorithms for the fast approximation of the Convex Clustering solution path and provide both theoretical guarantees of error control and empirical investigations. We then provide a suite of visualization techniques to aid in the interpretation of the clustering solution path, exploring their insights via several real data examples. Finally, we introduce the R package clustRviz, which gives practitioners direct access to the fast computation and dynamic visualizations introduced throughout.
Clustering; Convex Optimization; Data Visualization; High-Throughput Genomics