Very Large Scale Bayesian Machine Learning
Doctor of Philosophy
This thesis aims to scale Bayesian machine learning (ML) to very large datasets. First, I propose the pairwise Gaussian random field (PGRF), a graphical, factor-based model for high-dimensional data imputation. Besides being highly accurate, the PGRF is more efficient and scalable than the Gaussian Markov random field (GMRF). Experiments show that imputation with the PGRF, followed by linear regression (LR) or a support vector machine (SVM), reduces RMSE by 10% to 45% compared with mean imputation followed by LR or SVM. Furthermore, the PGRF scales imputation to very large datasets distributed over a 100-machine cluster, which the GMRF and other Gaussian methods could not handle at all. Unfortunately, the PGRF model is hard to implement: approximately 18,000 lines of Hadoop code and four months of work on distributed debugging and execution.

To reduce this human effort, I designed a database system called SimSQL. SimSQL supports rich analytical methods such as Bayesian ML and scales them to terabytes of data distributed over 100 machines. It extends the analytical power of relational database systems while retaining their merits: a declarative language, transparent optimization, and automatic parallelization. SimSQL builds upon the MCDB uncertainty database and allows the definition of recursive stochastic tables, making it an ideal platform for Markov chain simulations and iterative algorithms such as PageRank. To evaluate SimSQL's performance, I introduce an objective benchmark that compares SimSQL with Giraph, GraphLab and Spark on five Bayesian ML problems. The results show that SimSQL provides the best programmability and competitive performance. To implement a typical Bayesian ML model, SimSQL requires 1X less code than Spark, 6X less than GraphLab, and 12X less than Giraph, while its running time is at worst within a 5X slowdown relative to Giraph and GraphLab.
In brief, I consider both modeling and inference for large-scale Bayesian ML. The goals on both sides are the same: scaling Bayesian ML to very large datasets, achieving better performance, and reducing the time cost of designing, implementing, and executing ML algorithms.
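The advantage of Gaussian-model imputation over mean imputation, mentioned above, can be illustrated with a minimal sketch. This is not the PGRF itself; it is only the bivariate special case of the underlying idea, with made-up numbers: a Gaussian model replaces a missing entry with its conditional mean given the observed variables, while column-mean imputation ignores the observed correlates entirely.

```python
def conditional_mean(mu_x, mu_y, sigma_x, sigma_y, rho, y_obs):
    """E[X | Y = y_obs] for a bivariate Gaussian (X, Y) with correlation rho."""
    return mu_x + rho * (sigma_x / sigma_y) * (y_obs - mu_y)

# Suppose X is missing and its correlate Y = 2.0 is observed,
# with both variables standardized (mean 0, standard deviation 1).
mean_estimate = 0.0   # column-mean imputation: always the marginal mean
model_estimate = conditional_mean(0.0, 0.0, 1.0, 1.0, 0.9, 2.0)  # 0.9 * 2.0 = 1.8
```

When variables are strongly correlated, the conditional-mean estimate tracks the observed data instead of collapsing every missing entry to a single constant, which is why model-based imputation can cut downstream RMSE substantially.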
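PageRank, cited above as a target workload for SimSQL's recursive tables, can be sketched as a plain power iteration. The three-page graph, damping factor, and iteration count below are illustrative values, not from the thesis; in SimSQL the same per-iteration update would be expressed declaratively over a recursively defined table rather than in an explicit loop.

```python
# PageRank as a fixed-point iteration over a tiny made-up graph.
links = {0: [1, 2], 1: [2], 2: [0]}   # page -> outgoing links
n, d = 3, 0.85                        # page count, damping factor
rank = {p: 1.0 / n for p in links}    # uniform starting distribution

for _ in range(50):                   # iterate to (near) convergence
    new = {p: (1 - d) / n for p in links}
    for p, outs in links.items():
        share = d * rank[p] / len(outs)   # mass sent along each out-link
        for q in outs:
            new[q] += share
    rank = new
```

Each iteration depends only on the previous one, which is exactly the chain-structured dependency that a recursive stochastic table captures: version i of the table is defined as a query over version i - 1.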
Bayesian inference; large scale machine learning; MapReduce