Large Scale Online Aggregation Via Distributed Systems
Doctor of Philosophy
From movie recommendations to fraud detection to personalized health care, there is growing need to analyze huge amounts of data quickly. To deal with huge amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analysis. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects I have worked on under this topic. In the first project, I consider extending Online Aggregation (OLA) to a MapReduce environment. Online aggregation (OLA) allows the user to compute an arbitrary aggregation function over a data set and output probabilistic bounds on accuracy in online fashion. OLA in a relational database system uses classical sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instance, the system is likely processing blocks that take longer to process. Hence blocks that take longer to process are less likely to be taken into account when an estimate is generated. Since one might expect correlation between processing time and the aggregated value of a block, the estimates for the aggregate can be biased. To address the inspection paradox, I propose a Bayesian model that utilizes a joint prior over the values to be aggregated and time take to process/schedule each block. Since the model is taking timing information into account, the bias is removed. This model is implemented on Hyracks, an open-source project similar to Hadoop, the most popular implementation of MapReduce. In the second project, I consider implementing gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds the local minima of a function L(w) by starting with an initial point $w_0$ and then taking steps in direction of negative gradient of the function to be optimized. The computation of the gradient is referred to as an epoch and gradient descent computes many epochs iteratively until completion. If the number of data points N is very large it can take lot of time to compute the aggregate for an epoch k. Since the gradient descent algorithm is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed up this algorithm in a MapReduce framework. The key technical question that must be answered is “When do we stop the OLA estimation for a given epoch?”. In this thesis, I propose and evaluate a new statistical model for addressing this question Finally, I design, implement, and evaluate a particular machine learning algorithm. An extremely popular feature selection methodology is topic modeling . A topic is defined as a probability distribution over sets of words or phrases and each document in the corpus is drawn from mixture of these topics. A topic model for a corpus specifies the set of topics, as well as the proportion in which they are present in any given document. The recent interest in topic models has been driven by the explosion of electronic, text based data that are available for analysis. From web pages to emails to microblogs, text-based data are everywhere. However, not all electronically-available natural language corpora are text-based. In my thesis, I consider the problem of learning topic models over spoken language. My work is motivated by our involvement with the Spoken Web (also called the World Wide Telecomm Web), which allows users in rural India to post farming-related questions and responses to an audio forum using mobile phones. I propose a new topic model that leverages the statistical algorithms used in most modern speech-to-text software. I develop alternative version of the popular LDA topic model called the spoken topic model, or STM for short. This model uses a Bayesian interpretation of the output of a speech-to-text software that takes into account the software's explicit uncertainty description (the phrases and weights) in a principled fashion.
Online aggregation; Gradient Descent; MapReduce; Hadoop; Topic Models