## Large Scale Online Aggregation Via Distributed Systems

##### Author

Pansare, Niketan

##### Date

2014-12-04##### Advisor

Jermaine, Chris

##### Degree

Doctor of Philosophy

##### Abstract

From movie recommendations to fraud detection to personalized health care, there is growing need to analyze huge amounts of data quickly. To deal with huge
amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analysis. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects I have worked on under this topic.
In the first project, I consider extending Online Aggregation (OLA) to a MapReduce environment. Online aggregation (OLA) allows the user to compute an arbitrary
aggregation function over a data set and output probabilistic bounds on accuracy in online fashion. OLA in a relational database system uses classical
sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instance, the system is likely processing blocks that take longer to process. Hence blocks that take longer to process are less likely to be taken into account when an estimate is generated. Since one might expect correlation between processing time and the aggregated value of a block, the estimates for the aggregate can be
biased. To address the inspection paradox, I propose a Bayesian model that utilizes a joint prior over the values to be aggregated and time take to process/schedule
each block. Since the model is taking timing information into account, the bias is removed. This model is implemented on Hyracks, an open-source project
similar to Hadoop, the most popular implementation of MapReduce.
In the second project, I consider implementing gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds the local minima of
a function L(w) by starting with an initial point $w_0$ and then taking steps in direction of negative gradient of the function to be optimized. The computation of
the gradient is referred to as an epoch and gradient descent computes many epochs iteratively until completion. If the number of data points N is very large it can
take lot of time to compute the aggregate for an epoch k. Since the gradient descent algorithm is essentially a user-defined aggregate function, the OLA framework
developed in the first part of my thesis can be used to speed up this algorithm in a
MapReduce framework.
The key technical question that must be answered is “When do we stop the OLA estimation for a given epoch?”. In this thesis, I propose and evaluate a new
statistical model for addressing this question
Finally, I design, implement, and evaluate a particular machine
learning algorithm. An extremely popular feature selection methodology
is topic modeling . A topic is defined as a probability distribution over sets of words
or phrases and each document in the corpus is drawn from mixture of
these topics. A topic model for a corpus specifies the set of topics, as well as the
proportion in which
they are present in any given document. The recent interest in topic
models has been driven by the explosion of electronic, text based
data that are available for analysis. From web pages to emails to
microblogs, text-based
data are everywhere. However, not all electronically-available natural
language corpora are text-based. In my thesis, I consider the problem of learning
topic models over
spoken language. My work is motivated by our involvement with the
Spoken Web (also called the World
Wide Telecomm Web), which allows users in rural India to post
farming-related questions
and responses to an audio forum using mobile phones. I propose a new
topic model that leverages the statistical algorithms used
in most modern speech-to-text software. I develop alternative
version of the popular LDA topic model called the
spoken topic model, or STM for short. This model uses a Bayesian
interpretation of the output of a speech-to-text software that takes
into account the software's explicit uncertainty description (the
phrases and
weights) in a principled fashion.

##### Keyword

Online aggregation; Gradient Descent; MapReduce; Hadoop; Topic Models