Some Rare LSH Gems for Large-scale Machine Learning
Doctor of Philosophy
Locality Sensitive Hashing (LSH) is an algorithm for approximate nearest neighbor (ANN) search in high dimensional space. In this thesis, instead of using LSH as an ANN tool, we investigate the possibility of using LSH for addressing the computational and memory challenges in large scale machine learning tasks. We show some rare 'gems' of locality-sensitive hashing that can shed important lights on large scale learning system. We first show the power of LSH for high-speed anomaly detection. Anomaly detection is one of the frequent and important subroutines deployed in largescale data processing applications. Even being a well-studied topic, existing techniques for unsupervised anomaly detection require storing significant amounts of data, which is prohibitive from memory, latency and privacy perspectives, especially for small mobile devices which has ultralow memory budget and limited computational power. We Introduce ACE (Arrays of (locality-sensitive) Count Estimators) algorithm that can much faster than most state-of-the-art unsupervised anomaly detection algorithms with very low memory requirement. Secondly, we show a novel sampler view of LSH and propose to use LSH for scaling-up Split-Merge MCMC Inference. Split-Merge MCMC (Monte Carlo Markov Chain) is one of the essential and popular variants of MCMC for problems when an MCMC state consists of an unknown number of components. It is well known that state-of-the-art methods for split-merge MCMC do not scale well. Here, we leverage some unique properties of weighted MinHash, which is a popular LSH, to design a novel class of split-merge proposals which are significantly more informative than random sampling but at the same time efficient to compute. In the end, we show a practical usage of LSH for Indoor Navigation tasks. In this work, we developed the first camera based (privacy-preserving) indoor mobile positioning system, CaPSuLe, which does not involve any communication (or data transfer) with any other device or the cloud. The system only needs 78.9MB of memory and can localize a mobile device with $92.11\%$ accuracy with very fast speed. The ability to run the complete system on the mobile device eliminates the need for the cloud, making CaPSuLe a privacy-preserving localization algorithm by design as it does not require any communication.
Locality Sensitive Hashing; Large Scale Machine Learning