ECE Theses and Dissertations
http://hdl.handle.net/1911/8316
2019-10-23T02:32:27Z

ShuFFLE: Automated Framework for HArdware Accelerated Iterative Big Data Analysis
http://hdl.handle.net/1911/88235
2016-03-15T19:44:12Z
2014-10-22T00:00:00Z
This thesis introduces ShuFFLE, a set of novel methodologies and tools for automated analysis and hardware acceleration of large and dense (non-sparse) Gram matrices. Such matrices arise in most contemporary data mining applications; they are hard to handle because of the complexity of known matrix transformation algorithms and the inseparability of non-sparse correlations. ShuFFLE learns the properties of the Gram matrices and their rank for each particular application domain. It then uses these properties to reconfigure accelerators that operate scalably on the data in that domain. The learning is based on new factorizations that work at the limit of the matrix rank to optimize the hardware implementation by minimizing costly off-chip memory and I/O interactions. ShuFFLE also provides users with a new Application Programming Interface (API) for implementing a customized iterative least squares solver that analyzes big and dense matrices in a scalable way. This API is readily integrated within the Xilinx Vivado High Level Synthesis tool to translate the user's code to a Hardware Description Language (HDL). As a case study, we implement the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) as an l1-regularized least squares solver. Experimental results show that, for FISTA computation on a Field-Programmable Gate Array (FPGA) platform with a Gram matrix of 4.6 billion non-zero elements, ShuFFLE attains an 1800x improvement in iteration speed over the conventional solver and about a 24x improvement over our factorized solver running on a general-purpose processor with the SSE4 architecture.
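For readers unfamiliar with the case-study algorithm, the following is a minimal plain-Python sketch of textbook FISTA applied to the l1-regularized least squares problem min 0.5*||Ax - b||^2 + lam*||x||_1. It is not the ShuFFLE API, the factorized solver, or the FPGA implementation; the function names and the choice of the squared Frobenius norm as a step-size bound are assumptions made for illustration.

```python
import math

def matvec(A, x):
    """Compute A @ x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T @ y without forming the transpose."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]

def fista(A, b, lam, n_iter=500):
    """Textbook FISTA for min_x 0.5*||A x - b||^2 + lam*||x||_1 (illustrative only)."""
    n = len(A[0])
    # Assumed step-size bound: ||A||_F^2 >= lambda_max(A^T A), so 1/L is a safe step.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n
    y = x[:]
    t = 1.0
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, y), b)]
        grad = rmatvec(A, residual)                      # gradient of the smooth term at y
        x_new = soft_threshold([yi - gi / L for yi, gi in zip(y, grad)], lam / L)
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_new                         # momentum coefficient
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x
```

With A equal to the identity, the minimizer is simply the soft-thresholded right-hand side, which makes a convenient sanity check. Each iteration is dominated by the two matrix-vector products, which is the cost a factorized or hardware-accelerated solver would target.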

A Data and Platform-Aware Framework For Large-Scale Machine Learning
http://hdl.handle.net/1911/88212
2019-08-30T00:13:03Z
2015-04-24T00:00:00Z
This thesis introduces a novel framework for execution of a broad class of iterative machine learning algorithms on massive and dense (non-sparse) datasets. Several classes of critical and fast-growing data, including image and video content, contain dense dependencies. Current pursuits are overwhelmed by the excessive computation, memory access, and inter-processor communication overhead incurred by processing dense data. On the one hand, solutions that employ data-aware processing techniques produce transformations that are oblivious to the overhead created on the underlying computing platform. On the other hand, solutions that leverage platform-aware approaches do not exploit the non-apparent data geometry.
My work is the first to develop a comprehensive data- and platform-aware solution that provably optimizes the cost (in terms of runtime, energy, power, and memory usage) of iterative learning analysis on dense data. My solution is founded on a novel tunable data transformation methodology that can be customized with respect to the underlying computing resources and constraints.
My key contributions include: (i) introducing a scalable and parametric data transformation methodology that leverages coarse-grained parallelism in the data to create versatile and tunable data representations, (ii) developing automated methods for quantifying platform-specific computing costs in distributed settings, (iii) devising optimally-bounded partitioning and distributed flow scheduling techniques for running iterative updates on dense correlation matrices, (iv) devising methods that enable transforming and learning on streaming dense data, and (v) providing user-friendly open-source APIs that facilitate adoption of my solution on multiple platforms including (multi-core and many-core) CPUs and FPGAs.
Several learning algorithms such as regularized regression, cone optimization, and power iteration can be readily solved using my APIs. My solutions are evaluated on a number of learning applications including image classification, super-resolution, and denoising. I perform experiments on various real-world datasets with up to 5 billion non-zeros on a range of computing platforms including Intel i7 CPUs, Amazon EC2, IBM iDataPlex, and Xilinx Virtex-6 FPGAs. I demonstrate that my framework can achieve up to 2 orders of magnitude performance improvement in comparison with current state-of-the-art solutions.

Client Beamforming for Rate Scalability of MU-MIMO Networks
http://hdl.handle.net/1911/88186
2016-03-15T19:44:12Z
2015-04-24T00:00:00Z
The multi-user MIMO (MU-MIMO) technology allows an AP with multiple antennas to simultaneously serve multiple clients to improve the network capacity. To achieve this, the AP leverages zero-forcing beamforming (ZFBF) to eliminate the intra-cell interference between served clients. However, current MU-MIMO networks suffer from two fundamental problems that limit the network capacity. First, for a single MU-MIMO cell, as the number of clients approaches the number of antennas on the AP, the cell capacity often flattens and may even drop. Second, for multiple MU-MIMO cells, the multiple APs cannot simultaneously serve their clients due to inter-cell interference, so that the concurrent streams are constrained to a single cell with limited network capacity. Our unique perspective on tackling these two problems is that modern mobile clients can be equipped with multiple antennas for beamforming. We have proposed two solutions that leverage the client antennas. For the capacity scalability problem in a single MU-MIMO cell, we use multiple client antennas to improve the orthogonality between the channel vectors of the clients. The orthogonality between clients' channels determines the SNR reduction caused by the AP's zero-forcing beamforming, and is therefore critical for making the capacity of a MU-MIMO cell scale with the number of clients. We have devised an 802.11ac-based protocol called MACCO, in which each client locally optimizes its beamforming weights based on the channel knowledge obtained from overhearing other clients' channel reports. For the inter-cell interference problem in multiple MU-MIMO cells, we leverage multiple client antennas to help the interfering APs cancel the inter-cell interference between them in a coordinated way. To achieve such coordinated interference cancellation in a practical way, we have proposed a two-step optimization consisting of antenna usage optimization and beamforming weight optimization.
We have devised another 802.11ac-based protocol called CoaCa, which integrates this two-step optimization into 802.11ac with small modifications and negligible overhead, allowing each AP and client to locally identify the optimal beamforming weights. We have implemented both MACCO and CoaCa on the WARP SDR platform leveraging the WARPLab framework, and experimentally evaluated their performance under real-world indoor wireless channels. The results demonstrate the effectiveness of MACCO and CoaCa in solving the capacity scalability and inter-cell interference problems of MU-MIMO networks. First, MACCO increases the capacity of a single MU-MIMO cell with eight AP antennas and eight clients by 35% on average, compared to existing solutions that use client antennas differently. Second, for a MU-MIMO network with two cells, by cancelling the inter-cell interference CoaCa converts most of the increase in the number of streams (50%-67%) into network capacity improvement (41%-52%).
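The link between channel orthogonality and the SNR reduction under zero-forcing can be illustrated with a toy calculation. The sketch below is not MACCO or CoaCa; it assumes a two-antenna AP, two single-stream clients, real-valued channel vectors, and unit transmit power per stream, and the function name `zf_gains` is hypothetical. Under those assumptions, for unit-norm channels separated by an angle theta, each client's power gain works out to sin^2(theta).

```python
import math

def zf_gains(h1, h2):
    """Per-client power gain under zero-forcing for a 2-antenna AP and two clients.

    h1 and h2 are the clients' (real-valued) channel vectors. The beamforming
    vectors are the columns of H^{-1}, each scaled to unit power, so client k
    receives a gain of 1 / ||w_k||^2 and sees no interference from the other stream.
    """
    (a, b), (c, d) = h1, h2
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("channel vectors are parallel; zero-forcing is undefined")
    # Columns of the inverse of H = [[a, b], [c, d]]
    w1 = (d / det, -c / det)   # satisfies h1.w1 = 1, h2.w1 = 0
    w2 = (-b / det, a / det)   # satisfies h1.w2 = 0, h2.w2 = 1
    return tuple(1.0 / (w[0] ** 2 + w[1] ** 2) for w in (w1, w2))

def gains_at_angle(theta_deg):
    """Gains for unit-norm channels separated by theta_deg degrees."""
    theta = math.radians(theta_deg)
    return zf_gains((1.0, 0.0), (math.cos(theta), math.sin(theta)))
```

Orthogonal channels (90 degrees) keep the full gain of 1 for both clients, while channels 30 degrees apart retain only a quarter of it. This is the loss that better-conditioned client channels are meant to avoid.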

Compressive Sensing in Positron Emission Tomography (PET) Imaging
http://hdl.handle.net/1911/88183
2016-03-15T19:44:12Z
2015-04-16T00:00:00Z
Positron emission tomography (PET) is a nuclear medicine functional imaging modality applicable to several clinical problems, especially the detection of metabolic activity (as in cancer). PET scanners use multiple rings of gamma ray detectors that surround the patient. These scanners are quite expensive (1-3 million dollars); a technology that reduces the number of detectors per ring without affecting image quality could therefore reduce the scanner cost, making this imaging modality more accessible to patients. In this thesis, a mathematical technique known as compressive sensing (CS) is applied in an effort to decrease the number of detectors required while maintaining good image quality.
A CS model was developed based on a combination of gradient magnitude and wavelet domains to recover missing observations associated with PET data acquisition. The CS model also included a Poisson-distributed noise term. The overall model was formulated as an optimization problem wherein the cost function was a weighted sum of the total variation and the L1-norm of the wavelet coefficients. Subsequently, the cost function was minimized subject to the CS model equations, the partially observed data, and a penalty function for noise suppression (the Poisson log-likelihood function). We refer to the complete model as the WTV model.
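One way to write the objective described above in penalized form is sketched below. The notation is assumed here rather than taken from the thesis: x is the image, A the system matrix, y the measured counts, Omega the set of retained detector bins, Psi the wavelet transform, and alpha, beta, mu non-negative weights. The last term is the negative Poisson log-likelihood up to a constant; the exact constraints and weighting used in the thesis may differ.

```latex
\min_{x \ge 0} \;
  \alpha \, \mathrm{TV}(x)
  + \beta \, \lVert \Psi x \rVert_1
  + \mu \sum_{i \in \Omega} \Bigl( [Ax]_i - y_i \log [Ax]_i \Bigr)
```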
This thesis also explores an alternative reconstruction method, in which a different CS model was developed based on an adaptive dictionary learning (DL) technique for data recovery in PET imaging. Specifically, a PET image is decomposed into small overlapping patches and the dictionary is learned from these patches. The technique has good sparsifying properties, and the dictionary tends to capture local as well as structural similarities without sacrificing resolution. Recovery is accomplished in two stages: a dictionary learning phase followed by a reconstruction step.
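The patch decomposition step can be illustrated with a short sketch. It covers only the extraction of overlapping patches, not the dictionary learning or the reconstruction; the function name, the row-major flattening, and the stride parameter are assumptions made for illustration.

```python
def extract_patches(image, size, stride=1):
    """Return all size x size patches of a 2D image, each flattened row by row.

    image is a list of equal-length rows. A stride smaller than size makes
    neighbouring patches overlap.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            patch = [image[r + i][c + j] for i in range(size) for j in range(size)]
            patches.append(patch)
    return patches
```

For a 3 x 3 image and 2 x 2 patches with stride 1, this yields four patches that each share pixels with their neighbours. The flattened patches would then form the training set from which a dictionary is learned.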
In addition to developing optimized CS reconstruction, this thesis also investigated: (a) the limits of detector removal when using the DL CS reconstruction algorithm; and (b) the optimal detector removal configuration per ring that minimizes the impact on image quality following recovery using the CS model. The results of these investigations can help make PET scanners more affordable while maintaining image quality. They can also be used to redesign scanners so that the removed detectors are placed along the axial extent to image a larger portion of the body. This would increase scanner throughput, and hence improve scanner efficiency and reduce patient discomfort due to long scan times.