Understanding and Improving the Efficiency of Failure Resilience for Big Data Frameworks
Ng, T. S. Eugene
Doctor of Philosophy
Big data processing frameworks (MapReduce, Hadoop, Dryad) are hugely popular today because they greatly simplify the management and deployment of big data analysis jobs requiring the use of many machines in parallel. A strong selling point is their built-in failure resilience support. Big data frameworks can run computations to completion despite occasional failures in the system. However, an important but overlooked point has been the efficiency of their failure resilience. The vision of this thesis is that big data frameworks should not only be failure resilient but that they should provide the resilience in an efficient manner with minimum impact on computations both under failures as well as during failure-free periods. To this end, the first part of the thesis presents the first in-depth analysis of the efficiency of the failure resilience provided by the popular Hadoop framework under failures. The results show that even single machine failures can lead to large, variable and unpredictable job running times. This thesis determines the causes behind this inefficient behavior and points out the responsible Hadoop mechanisms and their limitations. The second part of the thesis focuses on providing efficient failure resilience for the case of computations comprised of multiple jobs. We present the design, implementation and evaluation of RCMP, a MapReduce system based on the fundamental insight that using data replication to enable failure resilience oftentimes leads to significant and unnecessary increases in computation running time. In contrast, RCMP is designed to use job re-computation as a first-order failure resilience strategy. Re-computations under RCMP are efficient. Specifically, RCMP re-computes the minimum amount of work and uniquely it ensure this minimum re-computation work is performed efficiently. In particular, RCMP mitigates hot-spots that affect data transfers during re-computations and ensures that the available compute node parallelism is well leveraged.