Flow-Controlled Background Replication for Big Data Jobs
Dzinamarira, Simbarashe J
Master of Science
This thesis proposes mechanisms to reduce the cost of data replication, an expensive but crucial operation in distributed file systems (DFSs). DFSs lie at the foundation of big data processing systems, so improving them benefits nearly the entire big data ecosystem. Replication secures data against system failures but slows down applications by increasing I/O contention in the system. This thesis proposes flow-controlled background replication as a method to minimize the impact of replication on application performance. The proposed system accelerates jobs, exploits under-utilized storage I/O bandwidth, and supports job-based and replica-based bandwidth allocation. Our implementation, called Pfimbi, reduced the runtime of data-intensive jobs by up to 30%. On a workload based on a Facebook trace, Pfimbi reduced job runtimes by 15% on average. Pfimbi achieves these runtime improvements while obtaining a work span comparable to that of the common synchronous replication scheme.
replication; flow control; distributed file system; big data