Improving the Efficiency of Map-Reduce Task Engine
Cox, Alan L.
Master of Science
Map-Reduce is a popular distributed programming framework for parallelizing computation on huge datasets over a large number of compute nodes. This year marks a decade since it was invented by Google in 2004. Hadoop, a popular open source implementation of Map-Reduce, was introduced by Yahoo in 2005. Over these years, many researchers have worked on problems related to Map-Reduce and similar distributed programming models, and Hadoop itself has been the subject of various research projects. Prior work in this field has focused on making Map-Reduce more efficient for iterative processing, or on pipelining execution across different jobs, which has improved the performance of iterative applications. However, little attention has been given to the task engine that carries out the Map-Reduce computation itself. Our analysis of applications running on Hadoop shows that more than 50% of the time is spent in the framework on tasks such as sorting, serialization, and deserialization. We address this problem by introducing an extension to the Map-Reduce programming model. This extension allows us to use more efficient data structures, such as hash tables, and to lower the cost of serializing and deserializing key-value pairs. With these changes we have lowered the overheads of the framework and improved the performance of certain important applications, such as Pagerank and Join, by a factor of 1.5 to 2.5.
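As an illustration of why hash tables can replace sorting in the reduce path, the following minimal Python sketch contrasts sort-based grouping with hash-based aggregation. It is not the thesis's implementation or Hadoop's API; the function names `reduce_by_sort` and `reduce_by_hash` and the sample data are hypothetical, and the sketch assumes the reduction can be expressed as an incremental combine step.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def reduce_by_sort(pairs, reduce_fn):
    """Sort-based grouping: sort all pairs by key, then reduce each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))  # O(n log n) sort before any reduction
    return {
        key: reduce_fn(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_by_hash(pairs, combine, initial):
    """Hash-based aggregation: fold each value into a per-key accumulator, no sort."""
    accumulators = defaultdict(lambda: initial)
    for key, value in pairs:
        accumulators[key] = combine(accumulators[key], value)
    return dict(accumulators)


# Hypothetical sample input: (key, value) pairs as emitted by a map phase.
pairs = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]

by_sort = reduce_by_sort(pairs, sum)
by_hash = reduce_by_hash(pairs, lambda total, value: total + value, 0)
```

Both functions produce the same per-key result here; the hash-based version avoids the sort and processes pairs as they arrive, which is the kind of saving the abstract attributes to hash tables.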
Hadoop; Map-Reduce; Pagerank; Join; Barrier free