Abstract:

The High performance computing and Grid Computing Communities have been doing large scale data processing by using API's as Message Passing Interface. High Performance Computing distributes the work across the cluster of machines which assess the shared file system hosted by a Storage Area Network. The problem arises when nodes need to access larger data volumes since the network bandwidth is the bottleneck and compute nodes becomes idle. This problem is solved by the MapReduce.MapReduce is a distributed parallel computing process for large scale data intensive applications like data mining and web indexing. MapReducecollacate the data with the compute node. So the data access is fast because it is local. This feature is called data locality , which is the reason for the High Performance of Mapreduce. Hadoop is an open-source implementation of MapReduce. Due to its parallel programming property mapreduce can analyze very large scale data. Mapreduce is a semi structured and record-oriented program. In this paper we summarize the architecture design , development and functioning of MapReduce.


Keywords: Map Reduce (MR); Volunteer, Parallel computing , distributed computing, data locality;