What is Hadoop MapReduce?
MapReduce is a programming model for processing data on distributed systems, and Hadoop's implementation of it is written in Java. The MapReduce algorithm consists of two tasks, known as Map and Reduce. Map: takes a set of data and converts it into another set of data in which the individual elements are broken down into tuples (key/value pairs).
Reduce: takes the output of a map as its input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task always runs after the map task.
MapReduce makes it easy to process data over multiple nodes. The data processing primitives are called mappers and reducers. Once an application is written in the MapReduce form, scaling it to run over any number of machines in a cluster only requires a change in the configuration.
Algorithm for MapReduce
Map stage: the main task of the map stage is to process the raw input data and convert it into several small chunks of data. The input data is stored in HDFS, the Hadoop Distributed File System.
Reduce stage: this stage is the combination of the shuffle step and the reduce step. Its job is to process the data that comes from the mapper and to produce a new set of output, which is stored in HDFS.
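The stages described above can be illustrated with a small word count. This is only a plain-Java sketch that runs in memory: the class name StagesSketch and its methods are hypothetical, no Hadoop classes are used, and a real job would read its input from HDFS and write its output back to it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map, shuffle and reduce steps (no Hadoop classes involved).
class StagesSketch {

    // Map stage: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group the values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: combine all values of one key into a single value.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the steps in order over in-memory lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("deer bear river", "car car river")));
    }
}
```

Each call to map() depends only on its own line, which is what allows a cluster to run many mappers in parallel before the shuffle brings equal keys together.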
Old and New APIs for MapReduce
The old Hadoop API for the MapReduce framework is the API used by versions older than 0.20.0. The new API, sometimes referred to as “Context Objects”, is available in Hadoop v0.20.0 and later. Switching from the older API to the newer one requires the application to be rewritten, because the two are type-incompatible. However, the new API is designed so that it can evolve in later versions without requiring applications to be rewritten again.
The new API was mostly complete as of the 1.x release series. However, it does not include certain libraries associated with the MapReduce framework.
Since the new API is designed to ease the transition to future versions, it employs abstract classes instead of interfaces. A method with a default implementation can be added to an abstract class without breaking existing implementations of that class. For example, the Mapper and Reducer interfaces of the old API are abstract classes in the new API.
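The reasoning about abstract classes can be made concrete with a small sketch. BaseMapper and UpperCaseMapper are hypothetical names invented for illustration, not Hadoop classes; the point is only that a subclass written before setup() existed keeps working after the base class gains it.

```java
class AbstractClassSketch {
    // Hypothetical framework base class. Imagine that version 1 declared only map()
    // and that version 2 later added setup() with an empty default body.
    static abstract class BaseMapper {
        void setup(StringBuilder out) {
            // No-op by default, so subclasses written for version 1 still compile.
        }

        abstract void map(String record, StringBuilder out);
    }

    // User code written against "version 1": it only knows about map().
    static class UpperCaseMapper extends BaseMapper {
        @Override
        void map(String record, StringBuilder out) {
            out.append(record.toUpperCase()).append(';');
        }
    }

    // Framework-side code that calls the newer hook as well as map().
    static String process(BaseMapper mapper, String... records) {
        StringBuilder out = new StringBuilder();
        mapper.setup(out);
        for (String record : records) {
            mapper.map(record, out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(process(new UpperCaseMapper(), "a", "b"));
    }
}
```

Had BaseMapper been a plain interface without default methods, adding setup() would have forced every existing implementation to change.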
The package org.apache.hadoop.mapreduce contains the new API, whereas the package org.apache.hadoop.mapred contains the old API. The new API allows mappers and reducers to control the flow of execution by overriding the run() method. In both the old API and the new API, key-value record pairs are pushed to mappers and reducers.
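As a rough sketch of what controlling the flow through run() means: in the new API the default run() of a mapper calls setup() once, then map() for every record, then cleanup() once, and a subclass may override it. The classes below are simplified, hypothetical stand-ins that iterate over strings instead of a Hadoop context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RunMethodSketch {
    // Hypothetical, simplified stand-in for a new-API mapper.
    static class SketchMapper {
        final List<String> output = new ArrayList<>();

        void setup() { }

        void map(String record) {
            output.add(record);
        }

        void cleanup() { }

        // Default flow: setup once, map() for every record, cleanup once.
        void run(Iterator<String> records) {
            setup();
            while (records.hasNext()) {
                map(records.next());
            }
            cleanup();
        }
    }

    // Overriding run() changes the flow: stop after the first `limit` records.
    static class FirstNMapper extends SketchMapper {
        private final int limit;

        FirstNMapper(int limit) {
            this.limit = limit;
        }

        @Override
        void run(Iterator<String> records) {
            setup();
            int seen = 0;
            while (seen < limit && records.hasNext()) {
                map(records.next());
                seen++;
            }
            cleanup();
        }
    }

    public static void main(String[] args) {
        FirstNMapper mapper = new FirstNMapper(2);
        mapper.run(Arrays.asList("a", "b", "c").iterator());
        System.out.println(mapper.output);
    }
}
```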
JobClient is replaced by the Job class in the new API, which provides better control over jobs. The JobConf object that is used for job configuration in the old API is replaced by a Configuration object, together with certain helper methods on Job.
Unlike the old API, which uses a common name (part-nnnnn) for the outputs of both map and reduce, the new API uses different names for the outputs of map and reduce. These names are part-m-nnnnn and part-r-nnnnn for map and reduce respectively, where nnnnn is an integer part number.
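The naming scheme can be written down as a small helper. This is only an illustration of the pattern described above; the class and method names are hypothetical, and Hadoop generates these names itself.

```java
import java.util.Locale;

class PartFileNameSketch {
    // New-API style name: "m" for map output, "r" for reduce output,
    // followed by a zero-padded five-digit part number.
    static String newApiName(boolean isMapOutput, int part) {
        return String.format(Locale.ROOT, "part-%s-%05d", isMapOutput ? "m" : "r", part);
    }

    // Old-API style name, shared by map and reduce outputs.
    static String oldApiName(int part) {
        return String.format(Locale.ROOT, "part-%05d", part);
    }

    public static void main(String[] args) {
        System.out.println(newApiName(true, 0));
        System.out.println(newApiName(false, 3));
        System.out.println(oldApiName(3));
    }
}
```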
Classes that should be Included in MapReduce
There are three classes that need to be included in a MapReduce program. They are as follows,
Mapper class: the setMapperClass() method is used to specify the map type to use for the job.
Reducer class: it accepts a group of values from the mapper, which are reduced using aggregation to generate a key-value pair. It aggregates the mapper output and writes the result to HDFS.
setOutputKeyClass() and setOutputValueClass(): these methods are used for setting the types of the outputs generated by the map and reduce operations, which are usually the same for both. When they differ, the map output types are set with the corresponding map methods, setMapOutputKeyClass() and setMapOutputValueClass().
setInputFormatClass(): this method is used for setting the input format.
setJarByClass(): this method is used for locating the JAR file that contains the job code. Rather than naming the JAR file explicitly, it is enough to pass a class to this method, and Hadoop looks for the JAR file in which that class is present.
Driver class: it allows the user to run a MapReduce job by specifying the input, the output, the mapper class, the reducer class and the required set of parameters.
The code associated with the driver class is referred to as driver code. A driver class is responsible for launching MapReduce jobs, passing in parameters such as the mapper and reducer classes, the input, the output, etc.
It is considered boilerplate code, that is, a piece of code that is repeated in multiple places with only certain modifications to its parameters. To provide additional support, it also includes an auxiliary class.
Consider a driver class that evaluates the minimum temperature as an example. In such a driver, a Configuration object is created and used to obtain the Job instance. On this instance, it is possible to call the setter methods described above with different parameters.
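The shape of such a driver can be sketched without Hadoop on the classpath. Everything below is an assumption made for illustration: SketchJob is a hypothetical stand-in for a job object, the input lines are assumed to have the form "year,temperature", and the setter names only mimic the style of the real ones. It is not the Hadoop Job API.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

class MinTemperatureDriverSketch {
    // Hypothetical stand-in for a job: it stores what the driver sets and runs it in memory.
    static class SketchJob {
        private List<String> input;
        private Function<String, Map.Entry<String, Integer>> mapper;
        private BinaryOperator<Integer> reducer;

        void setInput(List<String> lines) { this.input = lines; }
        void setMapper(Function<String, Map.Entry<String, Integer>> mapper) { this.mapper = mapper; }
        void setReducer(BinaryOperator<Integer> reducer) { this.reducer = reducer; }

        // Maps every line to a (key, value) pair and reduces the values that share a key.
        Map<String, Integer> run() {
            Map<String, Integer> output = new TreeMap<>();
            for (String line : input) {
                Map.Entry<String, Integer> pair = mapper.apply(line);
                output.merge(pair.getKey(), pair.getValue(), reducer);
            }
            return output;
        }
    }

    // Driver code: only wires input, mapper and reducer together, then launches the job.
    static Map<String, Integer> runJob(List<String> lines) {
        SketchJob job = new SketchJob();
        job.setInput(lines);
        job.setMapper(line -> {
            String[] fields = line.split(",");
            return new AbstractMap.SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[1].trim()));
        });
        job.setReducer(Math::min); // keep the smallest temperature per year
        return job.run();
    }

    public static void main(String[] args) {
        System.out.println(runJob(Arrays.asList("1950,22", "1950,-11", "1949,111", "1949,78")));
    }
}
```

The driver itself contains no processing logic, which is why driver code tends to look the same from one job to the next apart from its parameters.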
Mapper code in Hadoop
A mapper in MapReduce is responsible for providing parallelism. The TaskTracker hosts the mapper and runs it. The code associated with the mapper is referred to as the mapper code. The logic used in the mapper code must be capable of executing independently. It should be capable of performing the parallel tasks mentioned in the algorithm.
The input format is specified in the driver program as a specific InputFormat type, and it determines the file on which the mapper is executed. The output of the mapper is a key and a value, of the types set for the mapper output. It is stored in an intermediate file that is created specifically for it in the OS space path. Operations such as read, shuffle, and sort are performed on this file.
The general format of a mapper class is as follows,
org.apache.hadoop.mapreduce.Mapper<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE>
These four parameters specify the types of the inputs and outputs associated with the mapper function.
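To show how the four type parameters line up with an actual map function, here is a minimal sketch that uses a hypothetical SketchMapper interface in place of the Hadoop class. The input record is assumed to be a line offset plus a line of the form "year,temperature", and the output is a (year, temperature) pair.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapperTypesSketch {
    // Hypothetical interface mirroring the four type parameters of a mapper.
    interface SketchMapper<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE> {
        void map(INPUT_KEY key, INPUT_VALUE value,
                 List<Map.Entry<OUTPUT_KEY, OUTPUT_VALUE>> output);
    }

    // Input: (offset of the line, line text). Output: (year, temperature).
    static class TemperatureMapper implements SketchMapper<Long, String, String, Integer> {
        @Override
        public void map(Long offset, String line, List<Map.Entry<String, Integer>> output) {
            String[] fields = line.split(",");
            if (fields.length == 2) { // silently skip malformed lines
                output.add(new AbstractMap.SimpleEntry<>(
                        fields[0].trim(), Integer.parseInt(fields[1].trim())));
            }
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        new TemperatureMapper().map(0L, "1950,22", output);
        System.out.println(output);
    }
}
```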
A reducer reduces the set of intermediate values that share a key to a smaller set of values. A reducer in MapReduce performs three major operations. They are,
Shuffle: the reducer collects its input by copying the sorted output of the mappers. It uses the HTTP protocol to retrieve the relevant partition of the output of every mapper.
Sort: the reducer inputs are arranged by key, typically using the merge sort approach.
Reduce: this phase of the reducer calls the method reduce(Object, Iterable, Context) for each <key, (collection of values)> in the sorted inputs. It uses TaskInputOutputContext.write(Object, Object) to forward the results generated by reduce() to the RecordWriter.
The general format of reducer code can be written as,
org.apache.hadoop.mapreduce.Reducer<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE>
It can be observed that it also uses four parameters, similar to the mapper code.
In a word count reducer, for example, the four parameters are Text, IntWritable, Text, and IntWritable, in which the first two are the input types and the remaining two are the output types. Here, an iterator is used to move across all the values recorded for a word and count them, i.e., the total is provided as output.
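The word count reducer described above can be sketched as follows. SketchReducer is a hypothetical interface, String and Integer stand in for Text and IntWritable, and a plain Map plays the role of the context that receives the output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class ReducerTypesSketch {
    // Hypothetical interface mirroring the four type parameters of a reducer.
    interface SketchReducer<INPUT_KEY, INPUT_VALUE, OUTPUT_KEY, OUTPUT_VALUE> {
        void reduce(INPUT_KEY key, Iterable<INPUT_VALUE> values,
                    Map<OUTPUT_KEY, OUTPUT_VALUE> context);
    }

    // Input: (word, counts emitted by the mappers). Output: (word, total count).
    static class SumReducer implements SketchReducer<String, Integer, String, Integer> {
        @Override
        public void reduce(String word, Iterable<Integer> counts, Map<String, Integer> context) {
            int total = 0;
            for (int count : counts) { // iterate over every value that shares this key
                total += count;
            }
            context.put(word, total);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> context = new LinkedHashMap<>();
        new SumReducer().reduce("river", Arrays.asList(1, 1, 1), context);
        System.out.println(context);
    }
}
```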
Combiners are used mainly for two reasons. They are as follows:
1. To minimize the number of values associated with the keys that are emitted by the mapper.
2. To minimize the amount of data forwarded to the shuffle.
The reducer can often be used as the combiner as well. However, a combiner cannot replace the reducer. The combiner function in Hadoop is executed over the output generated by the mapper, and its output forms the input of the reducer.
The combiner function is an optimization that is used to improve performance. Hadoop may run it any number of times, including not at all, so it must be written in such a way that the reducer generates the same output however often it runs.
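A small in-memory sketch can show both effects: fewer records reach the shuffle, and the final result does not change however often the combiner runs. It assumes a summing job, where the same function can serve as combiner and reducer; the names are hypothetical and no Hadoop classes are involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Sums the values per key. Used here both as the combiner (once per mapper)
    // and as the reducer (once over everything that was shuffled).
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            result.add(pair(e.getKey(), e.getValue()));
        }
        return result;
    }

    static List<Map.Entry<String, Integer>> concat(List<Map.Entry<String, Integer>> a,
                                                   List<Map.Entry<String, Integer>> b) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapper1 = Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1));
        List<Map.Entry<String, Integer>> mapper2 = Arrays.asList(pair("a", 1), pair("c", 1), pair("c", 1));

        List<Map.Entry<String, Integer>> withoutCombiner = concat(mapper1, mapper2);
        List<Map.Entry<String, Integer>> withCombiner = concat(sumByKey(mapper1), sumByKey(mapper2));

        System.out.println("records shuffled without combiner: " + withoutCombiner.size());
        System.out.println("records shuffled with combiner:    " + withCombiner.size());
        System.out.println("same reducer output: "
                + sumByKey(withoutCombiner).equals(sumByKey(withCombiner)));
    }
}
```

This works because addition is associative and commutative. A job that computes, say, an average could not reuse its reducer as a combiner in the same way.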
A partitioner transmits the key-value pairs from the mapper to particular reducers. It uses the hash function as its default partitioner. Using this function, it takes a key and determines the partition to which that key belongs. The number of partitions is equal to the number of reducer tasks.
The hash partitioner is the default partitioner, by which the record key is hashed in order to determine the partition of the record.
The general format of a partitioner class is as follows,
org.apache.hadoop.mapreduce.Partitioner<K, V>
where K is a key and V is a value.
The partitioners that are subclassed from this class are BinaryPartitioner, TotalOrderPartitioner, HashPartitioner and KeyFieldBasedPartitioner. Partitioning is an intermediate phase that occurs between the Map and Reduce functions.
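The idea behind hash partitioning fits in one line: hash the key, clear the sign bit so the result is not negative, and take the remainder by the number of reduce tasks. The sketch below uses that formula in a hypothetical helper class rather than the Hadoop Partitioner itself.

```java
class HashPartitionSketch {
    // Returns a partition number in the range 0 .. numReduceTasks - 1.
    // Masking with Integer.MAX_VALUE clears the sign bit of negative hash codes.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"deer", "bear", "river", "car"}) {
            System.out.println(key + " -> partition " + getPartition(key, reducers));
        }
    }
}
```

Since equal keys have equal hash codes, every record with a given key lands in the same partition and therefore reaches the same reducer.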