Target readers: All, Developers, Programmers
Keywords: Hadoop, HDFS, MapReduce
Introduction
Hadoop
In this era we have a huge amount of data to manage. As per the IDC estimate, the total size of our “digital universe” is around 2 zettabytes (1 zettabyte ≈ 2^70 bytes). This enormous volume subsumes the data generated by both machines and people: machine logs, vehicle GPS traces, RFID readers, and so on all contribute to this mountain of data. Storing and analyzing such vast data is a major concern, because the speed at which data can be read from hard drives has failed to keep up with the drastic increase in their storage capacity. For instance, reading 1 TB (2^40 bytes) of data at 100 MB/s takes roughly 10,500 seconds, i.e. about three hours. The solution to this problem is Hadoop. Hadoop provides reliable shared storage together with a system for analysis: the storage is handled by HDFS and the analysis by MapReduce.
MapReduce
MapReduce is a programming framework that abstracts the problem away from low-level disk reads and writes and transforms it into a computation over sets of keys and values. MapReduce is a batch query processor: it can run an ad hoc query against an entire dataset and return the result in a reasonable amount of time.
RDBMS vs. MapReduce
1. A traditional RDBMS is capable of handling data on the order of gigabytes, whereas MapReduce can handle datasets of up to petabytes (1 petabyte = 2^50 bytes).
2. Traditional databases work well for structured data (data in a predefined format, e.g. XML documents or database tables). MapReduce, on the other hand, also works well on unstructured or semi-structured data, such as plain text or image data.
3. MapReduce is linearly scalable. The programmer writes two functions, Map and Reduce, each defining a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data, so if the data doubles and the cluster doubles, a job still takes about the same time. This is not true for relational databases.
4. Traditional databases are normalized to retain integrity as well as to reduce redundancy, but normalization poses problems for MapReduce, because it turns reading a record into a non-local operation.
Grid Computing vs. MapReduce
Grid computing works well for compute-intensive jobs. However, it does not work in the desired manner when nodes have to access large amounts of data, since the network bandwidth becomes the bottleneck. MapReduce instead tries to collocate the data with the compute nodes, which leads to faster data access; this concept of data locality is why MapReduce gives good performance. MapReduce also models the network topology in a way that helps conserve network bandwidth. In large-scale distributed computing, coordinating the processes is a challenge: problems like remote process failures and partial failures have to be handled properly. With MapReduce, the programmer does not have to worry about such failures, since the implementation automatically detects failed map or reduce tasks and reschedules replacements. This is possible because MapReduce has a shared-nothing architecture, which means that tasks have no dependence on one another.
Power of Hadoop
A wide range of algorithms can be expressed in MapReduce: graph-based problems, image processing, and many others can be solved using Hadoop. Using Hadoop, a team at Yahoo! was able to sort 1 terabyte (2^40 bytes) of data in 62 seconds. Hadoop is mostly known for MapReduce and its distributed file system, but the name also covers a number of related projects that fall under the umbrella of large-scale data processing and distributed computing.
MapReduce
MapReduce is a programming framework for processing data. Hadoop can run MapReduce programs written in different languages such as Java, Ruby, Python, and C++. Hadoop also provides parallel processing, which makes large-scale data analysis very simple; we take advantage of this by expressing our query as a MapReduce job. MapReduce breaks the processing into two phases:
1. Map Phase
2. Reduce Phase
In each phase both the input and the output are represented as key-value pairs. The programmer has to specify:
1. The types of the keys and values.
2. Two functions (their general shapes are sketched just after this list):
a) Map Function
b) Reduce Function
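In general terms, the two functions have the following shapes (this notation is not from the original text; K1/V1, K2/V2, and K3/V3 simply stand for the key and value types the programmer chooses):

    map:    (K1, V1)        -> list(K2, V2)
    reduce: (K2, list(V2))  -> list(K3, V3)

The framework sorts the map output by key and brings all the values for a given key together before handing them to the Reduce function.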
Working of MapReduce Model
The Map function reads each line of raw input and pulls out the relevant data as a key-value pair. The MapReduce framework then groups these pairs and hands them to the Reduce function, which processes each group accordingly. For example, suppose a certain university wants to find the marks of the topper (across all streams and batches) for each year from 1900 to 2011, where the data is saved in different files and directories, with each record in the format:
Year#Name#RollNumber#Stream#Marks#Batch#Country#PhoneNumber#Address#VehicleNumber#…..
To visualize the working of MapReduce, let’s consider the following data,
1950#Ram#12345#Science#78#mar1950batch#India#NA#NA#NA
1950#SRam#123456#Science#75#mar1950batch#India#NA#NA#NA
1950#DRam#123457#Science#98#mar1950batch#India#NA#NA#NA
1950#VRam#123458#Science#68#mar1950batch#India#NA#NA#NA
….
1951#KRam#123459#Science#78#mar1951batch#India#NA#NA#NA
1951#TRam#123451#Science#68#mar1951batch#India#NA#NA#NA
1951#ERam#123452#commerce#99#mar1951cbatch#India#NA#NA#NA
1951#QRam#123453#Arts#12#mar1951abatch#India#NA#NA#NA
…..
Let’s assume we have 75 lakh (7.5 million) such data records with us.
The input to the Map function will be the complete set of this raw data. The Map function will extract the year and the marks from each data record, producing pairs like:
(1950, 78)
(1950, 75)
(1950, 98)
(1950, 68)
…
(1951, 78)
(1951, 68)
(1951, 99)
(1951, 12)
…
Before this data is sent to the Reduce function, the output of the Map function is processed by the MapReduce framework. In this example, the framework groups the pairs by year, like:
(1950, [78, 75, 98, 68, …])
(1951, [78, 68, 99, 12 …])
….
Now when this grouped data is given to the Reduce function, the only work of that function is to find the maximum number in each group. So the final output from the Reduce function looks like:
(1950, 98)
(1951, 99)
…..
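To make this concrete, here is a minimal sketch of the two functions for this example, written in Java against Hadoop's MapReduce API. The class names MaxMarksMapper and MaxMarksReducer are illustrative rather than taken from the article, the field positions assume the Year#Name#RollNumber#Stream#Marks#… layout shown above, and in practice each class would live in its own source file (or as a public static nested class of the job's driver class).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map function: pulls (year, marks) out of each '#'-separated record.
    public class MaxMarksMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Record layout: Year#Name#RollNumber#Stream#Marks#Batch#...
            String[] fields = line.toString().split("#");
            if (fields.length > 4) {
                String year = fields[0];
                int marks = Integer.parseInt(fields[4]);
                context.write(new Text(year), new IntWritable(marks));
            }
        }
    }

    // Reduce function: receives (year, [marks, marks, ...]) and emits the maximum.
    public class MaxMarksReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> marks, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable m : marks) {
                max = Math.max(max, m.get());
            }
            context.write(year, new IntWritable(max));
        }
    }

Note that the framework, not the programmer, takes care of grouping the mapper's (year, marks) pairs by year before reduce is called, exactly as in the grouped output shown above.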
Dataflow
Hadoop executes a job by dividing it into two types of tasks: map tasks and reduce tasks.
The job execution is controlled by two types of nodes:
1. A single job tracker
2. Multiple task trackers
The job tracker coordinates the jobs running on the system by dividing each job into tasks and scheduling those tasks to run on task trackers. Task trackers run the tasks assigned to them and send progress reports back to the job tracker. Hadoop divides the input into fixed-size pieces called splits, and one map task is created for each split. The map task runs the user-defined Map function for each record in its split. Once a map task has executed successfully, it writes its output to local disk. This map output is an intermediate result which is fed to the reduce tasks to produce the final output.
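To show where a job enters this picture, the sketch below is a minimal driver class that submits the max-marks job described earlier. It assumes the illustrative MaxMarksMapper and MaxMarksReducer classes from the previous sketch, and the input and output paths are placeholders passed on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxMarksDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max marks by year");
            job.setJarByClass(MaxMarksDriver.class);

            job.setMapperClass(MaxMarksMapper.class);
            job.setReducerClass(MaxMarksReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Whatever lives under the input path is divided into fixed-size
            // splits, and one map task is run per split.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Once the job is submitted, the scheduling described above takes over: one map task is created per input split, tasks are assigned to task trackers, and failed tasks are rescheduled automatically.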
HDFS
HDFS is the acronym for the Hadoop Distributed File System. It is a file system designed for storing very large files. An HDFS cluster consists of a NameNode, which manages the file system's metadata, and DataNodes, which store the actual data. Hadoop and HDFS are best suited for distributed storage and distributed processing. HDFS is scalable and fault tolerant, and it is easily configurable: the default configuration works for many installations, though the configuration generally needs to be tuned for extremely large clusters.
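As a small illustration of how a client program talks to HDFS, the sketch below (not from the original article) opens a file stored in HDFS and copies it to standard output using Hadoop's Java FileSystem API; the file path in args[0] is a placeholder.

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // The Configuration picks up the NameNode address from the
            // cluster's configuration files (e.g. core-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Open the file and stream its contents to standard output.
            try (InputStream in = fs.open(new Path(args[0]))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }

The client asks the NameNode only for metadata (which blocks make up the file and where they live); the file contents themselves are streamed from the DataNodes.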
Some of the salient features of HDFS are:-
o Rack awareness: a node's physical location is taken into account when scheduling tasks as well as when allocating storage.
o Safe mode: It is an administrative mode used for maintenance.
o Upgrade and rollback: it is possible to roll HDFS back to its state from before an upgrade in case of unexpected problems.
o Backup node: this is an extension of the Checkpoint node. Along with checkpointing, it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode's namespace state. Only one Backup node may be registered with the NameNode at a time.
Abhishek Malik
MBA-IT
IIIT Allahabad