Hadoop

Target readers: All
Keywords: Big Data, Hadoop, Architecture

Introduction to Hadoop:

Apache Hadoop is a software framework that supports data-intensive applications. It enables applications to work across thousands of machines and with petabytes of data. Hadoop was created by Doug Cutting and Michael J. Cafarella. It is an Apache project used globally, with Java as its primary programming language.

Architecture of HADOOP:

Hadoop contains a common layer known as Hadoop Common, which provides access to the file systems supported by Hadoop. Since Hadoop uses Java as its programming language, Hadoop Common contains the needed JAR files.
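
As an illustration, the FileSystem class in Hadoop Common exposes one Java API over the different file systems Hadoop supports, and the URI scheme selects the implementation. A minimal sketch (the hdfs://namenode:9000 address is a placeholder, not a real cluster):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FileSystemDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The URI scheme picks the implementation behind the common API:
            // "file://" gives the local file system, "hdfs://" gives HDFS.
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
            System.out.println(local.getUri() + " / " + hdfs.getUri());
        }
    }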

Every Hadoop-compatible file system provides location awareness, i.e. the rack on which a worker node sits, for better scheduling, and Hadoop uses this information to place work close to the data. HDFS (the Hadoop Distributed File System) replicates data across different nodes and racks; the main purpose of this replication is to keep the data usable even during a network or node failure.
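
For example, the replication factor can be set through the standard dfs.replication property when a file is written. A minimal sketch, assuming a running HDFS cluster (the path and the factor of 3 are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicatedWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "3"); // keep three copies of every block
            FileSystem fs = FileSystem.get(conf);
            // Each block of this file is stored on three different DataNodes,
            // so it stays readable even if one node or rack fails.
            FSDataOutputStream out = fs.create(new Path("/data/sample.txt"));
            out.writeUTF("replicated across the cluster");
            out.close();
        }
    }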

A Hadoop cluster consists of a single master node and, usually, multiple worker nodes. The master node runs the JobTracker, TaskTracker, NameNode, and DataNode, while a worker node acts as both a DataNode and a TaskTracker.

In a large Hadoop cluster, each HDFS instance is assigned a particular primary NameNode to host the file-system index and a secondary NameNode to take snapshots of the NameNode's memory structures, which reduces the amount of data lost and file-system corruption after a failure. A single JobTracker manages the scheduling of jobs.
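
To show how a job reaches the JobTracker, here is the classic WordCount example written against the Hadoop 1.x (org.apache.hadoop.mapred) API; JobClient.runJob hands the configured job to the JobTracker, which splits it into map and reduce tasks for the TaskTrackers. The /input and /output paths are illustrative:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    output.collect(word, ONE); // emit (word, 1) for every token
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get(); // add up the 1s for this word
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path("/input"));
            FileOutputFormat.setOutputPath(conf, new Path("/output"));
            JobClient.runJob(conf); // submit to the JobTracker and wait
        }
    }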

File Systems in HADOOP:

HDFS:

HDFS is designed to run on commodity hardware. It is similar to other distributed file systems, but some differences set it apart: it is highly tolerant of failures and can operate on low-cost hardware. HDFS gives high-throughput access to data and is appropriate for applications that work with Big Data.
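
For instance, an application reads an HDFS file as an ordinary stream, with the bytes served block by block from the DataNodes. A sketch (the path is illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // fs.open streams the file's blocks from whichever DataNodes hold them.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sample.txt"))));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }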

HDFS has the ability to restore the NameNode's data in case of failure. It includes a secondary NameNode, whose name misleads people into believing that it takes over from the primary node when that node fails; in reality, the secondary NameNode only takes periodic snapshots of the primary NameNode's memory structures. The advantage of these snapshots is that if the primary NameNode goes offline, the entire directory can be rebuilt from the latest snapshot without rerunning the complete edit log.

HDFS also provides location awareness to the JobTracker and TaskTracker. The JobTracker assigns tasks to TaskTrackers with knowledge of where the data resides, which helps reduce redundant transfer of data across the network.
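
The block locations that make this scheduling possible are also visible through the API. A sketch that prints which hosts hold each block of a file (the path is illustrative):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            // The JobTracker uses this same location information to place
            // map tasks on TaskTrackers close to the data.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(Arrays.toString(block.getHosts()));
            }
        }
    }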

Multi Node Hadoop Cluster:

[Figure: multi-node Hadoop cluster]

HDFS Architecture:

[Figure: HDFS architecture]

Block Replication in HDFS:

Block replication is a distinctive feature of the HDFS file system. Each block of a file is stored on several DataNodes (three by default), so the data can still be recovered if a DataNode fails.
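
The replication factor of an existing file can also be changed after the fact. A sketch using FileSystem.setReplication (the path and the factor of 5 are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ChangeReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask the NameNode to keep 5 copies of each block of this file;
            // the extra copies are created in the background.
            fs.setReplication(new Path("/data/sample.txt"), (short) 5);
        }
    }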

Disadvantages of Hadoop:

Generally, Hadoop cannot support complex queries with multiple joins. Normalized data models also cannot be used directly; the developer has to denormalize the data first. In other words, Hadoop is best suited for batch analytics (OLAP) rather than OLTP.
• No uniformity: data is stored without a fixed schema.
• Access control in Hadoop is insufficient.
• Programs become more complicated over time.
• Hadoop systems cannot support relational features such as joins and constraints.
• Hadoop does not perform well in real-time scenarios.

Anshruta
MBA-IT
IIIT Allahabad