Target readers: All Big Data Aspirants
Keywords: Hadoop, HDFS, MapReduce
We have huge volumes of untapped information sitting in unstructured documents spread across our networks. This data helps us create new products, refine existing products, discover new trends and, moreover, understand our business better. Hadoop is a distributed framework designed to handle such huge volumes of data, literally petabytes, and to process large datasets in a scalable and fault-tolerant manner.
As we see in the above picture, the core of Hadoop is HDFS and MapReduce. HDFS provides distributed storage across clusters of commodity machines. MapReduce is the programming model we use to process the voluminous amounts of data stored across those clusters. There are many projects inside the Hadoop ecosystem, each designed to solve a specific problem. Most of them are still in the incubation stage; in this document we look at the projects that have graduated from the incubator and have already become part of typical Hadoop implementations.
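To make the MapReduce model concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API; the input and output paths come from the command line and are assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Hive and Pig, described below, generate jobs of exactly this shape for you, which is why they are so useful when you do not want to write MapReduce code by hand.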
Data Processing: Hive and Pig are data processing libraries that help us process data stored in HDFS. The main reason we need these alternative ways to process data inside Hadoop is that there are not many professionals who are low-level Java, Python, or C/C++ programmers able to write MapReduce jobs that fetch data from HDFS and process it. Moreover, operations such as filtering, joining, and grouping, which are easy to express in SQL, are difficult and time-consuming to implement in a programming language like Java.
Hive: Hive is an open source data processing project from the Apache Software Foundation. It is a data warehouse built on top of Hadoop. It offers a query language called HiveQL, which is very similar to SQL. Hive provides a way to project structure onto the large datasets residing in the Hadoop Distributed File System and helps in managing them using HiveQL. Hive converts the queries we write into MapReduce jobs and submits them to the Hadoop cluster. Hive provides access to data stored in HDFS or in other stores such as HBase. Hive is designed for OLAP, not for OLTP.
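As a rough sketch of how this looks from an application, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver; the connection details and the sales table below are hypothetical. Hive turns the query into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint; credentials depend on your setup.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {

      // HiveQL looks like SQL; Hive compiles it into MapReduce jobs on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category");
      while (rs.next()) {
        System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```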
Pig: Pig is a high-level data flow scripting language. Pig was initially developed at Yahoo!. Its core components are Pig Latin and the Pig compiler: Pig Latin is the programming language, and the Pig runtime/compiler compiles Pig Latin scripts, converts them into MapReduce jobs, and submits them to the cluster. Pig helps you analyze the data residing in HDFS even if you have no knowledge of MapReduce concepts.
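For illustration, a small Pig Latin flow can also be embedded in Java through Pig's PigServer API; the input file and field names below are made up, and local mode is used only to keep the sketch self-contained (use "mapreduce" to run against the cluster).

```java
import java.util.Iterator;

import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    // "local" for testing; "mapreduce" submits the compiled jobs to the cluster.
    PigServer pig = new PigServer("local");

    // Hypothetical tab-separated input file with name and age columns.
    pig.registerQuery("users = LOAD 'users.txt' AS (name:chararray, age:int);");
    pig.registerQuery("adults = FILTER users BY age >= 18;");
    pig.registerQuery("grouped = GROUP adults BY age;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(adults);");

    // Pig compiles this data flow into MapReduce jobs and streams the results back.
    Iterator<Tuple> it = pig.openIterator("counts");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}
```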
Data Storage:
HBase: Apache HBase is an open source implementation of Google's Bigtable. It is a database that sits on top of the Hadoop Distributed File System: a scalable, non-relational, distributed, column-oriented, multi-dimensional, and highly available database. It is a NoSQL database, meaning that its underlying structure is not as strict (schema oriented) as in traditional relational databases; it is very flexible, which makes it very scalable. HBase provides an efficient way of storing both structured and semi-structured data, and it is also capable of storing large amounts of sparse data (data that contains many empty values). HBase is best suited for application areas that need random, real-time read/write access to large volumes of data. You can write Pig and Hive queries against the data residing in HBase tables.
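To illustrate the random, real-time read/write pattern, here is a minimal sketch with the HBase Java client; the users table, the info column family, and the row key are hypothetical, and the table is assumed to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    // Reads the ZooKeeper quorum and other settings from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Random real-time write: one row keyed by user id, one column in family "info".
      Put put = new Put(Bytes.toBytes("user-1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user1001@example.com"));
      table.put(put);

      // Random real-time read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("user-1001")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```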
Cassandra: The Cassandra database has its roots in Amazon's Dynamo data store. It was originally developed at Facebook; after Facebook open sourced the code, Cassandra became a top-level Apache project. It offers real-time, interactive transaction processing on top of Hadoop. It is also a NoSQL database, designed to provide high availability for large volumes of data spanning clusters and data centers with no single point of failure. If the application area needs high availability, scalability, and high-performance lookups against Hadoop data, Cassandra is a good fit.
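The sketch below shows, in rough terms, how an application talks to Cassandra through CQL using the DataStax Java driver (3.x series); the contact point, keyspace, and orders table are hypothetical and assumed to exist.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSketch {
  public static void main(String[] args) {
    // Hypothetical contact point and keyspace; the driver discovers the rest of the cluster.
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("shop")) {

      // CQL looks like SQL but runs against a distributed, masterless cluster.
      session.execute(
          "INSERT INTO orders (order_id, customer, amount) VALUES (uuid(), 'alice', 42.50)");

      ResultSet rs = session.execute("SELECT customer, amount FROM orders");
      for (Row row : rs) {
        System.out.println(row.getString("customer") + " -> " + row.getDouble("amount"));
      }
    }
  }
}
```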
Note: As most of the features and details of these databases overlap with each other, it is worth smoke-testing each of these technologies up front to find out which best suits our application area.
Data Serialization (Avro & Thrift): Serialization is a way to take data from an application and package it into a format that we can either store on disk or transfer/exchange with another application, which can then unpack and deserialize it into a format it understands. Most of the time data is serialized as XML, JSON, or some binary format.
Avro is a generic data serialization and exchange framework, while Thrift is a language-neutral serialization framework. Thrift is more focused on creating flexible schemas that work with Hadoop data, and it is meant for cross-language compatibility: if you build an application that works with Hadoop data in Java, those same objects can be consumed by applications built in Ruby, Python, C++, or JavaScript.
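To show what schema-based serialization looks like in practice, here is a small Avro sketch in Java; the User schema and the users.avro file name are made up for illustration, and the resulting container file could be read back by Avro bindings in other languages.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema describing a user record.
    String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Serialize: write one record to an Avro container file on disk.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Deserialize: read the records back; any Avro binding, in any language, could do this.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
      for (GenericRecord record : reader) {
        System.out.println(record.get("name") + " is " + record.get("age"));
      }
    }
  }
}
```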
Data Intelligence – Mahout: Mahout is a machine learning algorithm library that conquers the three C's: 1) collaborative filtering (recommendation), 2) clustering, a way to group related documents, and 3) classification, a way to categorize related documents. Mahout is mainly used where predictive analysis or recommendations need to be made using previous trends in the data.
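As a hedged illustration of the collaborative filtering piece, the sketch below uses Mahout's Taste recommender classes; the ratings.csv file (lines of the form userID,itemID,rating) and the chosen similarity and neighborhood settings are assumptions for the example, not a prescribed setup.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical ratings file: userID,itemID,rating per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Collaborative filtering: find users with similar taste, recommend what they liked.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 1, based on previous trends in the data.
    List<RecommendedItem> recommendations = recommender.recommend(1, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println("item " + item.getItemID() + " score " + item.getValue());
    }
  }
}
```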
Data Extraction Tools:
Sqoop: Sqoop, a top-level Apache project, is used for importing data from relational databases such as Oracle and MySQL into HDFS storage, and vice versa. For instance, rather than leaving the results of a MapReduce job in HDFS, we can send them to the relational world so that database professionals can do their own analysis. Sqoop is useful for pushing bulk loads of data from HDFS into relational databases, and also for pulling data from the relational world into Hadoop for archiving and other purposes. Sqoop can thus be used to integrate Hadoop with various relational databases such as Oracle, MySQL, and Teradata.
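Sqoop is normally driven from the command line; the sketch below simply wraps the arguments of a typical sqoop import invocation and hands them to Sqoop's Java entry point. The JDBC URL, credentials, table name, and target directory are all hypothetical.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
  public static void main(String[] args) {
    // Equivalent to: sqoop import --connect jdbc:mysql://dbhost/sales \
    //                  --username etl --password secret --table orders \
    //                  --target-dir /data/sales/orders
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical source database
        "--username", "etl",
        "--password", "secret",
        "--table", "orders",                        // hypothetical table to import
        "--target-dir", "/data/sales/orders"        // HDFS destination directory
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}
```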
Flume: Apache Flume is an application for streaming large volumes of data from various web sources on the internet into the Hadoop Distributed File System. Flume helps in real-time log processing. For instance, the huge amounts of log data generated by web servers can be pushed to HDFS in real time, stored, and analyzed, helping users obtain meaningful information. Flume is very reliable and ensures there is no data loss during the streaming process.
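To give a rough idea of how an application hands log events to Flume, here is a small sketch using Flume's RPC client API; it assumes a Flume agent with an Avro source is listening on the hypothetical host and port below, with the agent's channel and HDFS sink wired up in its configuration file.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
  public static void main(String[] args) {
    // Hypothetical Flume agent with an Avro source listening on port 41414.
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
    try {
      // Each log line becomes a Flume event that the agent forwards on to HDFS.
      Event event = EventBuilder.withBody("GET /index.html 200", StandardCharsets.UTF_8);
      client.append(event);
    } catch (EventDeliveryException e) {
      e.printStackTrace();
    } finally {
      client.close();
    }
  }
}
```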
Orchestration, Management, Monitoring:
Scheduler – Oozie: Oozie is integrated with Hadoop and can be used to schedule Hadoop jobs. It is a workflow library that lets us connect the dots between essential Hadoop projects such as Hive, Pig, and Sqoop. For instance, if we want to run a Pig script, kick off a Hive query once it completes, and then start a Sqoop job, Oozie allows us to chain these steps together.
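As a sketch of how such a chain might be submitted programmatically, the code below uses Oozie's Java client; the Oozie URL and the HDFS path are hypothetical, and the actual Pig, Hive, and Sqoop actions are assumed to be defined in a workflow.xml stored under that path.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Oozie server URL.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // The workflow.xml under this HDFS path defines the Pig -> Hive -> Sqoop actions.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/pipeline-wf");
    conf.setProperty("queueName", "default");

    String jobId = oozie.run(conf);
    System.out.println("Submitted workflow: " + jobId);

    // Poll until the workflow leaves the PREP/RUNNING states.
    WorkflowJob.Status status = oozie.getJobInfo(jobId).getStatus();
    while (status == WorkflowJob.Status.PREP || status == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10_000);
      status = oozie.getJobInfo(jobId).getStatus();
    }
    System.out.println("Final status: " + status);
  }
}
```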
Management – Zookeeper: ZooKeeper is a distributed service coordinator. It is a way to keep all the services running in your cluster in sync, and it helps in the synchronization of the various services by providing a centralized management point. ZooKeeper holds the health reports of all the nodes in all the clusters, and you can also add nodes to your cluster with its help.
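To illustrate the coordination role, here is a minimal sketch with the ZooKeeper Java client in which a service registers itself under a hypothetical /services path as an ephemeral node; the node disappears automatically if the service goes down, so other components always see an up-to-date membership list.

```java
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a hypothetical ZooKeeper ensemble; the watcher receives session events.
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        System.out.println("ZooKeeper event: " + event.getState());
      }
    });

    // Register this service as an ephemeral node (the parent /services znode
    // is assumed to exist); it vanishes if the service's session goes away.
    zk.create("/services/worker-1", "host=worker-1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Any coordinator can see which services are currently alive.
    List<String> alive = zk.getChildren("/services", false);
    System.out.println("Live services: " + alive);

    zk.close();
  }
}
```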
Administration – Ambari: Ambari helps us provision a cluster, which means we can install services like Hive, Pig, Oozie, HBase, and Sqoop across all the nodes in a cluster. Ambari lets us manage all the services in the cluster, such as stopping and starting services, from one centralized place. Ambari also has a nice web UI dashboard that helps us monitor the health of Hadoop clusters.
Ankrish
MBA-IT
IIIT Allahabad