<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>B&#039;Cognizance &#187; Tech-Hive</title>
	<atom:link href="http://bcognizance.iiita.ac.in/archive/aug-nov14/?cat=12&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://bcognizance.iiita.ac.in/archive/aug-nov14</link>
	<description></description>
	<lastBuildDate>Fri, 10 Apr 2015 04:56:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Responsive Web Designing</title>
		<link>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=874</link>
		<comments>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=874#comments</comments>
		<pubDate>Fri, 14 Nov 2014 10:46:37 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech-Hive]]></category>

		<guid isPermaLink="false">http://bcognizance.iiita.ac.in/archive/aug-nov14/?p=874</guid>
		<description><![CDATA[Target readers: Website Designers and Developers, Programmers Keywords: CSS, HTML, Frameworks Introduction Responsive web design (RWD) is a web development approach that emphasizes user experience: it makes reading easier, with minimal scrolling, clicking and resizing, by responding to the size of the browser window or device while maintaining a single code base. Why do we<p class="readmore"> <a href="https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=874" title="Read Responsive Web Designing">  CONTINUE READING ...</a> </p>]]></description>
			<content:encoded><![CDATA[<p><strong>Target readers:</strong> <em>Website Designers and Developers, Programmers</em><br />
<strong>Keywords:</strong> <em>CSS, HTML, Frameworks</em></p>
<p><strong>Introduction</strong></p>
<p>Responsive web design (RWD) is a web development approach that emphasizes user experience: it makes reading easier, with minimal scrolling, clicking and resizing, by responding to the size of the browser window or device while maintaining a single code base.</p>
<p><strong>Why do we need RWD?</strong></p>
<p>Web development has been around for as long as the internet itself. So why was a new concept, one that led to the redesign of thousands of existing websites, suddenly introduced and welcomed by designers? The emergence of RWD is essentially an outcome of the astonishing growth of smartphones and tablets in the market. Every month, more and more people switch from desktops to mobiles and tablets for surfing the web.</p>
<a href="http://bcognizance.iiita.ac.in/archive/aug-nov14/wp-content/uploads/2014/11/11.bmp"><img src="http://bcognizance.iiita.ac.in/archive/aug-nov14/archive/aug-nov14/wp-content/uploads/2014/11/11.bmp" alt="RWD" title="RWD" class="aligncenter size-full wp-image-877" /></a>
<p>To make the same website look good across all devices, there are two approaches:<br />
<strong>1.</strong> <em>Maintain a separate code base for each class of device.</em> Here you have to develop and maintain several code bases, which leads to high development and maintenance costs.<br />
<strong>2.</strong> <em>Use the responsive web design approach.</em> You design for three device sizes but develop a single HTML code base that makes the same website look good on large desktop monitors, on small smartphones, and on anything in between.</p>
<p><strong>How to create RWD?</strong></p>
<p>The same HTML is served to all devices, and CSS (which determines the layout of the webpage) is used to change the appearance of the page. RWD relies on proportion-based grids to rearrange content and design elements. For example, media queries are used to apply certain CSS styles only to devices, such as mobile phones, whose screen width falls below a threshold:</p>
<p>@media only screen and (max-width: 500px) {<br />
	#wrapper #nav { margin: 0 0 0 -160px; }<br />
}</p>
<p><strong>Popular responsive CSS frameworks</strong></p>
<p>Several CSS frameworks are available that make responsive development easier. The most popular are:<br />
1.	Bootstrap.<br />
2.	Foundation.<br />
3.	Gumby.<br />
4.	Skeleton.</p>
<blockquote><p><strong><em>Richa Deshwal<br />
Deloitte | Bengaluru Area<br />
Email: deshwalricha8@gmail.com<br />
</em></strong></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>https://bcognizance.iiita.ac.in/archive/aug-nov14/?feed=rss2&#038;p=874</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>3D Printing an emerging technology</title>
		<link>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=755</link>
		<comments>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=755#comments</comments>
		<pubDate>Fri, 14 Nov 2014 06:02:41 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech-Hive]]></category>

		<guid isPermaLink="false">http://bcognizance.iiita.ac.in/archive/aug-nov14/?p=755</guid>
		<description><![CDATA[3D printing or additive manufacturing (AM) refers to the various processes used to print a three-dimensional object. Primarily additive processes are used, in which successive layers of material are laid down under computer control. These objects can be of almost any shape or geometry, and are produced from a 3D model or other electronic data source. A 3D<p class="readmore"> <a href="https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=755" title="Read 3D Printing an emerging technology">  CONTINUE READING ...</a> </p>]]></description>
			<content:encoded><![CDATA[<p>3D printing or additive manufacturing (AM) refers to the various processes used to print a three-dimensional object. Primarily additive processes are used, in which successive layers of material are laid down under computer control. These objects can be of almost any shape or geometry, and are produced from a 3D model or other electronic data source. A 3D printer is a type of industrial machine.</p>
<p>As for the general principles of 3D printing: 3D models can be created with a CAD package or via a 3D scanner. The manual modelling process of preparing geometric data for 3D computer graphics is similar to plastic arts such as sculpting. 3D scanning is the process of analyzing and collecting digital data on the shape and appearance of a real object, from which a printable model can then be produced. Both manual and automatic creation of 3D printable models is difficult for the average consumer, which is why several 3D printing marketplaces have emerged in recent years.</p>
<p>Applications of 3D printing lie in fields such as product development, data visualization, rapid prototyping and specialized manufacturing. The technology is also expanding into fields like job production, industrial design, the toy industry, engineering design, automotive design, dentistry, the military and more.</p>
<p>The most recent development is that 3D printing is expanding towards space and aeronautics, where some satellite components can be replaced by components printed in place; this means astronauts can now print satellite parts in space with the help of highly advanced 3D printers.</p>
<p><strong>References:</strong> <em>Wikipedia, journals, informative websites</em></p>
<blockquote><p><strong><i>Shrut Kirti Nandan<br />
MBA-IT<br />
IIIT Allahabad</i></strong></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>https://bcognizance.iiita.ac.in/archive/aug-nov14/?feed=rss2&#038;p=755</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop</title>
		<link>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=749</link>
		<comments>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=749#comments</comments>
		<pubDate>Fri, 14 Nov 2014 06:00:50 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech-Hive]]></category>

		<guid isPermaLink="false">http://bcognizance.iiita.ac.in/archive/aug-nov14/?p=749</guid>
		<description><![CDATA[Target readers: All Keywords: Big Data, Hadoop, Architecture Introduction of Hadoop: Apache Hadoop is a software framework that supports applications that work with intensive amounts of data. It allows applications to work across thousands of machines and with petabytes of data. Hadoop was created by Doug Cutting and Michael J. Cafarella. Hadoop is an Apache project that<p class="readmore"> <a href="https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=749" title="Read Hadoop">  CONTINUE READING ...</a> </p>]]></description>
			<content:encoded><![CDATA[<p><strong>Target readers:</strong> <em>All</em><br />
<strong>Keywords:</strong> <em>Big Data, Hadoop, Architecture</em></p>
<p><strong>Introduction of Hadoop:</strong></p>
<p>Apache Hadoop is a software framework that supports applications that work with intensive amounts of data. It allows applications to work across thousands of machines and with petabytes of data. Hadoop was created by Doug Cutting and Michael J. Cafarella. Hadoop is an Apache project that is used globally and is written in Java.</p>
<p><strong>Architecture of HADOOP:</strong></p>
<p>Hadoop includes a common layer known as Hadoop Common, which provides access to the file systems supported by Hadoop. Since Hadoop uses Java as its programming language, Hadoop Common contains the needed JAR files.</p>
<p>Each supported file system provides location awareness, which enables better scheduling: Hadoop can schedule work close to where the data lives. HDFS (the Hadoop Distributed File System) is used in Hadoop to replicate data across different nodes. The main purpose of this replication is to keep the data usable even during a node or network failure.</p>
<p>A Hadoop cluster consists of a single master node and multiple worker nodes. The master node runs the JobTracker, TaskTracker, NameNode and DataNode, while a worker node acts as both a DataNode and a TaskTracker. </p>
<p>In a large Hadoop cluster, HDFS is managed by a dedicated primary NameNode that hosts the file-system index and a secondary NameNode that takes snapshots of the NameNode’s memory structures, reducing the amount of data loss and file-system corruption. A single JobTracker manages the scheduling of jobs. </p>
<p><strong>File systems in HADOOP:</strong></p>
<p><strong>HDFS:</strong></p>
<p>HDFS is designed to run on commodity hardware. It is similar to existing distributed file systems, but there are some differences that set it apart: it is highly tolerant of failures and can operate on low-cost hardware. HDFS provides high-throughput access to data and is appropriate for applications that deal with Big Data.</p>
<p>HDFS has provisions for protecting its metadata in case the NameNode fails. It includes a secondary NameNode, a name that misleads people into thinking it takes over from the primary NameNode on failure; in fact it only takes periodic snapshots of the NameNode’s memory structures. The advantage of these snapshots is that if the primary NameNode goes offline, the entire directory can be rebuilt from the snapshot without rerunning the complete cycle.</p>
<p>HDFS also creates awareness between the JobTracker and the TaskTrackers: the JobTracker schedules jobs to TaskTrackers with knowledge of where the data is located. This helps reduce redundant transfer of data across the network.</p>
<p><strong>Multi Node Hadoop Cluster:</strong></p>
<a href="http://bcognizance.iiita.ac.in/archive/aug-nov14/wp-content/uploads/2014/11/2.jpg"><img src="http://bcognizance.iiita.ac.in/archive/aug-nov14/archive/aug-nov14/wp-content/uploads/2014/11/2.jpg" alt="" title="2" width="645" height="500" class="aligncenter size-full wp-image-750" /></a>
<p><strong>HDFS Architecture:</strong></p>
<a href="http://bcognizance.iiita.ac.in/archive/aug-nov14/wp-content/uploads/2014/11/3.jpg"><img src="http://bcognizance.iiita.ac.in/archive/aug-nov14/archive/aug-nov14/wp-content/uploads/2014/11/3.jpg" alt="" title="3" width="727" height="567" class="aligncenter size-full wp-image-751" /></a>
<p><strong>Block Replication in HDFS:</strong></p>
<p>Block replication is a distinctive feature of the HDFS file system. It allows the data to be recovered in case a DataNode fails.</p>
<a href="http://bcognizance.iiita.ac.in/archive/aug-nov14/wp-content/uploads/2014/11/4.jpg"><img src="http://bcognizance.iiita.ac.in/archive/aug-nov14/archive/aug-nov14/wp-content/uploads/2014/11/4.jpg" alt="" title="4" width="718" height="170" class="aligncenter size-full wp-image-752" /></a>
<p><strong>Disadvantages of Hadoop:</strong></p>
<p>Generally, complex queries with multiple joins cannot be supported in Hadoop. Normalized data also cannot be used directly; the developer has to denormalize it. In other words, Hadoop is not well suited to OLTP workloads.<br />
•	No uniformity.<br />
•	Access control in Hadoop is insufficient.<br />
•	Programs become more complicated over time.<br />
•	Hadoop systems cannot support relational features.<br />
•	Hadoop does not perform well in real-time scenarios.</p>
<blockquote><p><strong><i>Anshruta<br />
MBA-IT<br />
IIIT Allahabad<br />
</i></strong></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>https://bcognizance.iiita.ac.in/archive/aug-nov14/?feed=rss2&#038;p=749</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Ecosystem</title>
		<link>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=746</link>
		<comments>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=746#comments</comments>
		<pubDate>Fri, 14 Nov 2014 05:53:00 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech-Hive]]></category>

		<guid isPermaLink="false">http://bcognizance.iiita.ac.in/archive/aug-nov14/?p=746</guid>
		<description><![CDATA[Target readers: All Big Data Aspirants Keywords: Hadoop, HDFS, Map or Reduce We have huge volumes of untapped information in unstructured documents spread across the networks. This data helps us to create new products, refine existing products, discover new trends &#038; moreover helps in understanding our business. Hadoop is a distributed framework which is designed<p class="readmore"> <a href="https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=746" title="Read Hadoop Ecosystem">  CONTINUE READING ...</a> </p>]]></description>
			<content:encoded><![CDATA[<p><strong>Target readers:</strong> <em>All Big Data Aspirants</em><br />
<strong>Keywords:</strong> <em>Hadoop, HDFS, Map or Reduce</em></p>
<p>We have huge volumes of untapped information in unstructured documents spread across our networks. This data helps us create new products, refine existing products, discover new trends and, moreover, understand our business. Hadoop is a distributed framework designed to handle such huge volumes, literally petabytes of data. It helps in processing large datasets in a scalable and fault-tolerant manner. </p>
<a href="http://bcognizance.iiita.ac.in/archive/aug-nov14/wp-content/uploads/2014/11/1.png"><img src="http://bcognizance.iiita.ac.in/archive/aug-nov14/archive/aug-nov14/wp-content/uploads/2014/11/1.png" alt="" title="1" width="631" height="533" class="aligncenter size-full wp-image-747" /></a>
<p>As we see in the picture above, the core of Hadoop is HDFS &#038; Map Reduce. HDFS provides distributed storage across clusters consisting of commodity machines. Map Reduce is the programming model we use to process the voluminous amounts of data stored across those clusters. There are many projects inside the Hadoop ecosystem, each designed to solve a specific problem, and most of them are still in the incubation stage. In this article we will look at the projects that have graduated from the incubator stage and have already become part of Hadoop implementations. </p>
<p><strong>Data Processing:</strong>  Hive &#038; Pig are the data processing libraries available to help us process data stored in HDFS. The main reason we need these different ways of processing data inside Hadoop is that there are not many professionals who are low-level Java, Python or C/C++ programmers able to write Map/Reduce jobs for fetching data from HDFS and processing it. Moreover, operations like filtering, joining and grouping, which we normally express in SQL, are difficult and time-consuming to implement in programming languages like Java.</p>
<p><strong>Hive:</strong> Hive is an open source data processing project from the Apache Software Foundation. It is a data warehouse built on top of Hadoop. It provides a query language called HiveQL, which is very similar to SQL. Hive offers a way to project structure onto the large datasets residing in the Hadoop Distributed File System and to manage them using HiveQL. Hive converts the queries we write into Map/Reduce jobs and submits them to the cluster. Hive provides access to data stored in HDFS or in other stores such as HBase. Hive is designed for OLAP, not for OLTP.</p>
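<p>As a rough sketch of how HiveQL can be issued from a Java application over JDBC (the HiveServer2 address, credentials and the web_logs table below are assumptions for the example):</p>
<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver and connection URL.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // Hive compiles this HiveQL into Map/Reduce jobs and runs them on the cluster.
        ResultSet rs = stmt.executeQuery(
                "SELECT year, COUNT(*) FROM web_logs GROUP BY year");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}
</pre>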
<p><strong>Pig</strong>: Pig is a high-level data flow scripting language. Pig was initially developed at Yahoo! The core components of Pig are Pig Latin and the Pig compiler. Pig Latin is the programming language; the Pig runtime/compiler compiles Pig Latin, converts it into Map/Reduce jobs and submits them to the cluster. Pig helps you analyze the data residing in HDFS even if you have no knowledge of Map Reduce concepts.</p>
<p><strong>Data Storage:</strong></p>
<p><strong>HBase</strong>: Apache HBase is an open source implementation of Google’s Big Table.  It is a database that sits on top of the Hadoop Distributed File System: a scalable, non-relational, distributed, column-oriented, multi-dimensional &#038; highly available database. It is a NoSQL database, which means that its underlying structure is not as strict (schema-oriented) as in traditional relational databases; it is very flexible, and that makes it very scalable. HBase provides an efficient way of storing both structured and semi-structured data; it is also capable of storing large amounts of sparse data (data with many empty values).  HBase is best suited to application areas that need random, real-time read/write access to large volumes of data. You can write Pig and Hive queries against the data residing in HBase tables.</p>
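<p>A minimal sketch of a random write followed by a random read using the HBase Java client (the older HTable-based API; the table name "users" and column family "info" are assumptions for the example):</p>
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // assumed existing table

        // Write one cell: row key, column family, qualifier, value.
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Random, real-time read of the same row.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}
</pre>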
<p><strong>Cassandra</strong>: The Cassandra database has its roots in Amazon’s Dynamo data store. It was originally developed at Facebook; after Facebook open sourced the code, Cassandra became a top-level Apache project. It provides real-time, interactive transaction processing on top of Hadoop. It is also a NoSQL database, designed for high availability of large volumes of data spanning clusters and data centres with no single point of failure. If the application area needs high availability, scalability &#038; high-performance access to Hadoop data, Cassandra suits it best.</p>
<p><em>Note:</em> As many of the features and details of these databases overlap with each other, it is up to us to smoke-test each of these technologies first and find out which one better suits our application area.</p>
<p><strong>Data Serialization (Avro &#038; Thrift)</strong>: Serialization is a way to take data from an application, package it into a format that we can either store on disk or transfer/exchange with another application, and then unpack and deserialize it into a format the receiver understands. Most of the time data is serialized as XML, JSON or some binary format.</p>
<p><strong>Avro</strong> is a generic data serialization and exchange framework. Thrift is a language-neutral serialization framework; it is geared towards creating flexible schemas that work with Hadoop data and is meant for cross-language compatibility. If you build an application that works with Hadoop data in Java, applications built in Ruby, Python, C++ or JavaScript can consume those same objects.</p>
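<p>A minimal Java sketch of Avro’s generic serialization API, writing one record into a schema-tagged binary container file (the "User" schema and the output file name are assumptions for the example):</p>
<pre>
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for a user record.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Serialize the record into an Avro container file on disk.
        DatumWriter&lt;GenericRecord&gt; datumWriter = new GenericDatumWriter&lt;GenericRecord&gt;(schema);
        DataFileWriter&lt;GenericRecord&gt; fileWriter = new DataFileWriter&lt;GenericRecord&gt;(datumWriter);
        fileWriter.create(schema, new File("users.avro"));
        fileWriter.append(user);
        fileWriter.close();
    }
}
</pre>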
<p><strong>Data Intelligence – Mahout</strong>: Mahout is a machine learning algorithm library that covers the three C’s: 1) Collaborative filtering (recommendation), 2) Clustering – a way to group related documents, 3) Classification – a way to categorize related documents. Mahout is mainly used in areas where predictive analysis or recommendations need to be made using previous trends in the data.</p>
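<p>A small sketch of the collaborative filtering case using Mahout’s Taste recommender API (the ratings file, neighborhood size and user ID are assumptions for the example):</p>
<pre>
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,preference (assumed input file)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommended items for user 42, based on similar users.
        List&lt;RecommendedItem&gt; items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
</pre>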
<p><strong>Data Extraction Tools:</strong></p>
<p><strong>Sqoop</strong>: Sqoop, a top-level Apache project, is used for importing data from relational databases such as Oracle and MySQL into HDFS storage &#038; vice versa. For instance, rather than keeping the results of a Map/Reduce job in HDFS, we can send those results to the relational world so that data professionals can do their own analysis. Sqoop is useful for pushing bulk loads of data from HDFS to relational databases &#038; also for pushing data from the relational world into Hadoop for archiving and other purposes. Sqoop can be used to integrate Hadoop with various relational databases such as Oracle, MySQL and Teradata.  </p>
<p><strong>Flume</strong>: Apache Flume is an application which is useful for streaming large volumes of data from various web sources on the internet to Hadoop Distributed File System. Flume helps in real time log processing. For instance, the huge amounts of log data that is generated by the web servers can be pushed to HDFS in real time, stored and analyzed thus helping the users to obtain some meaningful information. Flume ensures there is no data loss during the streaming process. It is very reliable.</p>
<p><strong>Orchestration, Management, Monitoring:</strong></p>
<p><strong>Scheduler &#8211; Oozie</strong>:  Oozie is integrated with Hadoop and can be used for scheduling Hadoop jobs. It is a workflow library that allows us to connect the dots between essential Hadoop projects like Hive, Pig and Sqoop. For instance, if we want to run a Pig script, kick off a Hive query once it completes, and then start a Sqoop job, Oozie allows us to do exactly that.</p>
<p><strong>Management &#8211; Zookeeper</strong>: Zookeeper is a distributed service coordinator. It keeps all the services running in your cluster in sync, helping to synchronize the various services by providing a centralized management point. Zookeeper holds the health reports of all the nodes in all the clusters.  You can also add nodes to your cluster with the help of Zookeeper. </p>
<p><strong>Administration &#8211; Ambari</strong>: Ambari helps us provision a cluster, which means we can install services like Hive, Pig, Oozie, HBase and Sqoop across all the nodes in a cluster. Ambari lets us manage all the services in the cluster, such as stopping &#038; starting services from one centralized place.  Ambari also has a nice web UI dashboard that helps us monitor the health of Hadoop clusters.</p>
<blockquote><p><strong><i>Ankrish<br />
MBA-IT<br />
IIIT Allahabad<br />
</i></strong></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>https://bcognizance.iiita.ac.in/archive/aug-nov14/?feed=rss2&#038;p=746</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Map Reduce Framework</title>
		<link>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=744</link>
		<comments>https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=744#comments</comments>
		<pubDate>Fri, 14 Nov 2014 05:33:33 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tech-Hive]]></category>

		<guid isPermaLink="false">http://bcognizance.iiita.ac.in/archive/aug-nov14/?p=744</guid>
		<description><![CDATA[Target readers: All, Developers, Programmers Keywords: Hadoop, HDFS, Map Reduce Introduction Hadoop In this era we have a huge amount of data to manage. As per an IDC estimate, the total size of our “digital Universe” is around 2 zettabytes (1 zettabyte = 2<sup>70</sup> bytes). The enormous amount of data present today subsumes the data generated<p class="readmore"> <a href="https://bcognizance.iiita.ac.in/archive/aug-nov14/?p=744" title="Read Map Reduce Framework">  CONTINUE READING ...</a> </p>]]></description>
			<content:encoded><![CDATA[<p><strong>Target readers:</strong> <em>All, Developers, Programmers</em><br />
<strong>Keywords:</strong> <em>Hadoop, HDFS, Map Reduce</em></p>
<p><strong>Introduction</strong></p>
<p><strong>Hadoop</strong><br />
In this era we have a huge amount of data to manage. As per an IDC estimate, the total size of our “digital Universe” is around 2 zettabytes (1 zettabyte = 2<sup>70</sup> bytes). The enormous amount of data present today subsumes data generated by both machines and people: machine logs, vehicle GPS units, RFID readers and the like all contribute to this huge mountain of data. Storing and analyzing this vast data is a major concern. The speed at which data can be read from hard drives has failed to keep up with the drastic increase in their storage capacity. For instance, if 1 TB (2<sup>40</sup> bytes) of data is read at a speed of 100 MB/sec, it takes roughly 3 hours to complete the process. The solution to the above problem is <strong>Hadoop</strong>. Hadoop provides reliable shared storage along with analysis capability.<strong> The storage is handled by HDFS and the analysis by Map Reduce.</strong></p>
<p><strong>Map Reduce </strong><br />
Map Reduce is a programming framework that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. Map Reduce is a batch query processor, capable of running an ad hoc query over a given dataset and getting the result in a reasonable amount of time.<br />
<strong><br />
RDBMS vs. Map Reduce</strong><br />
1. A traditional RDBMS is capable of handling data in terms of gigabytes, whereas Map Reduce can handle data of sizes up to petabytes (2<sup>50</sup> bytes).<br />
2. Traditional databases work well for structured data (data in a defined format, e.g. XML documents). Map Reduce, on the other hand, also works well on unstructured data (such as plain text or image data).<br />
3. Map Reduce is linearly scalable. There are two functions, Map and Reduce, that define a mapping from one set of key-value pairs to another. These functions do not depend on the size of the data. This is not true for relational databases.<br />
4. Traditional databases are normalized to retain integrity and reduce redundancy, but normalization causes problems for Map Reduce, since reading a normalized record becomes a non-local operation.</p>
<p><strong>Grid Computing vs. Map Reduce</strong><br />
Grid computing works well for compute-intensive jobs. However, it does not work as desired when nodes have to access larger amounts of data, because the network bandwidth becomes the bottleneck. Map Reduce tries to collocate the data with the compute nodes, which leads to faster data access; Map Reduce gives good performance because of this data locality. MapReduce also models the network topology in a fashion that helps conserve network bandwidth. In large-scale distributed computing, coordinating processes is a challenge: problems like remote process failure and partial failure have to be handled properly. With Map Reduce, the programmer does not have to worry about such failures, since the implementation automatically detects failed map or reduce tasks and schedules replacements. This is possible because Map Reduce is a shared-nothing architecture, which means that tasks have no dependence on each other. </p>
<p><strong>Power of Hadoop</strong><br />
A wide range of algorithms can be expressed in Map Reduce. Problems such as graph-based problems and image processing can be solved using Hadoop. Using Hadoop, a team at Yahoo! was able to sort 1 terabyte (2<sup>40</sup> bytes) of data in 62 seconds. Hadoop is mostly known for Map Reduce and its distributed file system, but the name also covers a number of related projects in the area of large-scale data processing and distributed computing.</p>
<p><strong>MapReduce</strong><br />
Map Reduce is a programming framework for processing data. Hadoop can run Map Reduce programs written in different languages such as Java, Ruby, Python and C++. Hadoop also provides parallel processing, which makes large-scale data analysis very simple. We can take advantage of this by writing our query as a Map Reduce job. Map Reduce breaks the process into two phases:<br />
1.	Map Phase<br />
2.	Reduce Phase</p>
<p>In each phase, both the input and the output are represented by key-value pairs. The programmer has to specify:<br />
1.	The types of the key-value pairs.<br />
2.	Two functions :<br />
a)	Map Function<br />
b)	Reduce Function</p>
<p><strong>Working of MapReduce Model</strong><br />
The Map function reads each line of raw input and pulls out the relevant data as a key-value pair. The Map function sends this to the Reduce function, which processes each pair accordingly. For example, suppose a certain university wants to find out the topper’s marks (among all streams and batches) for each year from 1900 to 2011, where the data is saved in different files and directories in the format:<br />
Year#Name#RollNumber#Stream#Marks#Batch#Country#PhoneNumber#Address#VehicleNumber#…..</p>
<p><strong>To visualize the working of MapReduce, let’s consider the following data,</strong></p>
<p>1950#Ram#12345#Science#78#mar1950batch#India#NA#NA#NA<br />
1950#SRam#123456#Science#75#mar1950batch#India#NA#NA#NA<br />
1950#DRam#123457#Science#98#mar1950batch#India#NA#NA#NA<br />
1950#VRam#123458#Science#68#mar1950batch#India#NA#NA#NA<br />
….<br />
1951#KRam#123459#Science#78#mar1951batch#India#NA#NA#NA<br />
1951#TRam#123451#Science#68#mar1951batch#India#NA#NA#NA<br />
1951#ERam#123452#commerce#99#mar1951cbatch#India#NA#NA#NA<br />
1951#QRam#123453#Arts#12#mar1951abatch#India#NA#NA#NA<br />
…..<br />
Let’s assume we have 75 lakhs of such data records with us.<br />
The input to the map function will be the complete set of this raw data. The Map function will extract the year and marks from each data record, like,<br />
(1950, 78)<br />
(1950, 75)<br />
(1950, 98)<br />
(1950, 68)<br />
…<br />
(1951, 78)<br />
(1951, 68)<br />
(1951, 99)<br />
(1951, 12)<br />
…</p>
<p>Before sending this data to the reduce function, the output of the Map function is processed by the MapReduce framework. In this example, the framework will process each pair and group the values by key, like,<br />
(1950, [78, 75, 98, 68, …])<br />
(1951, [78, 68, 99, 12 …])<br />
 ….</p>
<p>Now when this grouped data is given to the ‘Reduce function’, the only work of that function is to find the maximum number in each group. So the final output from the Reduce function looks like,<br />
(1950, 98)<br />
(1951, 99)<br />
….. </p>
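<p>A minimal Java sketch of this Map and Reduce pair, written against the Hadoop MapReduce API (the class names are ours; the field positions follow the record format assumed above):</p>
<pre>
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: pull (year, marks) out of each '#'-separated record.
class TopperMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split("#");
        String year = fields[0];                  // e.g. "1950"
        int marks = Integer.parseInt(fields[4]);  // e.g. 78
        context.write(new Text(year), new IntWritable(marks));
    }
}

// Reduce: for each year, keep only the maximum marks.
class TopperReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
    @Override
    protected void reduce(Text year, Iterable&lt;IntWritable&gt; marks, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable m : marks) {
            max = Math.max(max, m.get());
        }
        context.write(year, new IntWritable(max));
    }
}
</pre>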
<p><strong>Dataflow</strong><br />
Hadoop executes a job by dividing it into two kinds of tasks, i.e. map tasks and reduce tasks.<br />
The job execution is controlled by two types of nodes:<br />
1.	Multiple task trackers<br />
2.	Job tracker</p>
<p>The job tracker coordinates the jobs running on the system by scheduling tasks (dividing the jobs into tasks) to run on task trackers. Task trackers run the tasks assigned to them and send progress reports to the job tracker. Hadoop divides the input into fixed-size pieces named splits, and one map task is created for each split. The map task runs the user-defined map function for each record in its split. Once a map task has executed successfully, it writes its output to local disk. The map output is an intermediate output, which is given to the reduce tasks to produce the final output.</p>
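<p>A short driver sketch showing how such a job is configured and submitted (it assumes the hypothetical TopperMapper and TopperReducer classes sketched earlier; the input and output paths come from the command line):</p>
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopperDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "topper marks per year");
        job.setJarByClass(TopperDriver.class);
        job.setMapperClass(TopperMapper.class);
        job.setReducerClass(TopperReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input is split into fixed-size pieces; one map task runs per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>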
<p><strong>HDFS</strong><br />
HDFS is the acronym for the Hadoop Distributed File System. It is a file system for storing very large files. An HDFS cluster consists of a NameNode, which manages the file system’s metadata, and DataNodes, which store the actual data. Hadoop and HDFS are best suited for distributed storage and distributed processing. HDFS is scalable and fault tolerant. It is easily configurable, with a default configuration that suits many installations; generally, the configuration needs to be tuned only for extremely large clusters. </p>
<p>Some of the salient features of HDFS are:-</p>
<p>o	<strong>Rack awareness</strong>: a node’s physical location is taken into account when scheduling tasks as well as when allocating storage.<br />
o	<strong>Safe mode</strong>: an administrative mode used for maintenance.<br />
o	<strong>Upgrade and rollback</strong>: it is possible to roll HDFS back to its state before an upgrade in case of unexpected problems.<br />
o	<strong>Backup node</strong>: an extension of the Checkpoint node. Along with checkpointing, it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which always stays in sync with the active NameNode’s namespace state. Only one Backup node may be registered with the NameNode at a time. </p>
<blockquote><p><strong><i>Abhishek Malik<br />
MBA-IT<br />
IIIT Allahabad</i></strong></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>https://bcognizance.iiita.ac.in/archive/aug-nov14/?feed=rss2&#038;p=744</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
