Posts

Showing posts from May, 2018

Hadoop Architecture version 1.x

Hadoop 1.x architecture is split into 3 categories of nodes:

Client/Edge Node
Name Node and Job Tracker Node
Data Nodes and Task Tracker Nodes

Client/Edge Node: It is the interface between the Hadoop cluster and the outside network; for this reason, edge nodes are sometimes referred to as gateway nodes. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The Name Node responds to successful requests by returning a list of the relevant Data Node servers where the data lives.

Usage: Edge nodes are often used as staging areas for data being transferred into the Hadoop cluster. Most commonly, they are used to run client applications and cluster administration tools; tools like Oozie, Pig, and Sqoop, and management tools such as Hue and Ambari, run well there. The client then reads from or writes to the Data Nodes directly, based on the location information returned by the Name Node.

Master/Slave: Here, we have a concept of Master...
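The read path described above (ask the Name Node for block locations, then fetch the blocks directly from the Data Nodes) can be sketched with a toy simulation. All class and method names here are illustrative, not the real HDFS API:

```python
# Toy simulation of the HDFS read path: the NameNode holds only metadata
# (file -> block -> DataNode locations); the client fetches block
# contents directly from the DataNodes.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self):
        self.metadata = {}        # filename -> list of (block_id, DataNode)

    def add_file(self, filename, block_locations):
        self.metadata[filename] = block_locations

    def locate(self, filename):
        # Returns metadata only -- no file contents pass through the NameNode.
        return self.metadata[filename]


def client_read(namenode, filename):
    # 1. Ask the NameNode where the blocks live.
    # 2. Read each block directly from its DataNode.
    return b"".join(dn.read(block_id) for block_id, dn in namenode.locate(filename))


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.store("blk_1", b"Hello, ")
dn2.store("blk_2", b"Hadoop!")

nn = NameNode()
nn.add_file("/user/demo/greeting.txt", [("blk_1", dn1), ("blk_2", dn2)])

print(client_read(nn, "/user/demo/greeting.txt"))  # b'Hello, Hadoop!'
```

Note that the heavy data transfer never touches the Name Node, which is exactly why it can serve a large cluster while holding only metadata in memory.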

Distributed Computing

What is meant by Distributed Computing? Distributed Computing refers to the mechanism that deals with both the storage and the processing of data in a distributed environment called a cluster. In Hadoop, Distributed Computing is done in the following manner:

Storage of Data - HDFS / S3 / Azure
Processing of Data - MapReduce / Spark / Tez
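The processing side can be illustrated with the classic word-count example, written here in plain Python to mimic the map and reduce phases. This is a sketch of the programming model, not actual MapReduce API code:

```python
# Word count, mimicking MapReduce's map and reduce phases in plain Python.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["Hadoop stores data", "Spark and Hadoop process data"]
result = reduce_phase(map_phase(lines))
print(result["hadoop"])  # 2
print(result["data"])    # 2
```

In a real cluster, the map tasks run in parallel on the nodes that hold the input blocks, and the framework shuffles the intermediate pairs to the reduce tasks; the per-record logic is the same as above.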

Categories of Data in Big Data Technology

In Big Data, we have 3 different types of data to deal with.

Structured Data: Any data which has a definite, strict schema associated with it is called Structured Data. For example: data in tabular form, like a CSV file or a relational table.

Semi-Structured Data: Any data which does not have a strict schema associated with it, but from which we can still make some sense, is called Semi-Structured Data. For example: XML and JSON files.

Unstructured Data: Any data which does not have any schema associated with it is called Unstructured Data. For example: tweets, audio, video etc.

Tools based on their handling of the different categories of data:
Structured Data: Hive, Pig and Sqoop
Semi-Structured Data: Hive, Pig
Unstructured Data: Pig
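The difference between structured and semi-structured data can be seen with Python's standard library: every CSV row maps onto one fixed column schema, while JSON records may nest fields and omit them freely. The sample data below is made up for illustration:

```python
import csv
import io
import json

# Structured: every CSV row follows the same fixed schema (name, age).
csv_text = "name,age\nalice,30\nbob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])                 # alice

# Semi-structured: JSON records need not share a schema --
# the second record nests an address and omits "age" entirely.
json_text = '[{"name": "alice", "age": 30}, {"name": "bob", "address": {"city": "Pune"}}]'
records = json.loads(json_text)
print(records[1]["address"]["city"])   # Pune
print("age" in records[1])             # False
```

Unstructured data (audio, video, free-form text) has no such self-describing fields at all, which is why only schema-flexible tools like Pig are listed for it above.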

Hadoop Components

Components of Hadoop

File System
HDFS (primary data storage system)
Amazon S3
Azure

Execution Engine
MapReduce - available since Hadoop version 1
Apache Spark - available since Hadoop version 2 (part of both the Hortonworks and Cloudera distributions)
Apache Tez - available since Hadoop version 2 (part of the Hortonworks distribution, but not Cloudera)

Ecosystem Tools
Hive - data warehousing tool on HDFS (used primarily for Structured Data and, to some extent, for Semi-Structured Data)
Pig - ETL tool where the source is another system and the target is HDFS (used for Structured, Semi-Structured and Unstructured Data)
Sqoop - tool for importing/exporting data to and from relational databases like Oracle (used for Structured Data)
Oozie - tool for monitoring workflows and scheduling jobs
Zookeeper - configuration manager