Hadoop Architecture version 1.x
Hadoop Architecture is split into three categories:
Client/Edge Node:
It is the interface between the Hadoop cluster and the outside network; for this reason, these machines are sometimes referred to as gateway nodes. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The Name Node responds to successful requests by returning a list of the relevant Data Node servers where the data lives.
Usage
- They are often used as staging areas for data being transferred into the Hadoop cluster.
- Most commonly, edge nodes are used to run client applications and cluster administration tools. Tools like Oozie, Pig, Sqoop, and management tools such as Hue and Ambari run well there.
- After receiving block locations from the Name Node, the Client Node talks to the Data Nodes directly to read and write the actual data.
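The read flow described above can be sketched as a toy model. This is illustrative only; the class and method names (NameNode, DataNode, client_read) are simplified stand-ins, not the real HDFS API:

```python
class NameNode:
    """Holds only metadata: which Data Nodes store each block of a file."""
    def __init__(self):
        self.block_locations = {}  # filename -> list of (block_id, [datanode names])

    def add_file(self, filename, blocks):
        self.block_locations[filename] = blocks

    def locate(self, filename):
        # Responds to a successful request with the relevant Data Nodes.
        return self.block_locations[filename]

class DataNode:
    """Stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

def client_read(namenode, datanodes, filename):
    # 1. The client asks the Name Node where each block lives.
    # 2. The client then fetches each block directly from a Data Node.
    data = b""
    for block_id, locations in namenode.locate(filename):
        data += datanodes[locations[0]].blocks[block_id]
    return data
```

The point of the sketch is the division of labour: the Name Node never sees the file contents, only the mapping from blocks to Data Nodes.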
Master/Slave:
Here, we have a concept of Master and Slave, where the Name Node and the Job Tracker are master nodes. The Data Nodes are slaves of the Name Node, and the Task Tracker nodes are slaves of the Job Tracker node.
In Hadoop, the default file system is Hadoop Distributed File System (HDFS). A file system is a system that manages storage and retrieval of files on top of physical storage.
Block Division in HDFS
HDFS divides files into blocks, each 64 MB by default in Hadoop 1.x (later versions raised the default to 128 MB). By comparison, a typical block in a UNIX file system is 4 KB. The block size in Hadoop is so large so that data on the scale of terabytes or petabytes fits into a feasible number of blocks; with small blocks, read/write operations would be an extremely time-consuming process.
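The block-count arithmetic above can be made concrete. A minimal sketch, assuming the 64 MB Hadoop 1.x default block size and a 4 KB local-filesystem block for comparison:

```python
import math

def num_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_bytes / block_size_bytes)

ONE_GB = 1024 ** 3
HDFS_BLOCK = 64 * 1024 ** 2   # 64 MB, the Hadoop 1.x default
UNIX_BLOCK = 4 * 1024         # 4 KB, a typical local-filesystem block size

# A 1 GB file needs only 16 HDFS blocks, versus 262,144 4-KB blocks.
```

At petabyte scale the gap is what keeps the Name Node's per-block metadata manageable.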
HDFS Management
HDFS includes a set of Java processes that manage and orchestrate Hadoop's distributed file system. HDFS stores data sequentially, and processing is done in parallel.
Name Node:
Name Node is the heart and master of Hadoop. It maintains the namespace of Hadoop's file system, and it is the entry point for accessing storage in Hadoop through that file system. The Name Node itself does not store HDFS data blocks; it holds only metadata.
Block Management
Name Node assigns an alphanumeric block id to each block in Hadoop's file system. The Name Node's job is to decide on which Data Node each block should reside. For a given file, the Name Node determines how many splits or blocks will be created, based on how many Data Nodes are available, and passes this information to the Client Node. The Name Node is concerned with the metadata of the cluster and of the files in the cluster, and it keeps track of the health of blocks, i.e. whether they are corrupted or not.
Name Node Unavailability
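The Name Node's planning role can be sketched as a toy function. This is a simplification under stated assumptions: real HDFS placement also weighs racks and free space, and real block ids look different; the round-robin here is only for illustration:

```python
import itertools
import math

def plan_blocks(file_size, block_size, datanodes):
    """Toy version of the Name Node's job: decide how many blocks a file
    needs and on which Data Node each block should reside.
    Placement here is plain round-robin over the available Data Nodes."""
    n = math.ceil(file_size / block_size)
    nodes = itertools.cycle(datanodes)
    # The Name Node assigns an id to each block; blk_0000-style ids are
    # a made-up format for this sketch.
    return [("blk_%04d" % i, next(nodes)) for i in range(n)]
```

Only this plan, not the data itself, is what gets handed back to the Client Node.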
The Name Node is a Single Point of Failure for the HDFS cluster: HDFS in Hadoop 1.x is not a High Availability system, and when the Name Node goes down, the file system goes offline. Hadoop version 1 offers an optional Secondary Name Node that can be hosted on a separate machine. Despite its name, it is not a hot standby: it periodically checkpoints the Name Node's file system image, and that checkpoint can be used to recover from a Name Node failure.
Secondary Name Node Periodic Schedule
The Secondary Name Node periodically takes a copy of the Primary Name Node's file system (FS) image, somewhere around every half hour to an hour, as per the configured checkpoint interval.
Name Node Recovery
If the Primary Name Node goes down, the system administrator brings up a new Primary Name Node and copies the file system (FS) image over from the Secondary Name Node. A small amount of downtime is required while the administrator starts the new Primary Name Node. After that, the Secondary Name Node resumes checkpointing the file system of the new Primary Name Node.
Data Node:
Data Nodes are the nodes that actually store the data; they are mainly concerned with storing data in blocks. The Client/Edge Node submits requests and works with the files whose blocks live on the Data Nodes.
Heartbeat:
Each Data Node sends periodic heartbeat signals to the Name Node to report its availability, free space, and other relevant metrics. If a Data Node goes down, the Name Node asks other Data Nodes to create new copies of the lost blocks from the remaining valid replicas.
hdfs-site.xml
This configuration file sets the interval between heartbeat signals (the dfs.heartbeat.interval property; the default is 3 seconds).
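A minimal hdfs-site.xml fragment showing the property in question. The property name dfs.heartbeat.interval is the standard one in Hadoop 1.x; the value of 3 seconds shown here is the default, included only for illustration:

```xml
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <!-- Seconds between Data Node heartbeats; 3 is the default. -->
    <value>3</value>
  </property>
</configuration>
```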
Block Level Report
Every 10th heartbeat signal is a larger one: it carries information about the blocks stored on the Data Node along with details about the Data Node itself. This signal is called the Block Level Report (BLR).
Heartbeat Loss
If the Name Node does not receive a heartbeat signal from a Data Node within 3 minutes, it assumes the Data Node is down and re-balances the cluster by re-replicating that node's blocks elsewhere. If a heartbeat is later received from the same Data Node, the Name Node takes it back into the cluster, re-balances again, and asks that Data Node to take replicated blocks from the other Data Nodes.
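The dead-node check described above reduces to a timestamp comparison. A minimal sketch, assuming the Name Node keeps a map from each Data Node's name to the time of its last heartbeat (the function and variable names here are made up for illustration):

```python
HEARTBEAT_TIMEOUT = 3 * 60  # the 3-minute threshold described above, in seconds

def dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the Data Nodes whose last heartbeat is older than the timeout.
    last_heartbeat maps node name -> timestamp (seconds) of its last heartbeat."""
    return [node for node, ts in last_heartbeat.items() if now - ts > timeout]
```

In a real Name Node this check runs continuously; any node the check flags triggers re-replication of its blocks.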
Rack Awareness:
What is a Rack?
In a data center, a rack is a physical steel and electronic framework designed to house servers, networking devices, cables, and other data center computing equipment. Each node in a rack has its own power and data cabling. Hadoop uses the concept of racks and tries to place replicas in different racks as much as possible.
Job Tracker:
Job Tracker keeps track of all jobs and manages the lifecycle of all tasks. The Job Tracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.
Job Tracker Request-Response
1. Client applications submit jobs to the Job Tracker.
2. The Job Tracker talks to the Name Node to determine the location of the data.
3. The Job Tracker locates Task Tracker nodes with available slots at or near the data.
4. The Job Tracker submits the work to the chosen Task Tracker nodes.
5. When the work is completed, the Job Tracker updates its status.
6. Client applications can poll the Job Tracker for information.
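Step 3 above, locating a Task Tracker "at or near the data", is a data-locality choice. A simplified sketch of that preference order (node-local, then rack-local, then any tracker with a free slot); the function name and data shapes are made up for illustration:

```python
def pick_tracker(block_node, trackers):
    """Pick a Task Tracker for a task whose input block lives on block_node.
    trackers maps tracker name -> (rack, free_slots).
    Preference: the node holding the block, then its rack, then anywhere."""
    node_rack = trackers[block_node][0] if block_node in trackers else None
    candidates = [t for t, (rack, slots) in trackers.items() if slots > 0]
    # 1. Node-local: the tracker on the node that already has the data.
    if block_node in candidates:
        return block_node
    # 2. Rack-local: any free tracker in the same rack as the data.
    for t in candidates:
        if trackers[t][0] == node_rack:
            return t
    # 3. Off-rack: any tracker with a free slot.
    return candidates[0] if candidates else None
```

Moving the computation to the data, rather than the data to the computation, is the design choice this ordering encodes.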
Rack Awareness Concept
Name Node creates additional copies of each block on different Data Node(s), based on the configured number of replicas. A replica can be placed on a Data Node in the same rack, on a Data Node in a different rack of the same cluster, or on a Data Node of a different cluster. This is decided based on the availability of resources in the nearest nodes.
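A simplified sketch of the default rack-aware placement for three replicas: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. Real HDFS also randomizes choices and checks free space; this deterministic toy keeps only the rack logic:

```python
def place_replicas(writer_node, nodes_by_rack, n_replicas=3):
    """Choose Data Nodes for a block's replicas, rack-aware.
    nodes_by_rack maps rack name -> list of node names in that rack."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    replicas = [writer_node]  # first replica: the writer's own node
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    if other_racks and n_replicas > 1:
        remote_nodes = nodes_by_rack[other_racks[0]]
        replicas.append(remote_nodes[0])          # second replica: different rack
        if n_replicas > 2 and len(remote_nodes) > 1:
            replicas.append(remote_nodes[1])      # third replica: same remote rack
    return replicas
```

Splitting replicas across two racks survives the loss of a whole rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.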
