Posts

Data Analytics Process

A Data Analytics team includes Data Scientists, Data Analysts and Data Engineers, each involved in one or more steps of the overall process. The Data Analytics process is divided into the following steps:

- Data Collection
- Data Integration
- Data Preparation
- Analytical Model Development
- Analytical Model Testing
- Analytical Model Revision
- Data Reporting / Visualization

Data Collection: Data Scientists identify the information they need for a particular analytics application and then work on their own, or with data engineers and IT staff, to assemble it for use.

Data Integration: Data from different source systems may need to be combined via data integration routines, transformed into a common format and loaded into an analytics system such as a Hadoop cluster, a NoSQL database or a data warehouse.

Data Preparation: Once the required data is in place, the next step is to find and fix data quality problems that could affect the accuracy of analytics …
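The steps above can be sketched as a simple pipeline. This is a minimal illustrative Python sketch, not a real analytics system; every function and the sample records are hypothetical stand-ins for the corresponding stage.

```python
# Minimal sketch of the Data Analytics process as a pipeline of stages.
# All names and data here are hypothetical; each function stands in for a real step.

def collect():
    # Data Collection: assemble raw records from (pretend) source systems.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}, {"id": 3, "amount": "7"}]

def integrate(records):
    # Data Integration: normalize records into a common format.
    return [{"id": r["id"], "amount": r["amount"]} for r in records]

def prepare(records):
    # Data Preparation: fix quality problems (here: drop rows with missing values,
    # and convert amounts to numbers).
    return [dict(r, amount=float(r["amount"])) for r in records if r["amount"] is not None]

def model(records):
    # Analytical Model Development: a trivial "model" -- the mean amount.
    amounts = [r["amount"] for r in records]
    return sum(amounts) / len(amounts)

def report(result):
    # Data Reporting / Visualization: present the result.
    return f"mean amount = {result:.2f}"

print(report(model(prepare(integrate(collect())))))  # mean of 10.5 and 7.0
```

Each stage hands its output to the next, mirroring how the steps above feed into one another.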

What is Data Analytics?

Data Analytics is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software. Data analytics technologies and techniques are widely used in commercial industries to make more informed business decisions, and by scientists and researchers to verify or disprove scientific models, theories and hypotheses. Data Analytics refers to an assortment of applications, from basic business intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced analytics. In that sense, it is similar to business analytics, another umbrella term for approaches to analyzing data - with the difference that the latter is oriented to business uses, while data analytics has a broader focus.

Advantages of Data Analytics

Data Analytics boosts business performance by:
- Increasing business revenues
- Improving operational efficiency
- Optimizing marketing …

Hadoop Architecture version 1.x

Hadoop Architecture is split into 3 categories:

- Edge Node
- Name Node and Job Tracker Node
- Data Nodes and Task Tracker Nodes

Client/Edge Node: It is the interface between the Hadoop cluster and the outside network; for this reason, edge nodes are sometimes referred to as gateway nodes. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The Name Node responds to successful requests by returning a list of the relevant Data Node servers where the data lives.

Usage: Edge nodes are often used as staging areas for data being transferred into the Hadoop cluster. Most commonly, they are used to run client applications and cluster administration tools; tools like Oozie, Pig and Sqoop, and management tools such as Hue and Ambari, run well there. The client then talks directly to the Data Nodes based on the information returned by the Name Node.

Master/Slave: Here, we have a concept of Master and Slave, where the Name No…
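The locate-then-read flow described above (client asks the Name Node for locations, then fetches data directly from Data Nodes) can be sketched as a toy simulation. This is a hedged illustration of the control flow only, not the real HDFS protocol; all classes, block IDs and the file path are invented for the example.

```python
# Toy simulation of an HDFS-style read: the client asks the Name Node where a
# file's blocks live, then reads each block directly from the right Data Node.
# All classes and data here are invented for illustration.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes of that block

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.metadata = {}        # file path -> list of (block_id, DataNode)

    def add_file(self, path, placements):
        self.metadata[path] = placements

    def locate(self, path):
        # Respond to a successful request with the list of relevant Data Nodes.
        return self.metadata[path]

def client_read(name_node, path):
    # The client contacts the Name Node once, then talks to Data Nodes directly.
    data = b""
    for block_id, data_node in name_node.locate(path):
        data += data_node.read_block(block_id)
    return data

# Two Data Nodes each hold one block of a single (hypothetical) file.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hadoop"

nn = NameNode()
nn.add_file("/user/demo/file.txt", [("blk_1", dn1), ("blk_2", dn2)])

print(client_read(nn, "/user/demo/file.txt"))
```

Note that no file bytes ever pass through the Name Node: it serves only metadata, which is exactly why the client talks to the Data Nodes directly.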

Distributed Computing

What is meant by Distributed Computing? Distributed Computing refers to the mechanism that deals with both the storage and the processing of data in a distributed environment called a cluster. In Hadoop, distributed computing is done in the following manner:

- Storage of Data: HDFS / S3 / Azure
- Processing of Data: MapReduce / Spark / Tez
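The processing side of this split is easiest to see through the classic MapReduce word count, sketched here in plain Python. This is a sketch of the programming model only, with no Hadoop involved; the map/shuffle/reduce function names mirror the model, not any real API.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums each group. Pure-Python sketch.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Spark and Hadoop process data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"], counts["data"])  # -> 2 2
```

In a real cluster the map and reduce functions run in parallel on the nodes that store the data; the logic, however, is exactly this.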

Categories of Data in Big Data Technology

In Big Data, we have 3 different types of data to deal with.

Structured Data: Any data which has a definite, strict schema associated with it is called Structured Data. For example: data in tabular form, such as a CSV file or a relational table.

Semi-Structured Data: Any data which does not have a strict schema associated with it, but from which we can still make some sense, is called Semi-Structured Data. For example: XML and JSON files.

Unstructured Data: Any data which does not have any schema associated with it is called Unstructured Data. For example: tweets, audio, video etc.

Tools, based on the categories of data they handle:
- Structured Data: Hive, Pig and Sqoop
- Semi-Structured Data: Hive, Pig
- Unstructured Data: Pig
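The difference between structured and semi-structured data can be seen by parsing a sample of each with the Python standard library. The sample records below are made up for the illustration.

```python
# Structured data: CSV with a fixed schema -- every row has the same columns.
# Semi-structured data: JSON -- field names give meaning, but fields may vary
# from record to record.
import csv, io, json

csv_text = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

json_text = '[{"id": 1, "name": "Alice"}, {"id": 2, "tags": ["admin"]}]'
records = json.loads(json_text)

print(rows[0]["name"])          # CSV: the schema guarantees a "name" column
print(records[1].get("name"))   # JSON: the field may simply be absent -> None
```

Unstructured data (a tweet's free text, audio, video) has no such field structure at all, which is why only schema-light tools like Pig are listed for it above.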

Hadoop Components

Components of Hadoop

File System:
- HDFS (primary data storage system)
- Amazon S3
- Azure

Execution Engines:
- MapReduce: available since Hadoop version 1
- Apache Spark: available since Hadoop version 2 (part of both the Hortonworks and Cloudera distributions)
- Apache Tez: available since Hadoop version 2 (part of the Hortonworks distribution but not Cloudera)

Ecosystem Tools:
- Hive: data-warehousing tool on HDFS (used primarily for Structured Data and, to some extent, Semi-Structured Data)
- Pig: ETL tool where the source is another system and the target is HDFS (used for Structured, Semi-Structured and Unstructured Data)
- Sqoop: tool for importing/exporting data to and from relational databases like Oracle (used for Structured Data)
- Oozie: tool for monitoring workflows and scheduling jobs
- Zookeeper: configuration manager

How to calculate Timestamps in Oracle Database

As we all know, the current time is obtained using the current_timestamp keyword in Oracle. We have often observed that when we try to do arithmetic on current_timestamp by adding or subtracting a number, e.g. current_timestamp + 20, it may work if we are lucky, but in most cases it throws "SQL Error: ORA-00911: invalid character". The reason is that we are trying to calculate a timestamp without using its appropriate interval format. The ways in which timestamps should be calculated are given below.

To obtain the current time, we use the following command:

SQL> select current_timestamp from dual;

CURRENT_TIMESTAMP
--------------------------------------
27-APR-11 03.08.29.433503000 PM EUROPE/LONDON

1 row selected.

To adjust the current time by a number of days, we use the following commands:

1. Retrieve the number of days to be adjusted:

SQL> select NUMTODSINTERVAL(-1, 'DAY') FROM dual;

NUMTODSINTERVAL(-1,'D…
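For comparison, the same day-level adjustment in Python uses datetime.timedelta, which plays the role that an interval such as NUMTODSINTERVAL(-1, 'DAY') plays in Oracle. This is a cross-language illustration only, not Oracle code; the example instant is fixed rather than taken from the current clock.

```python
# Adjusting a timestamp by a number of days: the Python analogue of
# current_timestamp + NUMTODSINTERVAL(-1, 'DAY') in Oracle.
from datetime import datetime, timedelta

now = datetime(2011, 4, 27, 15, 8, 29)   # fixed example instant (see output above)
yesterday = now + timedelta(days=-1)     # an interval of -1 day
print(yesterday)                         # -> 2011-04-26 15:08:29
```

The point in both languages is the same: a timestamp is adjusted by adding a typed interval to it, not a bare number.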