Posts

Data Analytics Process

A Data Analytics team includes Data Scientists, Data Analysts and Data Engineers, each involved in one or more steps of the overall process. The Data Analytics process is divided into the following steps:

- Data Collection
- Data Integration
- Data Preparation
- Analytical Model Development
- Analytical Model Testing
- Analytical Model Revision
- Data Reporting / Visualization

Data Collection: Data Scientists identify the information they need for a particular analytics application and then work on their own, or with data engineers and IT staff, to assemble it for use.

Data Integration: Data from different source systems may need to be combined via data integration routines, transformed into a common format and loaded into an analytics system such as a Hadoop cluster, a NoSQL database or a data warehouse.

Data Preparation: Once the required data is in place, the next step is to find and fix data quality problems that could affect the accuracy of analytics …
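The steps above can be sketched as a simple pipeline. This is a minimal illustrative Python sketch, not a real analytics system; every function and the sample records are hypothetical stand-ins for the corresponding stage.

```python
# Minimal sketch of the Data Analytics process as a pipeline of stages.
# All names and data here are hypothetical; each function stands in for a real step.

def collect():
    # Data Collection: assemble raw records from (pretend) source systems.
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}, {"id": 3, "amount": "7"}]

def integrate(records):
    # Data Integration: normalize records into a common format.
    return [{"id": r["id"], "amount": r["amount"]} for r in records]

def prepare(records):
    # Data Preparation: fix quality problems (here: drop rows with missing values,
    # and convert amounts to numbers).
    return [dict(r, amount=float(r["amount"])) for r in records if r["amount"] is not None]

def model(records):
    # Analytical Model Development: a trivial "model" -- the mean amount.
    amounts = [r["amount"] for r in records]
    return sum(amounts) / len(amounts)

def report(result):
    # Data Reporting / Visualization: present the result.
    return f"mean amount = {result:.2f}"

print(report(model(prepare(integrate(collect())))))  # mean of 10.5 and 7.0
```

Each stage hands its output to the next, mirroring how the steps above feed into one another.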

What is Data Analytics?

Data Analytics is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software. Data analytics technologies and techniques are widely used in commercial industries to make more informed business decisions, and by scientists and researchers to verify or disprove scientific models, theories and hypotheses. Data Analytics refers to an assortment of applications, from basic business intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced analytics. In that sense, it is similar to business analytics, another umbrella term for approaches to analyzing data - with the difference that the latter is oriented to business uses, while data analytics has a broader focus.

Advantages of Data Analytics

Data Analytics boosts business performance by:
- Increasing business revenues
- Improving operational efficiency
- Optimizing marketing …

Hadoop Architecture version 1.x

Hadoop Architecture is split into 3 categories:

- Edge Node
- Name Node and Job Tracker Node
- Data Nodes and Task Tracker Nodes

Client/Edge Node: It is the interface between the Hadoop cluster and the outside network; for this reason, edge nodes are sometimes referred to as gateway nodes. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The Name Node responds to successful requests by returning a list of the relevant Data Node servers where the data lives.

Usage: Edge nodes are often used as staging areas for data being transferred into the Hadoop cluster. Most commonly, they are used to run client applications and cluster administration tools; tools like Oozie, Pig and Sqoop, and management tools such as Hue and Ambari, run well there. The client then talks directly to the Data Nodes based on the information returned by the Name Node.

Master/Slave: Here, we have a concept of Master and Slave, where the Name No…
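The locate-then-read flow described above (client asks the Name Node for locations, then fetches data directly from Data Nodes) can be sketched as a toy simulation. This is a hedged illustration of the control flow only, not the real HDFS protocol; all classes, block IDs and the file path are invented for the example.

```python
# Toy simulation of an HDFS-style read: the client asks the Name Node where a
# file's blocks live, then reads each block directly from the right Data Node.
# All classes and data here are invented for illustration.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes of that block

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.metadata = {}        # file path -> list of (block_id, DataNode)

    def add_file(self, path, placements):
        self.metadata[path] = placements

    def locate(self, path):
        # Respond to a successful request with the list of relevant Data Nodes.
        return self.metadata[path]

def client_read(name_node, path):
    # The client contacts the Name Node once, then talks to Data Nodes directly.
    data = b""
    for block_id, data_node in name_node.locate(path):
        data += data_node.read_block(block_id)
    return data

# Two Data Nodes each hold one block of a single (hypothetical) file.
dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hadoop"

nn = NameNode()
nn.add_file("/user/demo/file.txt", [("blk_1", dn1), ("blk_2", dn2)])

print(client_read(nn, "/user/demo/file.txt"))
```

Note that no file bytes ever pass through the Name Node: it serves only metadata, which is exactly why the client talks to the Data Nodes directly.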

Distributed Computing

What is meant by Distributed Computing? Distributed Computing refers to the mechanism that deals with both the storage and the processing of data in a distributed environment called a cluster. In Hadoop, distributed computing is done in the following manner:

- Storage of Data: HDFS / S3 / Azure
- Processing of Data: MapReduce / Spark / Tez
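The processing side of this split is easiest to see through the classic MapReduce word count, sketched here in plain Python. This is a sketch of the programming model only, with no Hadoop involved; the map/shuffle/reduce function names mirror the model, not any real API.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums each group. Pure-Python sketch.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Spark and Hadoop process data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"], counts["data"])  # -> 2 2
```

In a real cluster the map and reduce functions run in parallel on the nodes that store the data; the logic, however, is exactly this.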

Categories of Data in Big Data Technology

In Big Data, we have 3 different types of data to deal with.

Structured Data: Any data which has a definite, strict schema associated with it is called Structured Data. For example: data in tabular form, such as a CSV file or a relational table.

Semi-Structured Data: Any data which does not have a strict schema associated with it, but from which we can still make some sense, is called Semi-Structured Data. For example: XML and JSON files.

Unstructured Data: Any data which does not have any schema associated with it is called Unstructured Data. For example: tweets, audio, video etc.

Tools, based on the categories of data they handle:
- Structured Data: Hive, Pig and Sqoop
- Semi-Structured Data: Hive, Pig
- Unstructured Data: Pig
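The difference between structured and semi-structured data can be seen by parsing a sample of each with the Python standard library. The sample records below are made up for the illustration.

```python
# Structured data: CSV with a fixed schema -- every row has the same columns.
# Semi-structured data: JSON -- field names give meaning, but fields may vary
# from record to record.
import csv, io, json

csv_text = "id,name\n1,Alice\n2,Bob\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

json_text = '[{"id": 1, "name": "Alice"}, {"id": 2, "tags": ["admin"]}]'
records = json.loads(json_text)

print(rows[0]["name"])          # CSV: the schema guarantees a "name" column
print(records[1].get("name"))   # JSON: the field may simply be absent -> None
```

Unstructured data (a tweet's free text, audio, video) has no such field structure at all, which is why only schema-light tools like Pig are listed for it above.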

Hadoop Components

Components of Hadoop

File System:
- HDFS (primary data storage system)
- Amazon S3
- Azure

Execution Engines:
- MapReduce: available since Hadoop version 1
- Apache Spark: available since Hadoop version 2 (part of both the Hortonworks and Cloudera distributions)
- Apache Tez: available since Hadoop version 2 (part of the Hortonworks distribution but not Cloudera)

Ecosystem Tools:
- Hive: data-warehousing tool on HDFS (used primarily for Structured Data and, to some extent, Semi-Structured Data)
- Pig: ETL tool where the source is another system and the target is HDFS (used for Structured, Semi-Structured and Unstructured Data)
- Sqoop: tool for importing/exporting data to and from relational databases like Oracle (used for Structured Data)
- Oozie: tool for monitoring workflows and scheduling jobs
- Zookeeper: configuration manager

How to calculate Timestamps in Oracle Database

As we all know, the current time is obtained using the current_timestamp keyword in Oracle. We have often observed that when we try to do arithmetic on current_timestamp by adding or subtracting a number, e.g. current_timestamp + 20, it may work if we are lucky, but in most cases it throws "SQL Error: ORA-00911: invalid character". The reason is that we are trying to calculate a timestamp without using its appropriate interval format. The ways in which timestamps should be calculated are given below.

To obtain the current time, we use the following command:

SQL> select current_timestamp from dual;

CURRENT_TIMESTAMP
--------------------------------------
27-APR-11 03.08.29.433503000 PM EUROPE/LONDON

1 row selected.

To adjust the current time by a number of days, we use the following commands:

1. Retrieve the number of days to be adjusted:

SQL> select NUMTODSINTERVAL(-1, 'DAY') FROM dual;

NUMTODSINTERVAL(-1,'D…
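For comparison, the same day-level adjustment in Python uses datetime.timedelta, which plays the role that an interval such as NUMTODSINTERVAL(-1, 'DAY') plays in Oracle. This is a cross-language illustration only, not Oracle code; the example instant is fixed rather than taken from the current clock.

```python
# Adjusting a timestamp by a number of days: the Python analogue of
# current_timestamp + NUMTODSINTERVAL(-1, 'DAY') in Oracle.
from datetime import datetime, timedelta

now = datetime(2011, 4, 27, 15, 8, 29)   # fixed example instant (see output above)
yesterday = now + timedelta(days=-1)     # an interval of -1 day
print(yesterday)                         # -> 2011-04-26 15:08:29
```

The point in both languages is the same: a timestamp is adjusted by adding a typed interval to it, not a bare number.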