Unit 1 - DA - Introduction To Data Science
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
Course themes: Business Data, Data Science, Data Analytics, Real-time Job Market Usability
CO # | CO | Unit #
CO1 | Make use of data science concepts to handle big data. | 1
CO2 | Examine the statistical concepts for finding relationships among variables and estimate the data samplings. | 2
CO3 | Select the data analytics techniques & models for both data prediction and performance analysis. | 3
CO4 | Develop rules using frequent itemsets and association mining. | 4
CO5 | Solve real-time problems using classification and clustering techniques. | 5
CO6 | Apply the mining techniques for data streams. | 6
Prerequisites
NIL
Textbook
Data Analytics, Radha Shankarmani, M. Vijayalaxmi, Wiley India Private Limited,
ISBN: 9788126560639.
Reference Books
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services (Editor), Wiley, 2014.
Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
Pete Warden, Big Data Glossary, O'Reilly, 2011.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, reprinted 2008.
Stephan Kudyba, Thomas H. Davenport, Big Data, Mining, and Analytics: Components of Strategic Decision Making, CRC Press, Taylor & Francis Group, 2014.
Big Data, Black Book, DT Editorial Services, Dreamtech Press, 2015.
Evaluation
Grading: ?
Data
Human-readable refers to information that only humans can interpret and study,
such as an image or the meaning of a block of text. If it requires a person to
interpret it, that information is human-readable.
Machine-readable refers to information that computer programs can process. A
program is a set of instructions for manipulating data. Such data can be
automatically read and processed by a computer, such as CSV, JSON, XML, etc.
Non-digital material (for example, printed or hand-written documents) is, by its non-digital nature, not machine-readable. But even digital material need not be machine-readable. For example, a PDF document containing tables of data is definitely digital but not machine-readable, because a computer would struggle to access the tabular information, even though it is very human-readable. The equivalent tables in a format such as a spreadsheet would be machine-readable. As another example, scans (photographs) of text are not machine-readable (but are human-readable!), whereas the equivalent text in a format such as a simple ASCII text file is machine-readable and processable.
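To make the contrast concrete, here is a minimal sketch (the file name sales.csv and its amount column are assumptions for the example): the same table stored as CSV can be processed directly by a program, whereas a scan of it cannot.

```python
import csv

# "sales.csv" and its "amount" column are assumed for this example.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))   # each row becomes a dict of column -> value

# Because the structure is explicit, a program can aggregate the data directly.
total = sum(float(row["amount"]) for row in rows)
print("total amount:", total)

# The same table saved only as a scanned photograph would be human-readable,
# but a program would need OCR before it could recover the values at all.
```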
Structured data is defined as data that has a well-defined, repeating pattern; this pattern makes it easier for any program to sort, read, and process the data.
This data is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Relationships exist between entities of data.
Structured data:
Organizes data in a pre-defined format.
Is stored in tabular form.
Resides in fixed fields within a record or file.
Is formatted data that has entities and their attributes mapped.
Is used to query and report against predetermined data types.
Sources of structured data: relational databases, multidimensional databases, legacy databases, flat files.
Ease with Structured Data

Semi-structured data
Sources: web data in the form of cookies, XML, JSON, and other markup languages.
Characteristics: inconsistent structure; self-describing (label/value pairs); other schema information is blended with the data values.
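Semi-structured formats such as JSON illustrate the self-describing property: each value carries its label, yet different records may have different fields. A minimal sketch with invented records:

```python
import json

# Invented records with inconsistent structure: the second has extra fields
# and is missing "phone". The labels travel with the values (self-describing).
records = json.loads("""
[
  {"name": "Asha", "phone": "555-0100"},
  {"name": "Ravi", "email": "ravi@example.com", "city": "Bhubaneswar"}
]
""")

for record in records:
    # .get() tolerates the inconsistent structure instead of raising an error.
    print(record["name"], record.get("phone", "no phone on record"))
```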
Unstructured data is a set of data that might or might not have any logical or repeating pattern and is not organized in a pre-defined manner.
About 80 percent of enterprise data consists of unstructured content.
Unstructured data:
Typically includes metadata, i.e., additional information related to the data.
Comprises inconsistent data, such as data obtained from files, social media websites, satellites, etc.
Consists of data in different formats such as e-mails, text, audio, video, or images.
Sources of unstructured data: body of e-mail; chats and text messages; text both internal and external to the organization; mobile data; social media data; images, audio, and video.
Challenges associated with Unstructured data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed.
Sorting, organizing, and arranging unstructured data in different sets and formats.
Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information.
Cost, in terms of storage space and the human resources needed to deal with the exponential growth of unstructured data.
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human language is quite different from the language used by machines, which prefer structured information. Unstructured data analysis refers to the process of analyzing data objects that do not follow a predefined data model and/or are unorganized. It is the analysis of any data that is stored over time within an organizational data repository without any intent for its orchestration, pattern, or categorization.
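One common way to make such data usable, echoing the "combining and linking unstructured data in a more structured format" challenge above, is to extract the fragments that do follow a pattern into structured fields. A minimal sketch with regular expressions (the e-mail text is invented):

```python
import re

# An unstructured e-mail body (invented sample).
email_body = """Hi team,
Customer Priya Sharma (order #48213) reported a late delivery on 2024-03-18.
Please refund INR 450 and update the ticket."""

# Pull out the fragments that do follow a recognisable pattern into fields.
extracted = {
    "order_id": re.search(r"#(\d+)", email_body).group(1),
    "date":     re.search(r"\d{4}-\d{2}-\d{2}", email_body).group(0),
    "amount":   re.search(r"INR\s+(\d+)", email_body).group(1),
}
print(extracted)  # {'order_id': '48213', 'date': '2024-03-18', 'amount': '450'}
```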
Data Science is collecting, analyzing and interpreting data to gather insights into
the data that can help decision-makers to make informed decisions.
What is Data Science used for?
Descriptive analysis (what has happened)
Diagnostic analysis (why it has happened)
Predictive analysis (what will happen)
Prescriptive analysis (what to do for a better future)
What is the Data Science process?
1. Obtaining the data (i.e., data identification for analysis)
2. Scrubbing the data (i.e., ensuring readable state)
3. Exploratory analysis (excellent attention to detail)
4. Modeling (algorithm to follow based on the data for analysis)
5. Interpreting the data (uncover findings and present to the
organisation)
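The five steps could be laid out as a skeleton analysis script. This is only a rough sketch: the file name raw_measurements.csv, the column name value, and the function bodies are assumptions for illustration, not a prescribed implementation.

```python
import csv
from statistics import mean

def obtain(path):
    # Step 1 - Obtain: identify and load the data (file name is assumed).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def scrub(rows):
    # Step 2 - Scrub: drop incomplete records so the data is in a usable state.
    return [r for r in rows if all(v.strip() for v in r.values())]

def explore(rows):
    # Step 3 - Explore: simple summaries to understand the data.
    print(len(rows), "clean rows; columns:", list(rows[0]) if rows else [])

def model(rows):
    # Step 4 - Model: placeholder for the chosen algorithm; here just the
    # mean of an assumed numeric column named "value".
    return mean(float(r["value"]) for r in rows)

def interpret(result):
    # Step 5 - Interpret: present the finding to the organisation.
    print("average value:", result)

if __name__ == "__main__":
    data = scrub(obtain("raw_measurements.csv"))
    explore(data)
    interpret(model(data))
```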
Think of the following progression: Structured Data, Semi-structured Data, Unstructured Data, Big Data; each step involves more data. Refer to the Appendix for examples of data volumes.
The main challenge for traditional computing systems in managing 'Big Data' is the immense speed and volume at which it is generated. Some of the challenges are:
The traditional approach cannot work on unstructured data efficiently.
The traditional approach is built on top of the relational data model; relationships between the subjects of interest are created inside the system and the analysis is done based on them. This approach is not adequate for big data.
The traditional approach is batch-oriented: one needs to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Traditional data management, warehousing, and analysis systems fail to analyze this type of data. Due to its complexity, big data is processed with parallelism. Parallelism in a traditional system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
Inadequate support for aggregated summaries of data.
Process challenges:
Capturing data
Aligning data from different sources
Transforming data into a form suitable for data analysis
Modeling data (mathematically, via simulation)
Management challenges:
Security
Privacy
Governance
Ethical issues
Analytics
Analytics is the process of extracting useful information by analysing different types of data sets. It is used to discover hidden patterns, outliers, trends, unknown correlations, and other useful information for the benefit of faster decision making.
There are 4 types of analytics:
Approach | Explanation
Descriptive | What's happening in my business? Comprehensive, accurate, and historical data; effective visualisation.
Diagnostic | Why is it happening? Ability to drill down to the root cause; ability to isolate all confounding information.
Predictive | What's likely to happen? Decisions are automated using algorithms and technology; historical patterns are used to predict specific outcomes using algorithms.
Prescriptive | What do I need to do for a better future? Recommended actions and strategies based on champion/challenger strategy outcomes; applying advanced analytical algorithms to make specific recommendations.
Evolution of Analytics Scalability
The essence of the traditional approach is to pull all the data together into a separate analytics environment to do the analysis: data from Database 1, Database 2, Database 3, ..., Database n is pulled into an analytic server or PC. The heavy processing occurs in the analytic environment, so the analytic server or PC does the heavy lifting.

In an in-database environment, data from the various databases is pulled and consolidated into one database, and the processing stays in the database where the data has been consolidated. The user's machine (the analytic server or PC) just submits the request; it doesn't do the heavy lifting.
Massively parallel processing (MPP) database systems are the most mature, proven, and widely deployed mechanism for storing and analyzing large amounts of data. An MPP database spreads data out into independent pieces managed by independent storage and central processing unit (CPU) resources. Conceptually, it is like having pieces of data loaded onto multiple network-connected personal computers around a house. The data in an MPP system gets split across a variety of disks managed by a variety of CPUs spread across a number of servers. For example, a one-terabyte table is split into 100-gigabyte chunks held on separate CPU and disk sets. An MPP system breaks the job into pieces and allows the different sets of CPU and disk to run the process concurrently, turning a single-threaded process into a parallel process.
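This divide-and-process idea can be sketched in ordinary Python: split the data into independent chunks and let separate worker processes handle them concurrently. It is a toy illustration of the concept, not how an MPP database is implemented internally.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker handles its own piece independently, like a CPU/disk pair in MPP.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the data into 10 equal pieces.
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool() as pool:
        partial_sums = pool.map(process_chunk, chunks)   # pieces run concurrently

    print(sum(partial_sums))   # combine the partial results into the final answer
```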
Types of Computing
Parallel computing
This is typically performed on a single computer.
It uses multiple processors at the same time to execute a single program.
This type of computing exhibits its full potential only when the program can be split into many pieces so that each processor can execute a portion.
It is useful for solving computing-intensive problems.
Diagram: users submit work to a front-end computer, which dispatches it to a back-end parallel system where processors P1, P2, P3 execute tasks T1, T2, T3.
Distributed computing
It is performed among many computers connected via a network, but they run as a single system.
The computers in a distributed system can be physically close together and connected by a local network, or they can be geographically distant and connected by a wide area network.
It makes a computer network appear as a powerful single computer that provides large-scale resources to deal with complex challenges.
Diagram: users at different locations connect over a network to machines hosting processors P1, P2, P3, each executing tasks T1, T2, T3.
1. Step 1 — Discovery: In this step, the team learns the business domain,
including relevant history such as whether the organization or business unit
has attempted similar projects in the past from which they can learn. The
team assesses the resources available to support the project in terms of
people, technology, time, and data. Important activities in this step include
framing the business problem as an analytics challenge that can be
addressed in subsequent phases and formulating initial hypotheses (IHs) to
test and begin learning the data.
2. Step 2— Data preparation: It requires the presence of an analytic sandbox,
in which the team can work with data and perform analytics for the
duration of the project. The team needs to execute extract, load, and
transform (ELT) or extract, transform and load (ETL) to get data into the
sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should
be transformed in the ETLT process so the team can work with it and
analyze it. In this step, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.
ETL vs. ELT
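A minimal sketch of the difference between the two orderings mentioned in Step 2. The rows and the clean-up rule are invented for illustration, and a real sandbox would be a database rather than an in-memory list.

```python
# Invented sample of rows extracted from a source system.
raw_rows = [{"amount": " 120 "}, {"amount": "45"}, {"amount": ""}]

def transform(rows):
    # Clean-up step: strip whitespace, drop empty values, convert to numbers.
    return [{"amount": float(r["amount"])} for r in rows if r["amount"].strip()]

# ETL: Extract -> Transform -> Load. The data is cleaned *before* it reaches
# the analytic sandbox, so only transformed rows are loaded.
etl_sandbox = transform(raw_rows)

# ELT: Extract -> Load -> Transform. The raw rows are loaded first and the
# transformation runs later, inside the sandbox/target environment.
elt_sandbox = list(raw_rows)          # load as-is
elt_sandbox = transform(elt_sandbox)  # transform where the data now lives

print(etl_sandbox, elt_sandbox)       # both end with the same cleaned rows
```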
Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage.
It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
Importance:
Ability to store and process huge amounts of any kind of data, quickly.
Computing power: Its distributed computing model processes big data fast.
Fault tolerance: Data and application processing are protected against
hardware failure.
Flexibility: Unlike traditional relational databases, data does not need to be preprocessed before storing it.
Low cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
Scalability: System can easily grow to handle more data simply by adding
nodes. Little administration is required.
Hadoop Ecosystem
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.
Example:
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different
servers.
Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records
to be processed, 100 mappers can run together to process one record each.
Or maybe 50 mappers can run together to process two records each. The
Hadoop framework decides how many mappers to use, based on the size of
the data to be processed and the memory block available on each mapper
server.
Working of MapReduce cont’d
Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
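A word count makes the two signatures above concrete. The sketch below imitates the whole flow in plain Python on an invented input; in Hadoop the map and reduce functions would run on separate mapper and reducer servers, and the shuffle-and-sort step would be performed by the framework rather than by a dictionary.

```python
from collections import defaultdict

def map_fn(key, value):
    # <k1, v1> -> list(<k2, v2>): emit (word, 1) for every word in one line.
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    # <k2, list(v2)> -> list(<k3, v3>): aggregate all the counts for one word.
    return [(key, sum(values))]

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]  # invented input

# Map phase: each line goes to a mapper.
intermediate = []
for offset, line in enumerate(lines):
    intermediate.extend(map_fn(offset, line))

# Shuffle and sort: group all intermediate values that share the same key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: one reducer call per key.
result = []
for word in sorted(grouped):
    result.extend(reduce_fn(word, grouped[word]))

print(result)  # [('brown', 1), ('dog', 2), ('fox', 1), ('lazy', 1), ('quick', 2), ('the', 3)]
```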
Class Exercise 1
Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat

Class Exercise 2
Draw the MapReduce process to find the maximum electrical consumption for each year (input: a table of Year and consumption readings).
Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information from
noisy unstructured text data.
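As a very small taste of the theme-finding described under Text Analytics, the sketch below counts which words recur across invented customer comments; real text analytics would add stop-word removal, stemming, and similar steps.

```python
from collections import Counter
import re

# Invented customer comments standing in for survey free-text responses.
comments = [
    "Delivery was late and the packaging was damaged.",
    "Great product but delivery took too long.",
    "Packaging could be better; the product itself is great.",
]

words = []
for comment in comments:
    words.extend(re.findall(r"[a-z']+", comment.lower()))  # crude tokenisation

# The most frequent words hint at recurring themes in the feedback.
print(Counter(words).most_common(5))
```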
Appendix cont'd: Example of Data Volumes
1. Outline the main tasks and activities to be performed for each stage of the data analytics life cycle for the following:
A small stock trading organization wants to build a Stock Performance System. You have been tasked to create a data model to predict good and bad stocks based on their history. You also have to build a customized product to handle complex queries, such as calculating the covariance between the stocks for each month.
A mobile health organization captures patients' physical activities by attaching various sensors to different body parts. These sensors measure properties of the motion of diverse body parts, such as acceleration, the rate of turn, magnetic field orientation, etc. You have to build a data model for effectively deriving information about the motion of different body parts like the chest, ankle, etc.
A new airline company wants to start their business efficiently. They are trying to figure out the possible market and their competitors. You have been tasked to analyse and find the most active airports with the maximum number of flyers. You also have to analyse the most popular sources and destinations, along with the airline companies operating between them.
A finance company wants to evaluate their users on the basis of the loans they have taken. They have hired you to find the number of cases per location and categorize the count with respect to the reason for taking a loan. Next, they have also tasked you to display their average risk score.
A new company in the Media and Entertainment domain wants to outsource movie ratings and reviews. They want to know the frequent users who give reviews and ratings consistently for most of the movies. You have to analyze different users based on which user has rated the most movies, their occupations, and their age group.
Analyze the Aadhaar card data set against different research queries, for example: total number of Aadhaar cards approved by state, rejected by state, total number of Aadhaar card applicants by gender, and total number of Aadhaar card applicants by age type, with visual depiction.
A salesperson may manage many other salespeople. A salesperson is managed by only one salesperson. A salesperson can be an agent for many customers. A customer is managed by one salesperson. A customer can place many orders. An order can be placed by one customer. An order lists many inventory items. An inventory item may be listed on many orders. An inventory item is assembled from many parts. A part may be assembled into many inventory items. Many employees assemble an inventory item from many parts. A supplier supplies many parts. A part may be supplied by many suppliers.
2. Consider the following sample data. Draw the MapReduce process to find
the number of customers from each city.
3. Consider the following sample data. Draw the MapReduce process to find
the number of employees from each category of marital status.