Unit 1 - DA - Introduction To Data Science
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
Course themes: Business Data, Data Science, Data Analytics, Real-time Job Market Usability
CO # | CO | Unit #
CO1 | Make use of data science concepts to handle big data. | 1
CO2 | Examine the statistical concepts for finding relationships among variables and estimate the data samplings. | 2
CO3 | Select the data analytics techniques & models for both data prediction and performance analysis. | 3
CO4 | Develop rules using frequent itemsets and association mining. | 4
CO5 | Solve real-time problems using classification and clustering techniques. | 5
CO6 | Apply the mining techniques for data streams. | 6
Prerequisites
NIL
Textbook
Data Analytics, Radha Shankarmani, M. Vijayalaxmi, Wiley India Private Limited,
ISBN: 9788126560639.
Reference Books
Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, EMC Education Services (Editor), Wiley, 2014.
Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
Pete Warden, Big Data Glossary, O'Reilly, 2011.
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Elsevier, reprinted 2008.
Stephan Kudyba, Thomas H. Davenport, Big Data, Mining, and Analytics: Components of Strategic Decision Making, CRC Press, Taylor & Francis Group, 2014.
Big Data, Black Book, DT Editorial Services, Dreamtech Press, 2015.
Evaluation
Grading: ?
Data
Human-readable refers to information that only humans can interpret and study,
such as an image or the meaning of a block of text. If it requires a person to
interpret it, that information is human-readable.
Machine-readable refers to information that computer programs can process. A
program is a set of instructions for manipulating data. Such data can be
automatically read and processed by a computer, such as CSV, JSON, XML, etc.
Non-digital material (for example, printed or hand-written documents) is, by its non-digital nature, not machine-readable. But even digital material need not be machine-readable. For example, a PDF document containing tables of data is definitely digital but not machine-readable, because a computer would struggle to access the tabular information, even though it is very human-readable. The equivalent tables in a format such as a spreadsheet would be machine-readable. As another example, scans (photographs) of text are not machine-readable (but are human-readable!), whereas the equivalent text in a format such as a simple ASCII text file is machine-readable and processable.
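To make the contrast concrete, here is a minimal sketch (the file name sales.csv and its amount column are assumptions for the example): the same table stored as CSV can be processed directly by a program, whereas a scan of it cannot.

```python
import csv

# "sales.csv" and its "amount" column are assumed for this example.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))   # each row becomes a dict of column -> value

# Because the structure is explicit, a program can aggregate the data directly.
total = sum(float(row["amount"]) for row in rows)
print("total amount:", total)

# The same table saved only as a scanned photograph would be human-readable,
# but a program would need OCR before it could recover the values at all.
```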
Structured data is defined as data that has a well-defined, repeating pattern; this pattern makes it easier for any program to sort, read, and process the data.
This data is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Relationships exist between entities of data.
Structured data:
Organizes data in a pre-defined format.
Is stored in tabular form.
Resides in fixed fields within a record or file.
Is formatted data that has entities and their attributes mapped.
Is used to query and report against predetermined data types.
Sources of structured data: relational databases, multidimensional databases, legacy databases, flat files.
Ease with Structured Data

Semi-structured data
Sources: web data in the form of cookies, XML, JSON, and other markup languages.
Characteristics: inconsistent structure; self-describing (label/value pairs); other schema information is blended with the data values.
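Semi-structured formats such as JSON illustrate the self-describing property: each value carries its label, yet different records may have different fields. A minimal sketch with invented records:

```python
import json

# Invented records with inconsistent structure: the second has extra fields
# and is missing "phone". The labels travel with the values (self-describing).
records = json.loads("""
[
  {"name": "Asha", "phone": "555-0100"},
  {"name": "Ravi", "email": "ravi@example.com", "city": "Bhubaneswar"}
]
""")

for record in records:
    # .get() tolerates the inconsistent structure instead of raising an error.
    print(record["name"], record.get("phone", "no phone on record"))
```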
Unstructured data is a set of data that might or might not have any logical or repeating pattern and is not organized in a pre-defined manner.
About 80 percent of enterprise data consists of unstructured content.
Unstructured data:
Typically includes metadata, i.e., additional information related to the data.
Comprises inconsistent data, such as data obtained from files, social media websites, satellites, etc.
Consists of data in different formats such as e-mails, text, audio, video, or images.
Sources of unstructured data: body of e-mail; chats and text messages; text both internal and external to the organization; mobile data; social media data; images, audio, and video.
Challenges associated with Unstructured data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed.
Sorting, organizing, and arranging unstructured data in different sets and formats.
Combining and linking unstructured data in a more structured format to derive any logical conclusions out of the available information.
Cost, in terms of storage space and the human resources needed to deal with the exponential growth of unstructured data.
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human language is quite different from the language used by machines, which prefer structured information. Unstructured data analysis refers to the process of analyzing data objects that do not follow a predefined data model and/or are unorganized. It is the analysis of any data that is stored over time within an organizational data repository without any intent for its orchestration, pattern, or categorization.
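One common way to make such data usable, echoing the "combining and linking unstructured data in a more structured format" challenge above, is to extract the fragments that do follow a pattern into structured fields. A minimal sketch with regular expressions (the e-mail text is invented):

```python
import re

# An unstructured e-mail body (invented sample).
email_body = """Hi team,
Customer Priya Sharma (order #48213) reported a late delivery on 2024-03-18.
Please refund INR 450 and update the ticket."""

# Pull out the fragments that do follow a recognisable pattern into fields.
extracted = {
    "order_id": re.search(r"#(\d+)", email_body).group(1),
    "date":     re.search(r"\d{4}-\d{2}-\d{2}", email_body).group(0),
    "amount":   re.search(r"INR\s+(\d+)", email_body).group(1),
}
print(extracted)  # {'order_id': '48213', 'date': '2024-03-18', 'amount': '450'}
```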
Data Science is collecting, analyzing and interpreting data to gather insights into
the data that can help decision-makers to make informed decisions.
What is Data Science used for?
Descriptive analysis (what has happened)
Diagnostic analysis (why it has happened)
Predictive analysis (what will happen)
Prescriptive analysis (what to do for a better future)
What is the Data Science process?
1. Obtaining the data (i.e., data identification for analysis)
2. Scrubbing the data (i.e., ensuring readable state)
3. Exploratory analysis (excellent attention to detail)
4. Modeling (algorithm to follow based on the data for analysis)
5. Interpreting the data (uncover findings and present to the
organisation)
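The five steps could be laid out as a skeleton analysis script. This is only a rough sketch: the file name raw_measurements.csv, the column name value, and the function bodies are assumptions for illustration, not a prescribed implementation.

```python
import csv
from statistics import mean

def obtain(path):
    # Step 1 - Obtain: identify and load the data (file name is assumed).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def scrub(rows):
    # Step 2 - Scrub: drop incomplete records so the data is in a usable state.
    return [r for r in rows if all(v.strip() for v in r.values())]

def explore(rows):
    # Step 3 - Explore: simple summaries to understand the data.
    print(len(rows), "clean rows; columns:", list(rows[0]) if rows else [])

def model(rows):
    # Step 4 - Model: placeholder for the chosen algorithm; here just the
    # mean of an assumed numeric column named "value".
    return mean(float(r["value"]) for r in rows)

def interpret(result):
    # Step 5 - Interpret: present the finding to the organisation.
    print("average value:", result)

if __name__ == "__main__":
    data = scrub(obtain("raw_measurements.csv"))
    explore(data)
    interpret(model(data))
```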
Think of the following progression: Structured Data, Semi-structured Data, Unstructured Data, Big Data; each step involves more data. Refer to the Appendix for examples of data volumes.
The main challenge for traditional computing systems in managing 'Big Data' is the immense speed and volume at which it is generated. Some of the challenges are:
The traditional approach cannot work on unstructured data efficiently.
The traditional approach is built on top of the relational data model; relationships between the subjects of interest are created inside the system and the analysis is done based on them. This approach is not adequate for big data.
The traditional approach is batch-oriented: one needs to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Traditional data management, warehousing, and analysis systems fail to analyze this type of data. Due to its complexity, big data is processed with parallelism. Parallelism in a traditional system is achieved through costly hardware like MPP (Massively Parallel Processing) systems.
Inadequate support for aggregated summaries of data.
Process challenges:
Capturing data
Aligning data from different sources
Transforming data into a form suitable for data analysis
Modeling data (mathematically, via simulation)
Management challenges:
Security
Privacy
Governance
Ethical issues
Analytics
Analytics is the process of extracting useful information by analysing different types of data sets. It is used to discover hidden patterns, outliers, trends, unknown correlations, and other useful information for the benefit of faster decision making.
There are 4 types of analytics:
Approach | Explanation
Descriptive | What's happening in my business? Comprehensive, accurate, and historical data; effective visualisation.
Diagnostic | Why is it happening? Ability to drill down to the root cause; ability to isolate all confounding information.
Predictive | What's likely to happen? Decisions are automated using algorithms and technology; historical patterns are used to predict specific outcomes using algorithms.
Prescriptive | What do I need to do for a better future? Recommended actions and strategies based on champion/challenger strategy outcomes; applying advanced analytical algorithms to make specific recommendations.
Evolution of Analytics Scalability
The essence of the traditional approach is to pull all the data together into a separate analytics environment to do the analysis: data from Database 1, Database 2, Database 3, ..., Database n is pulled into an analytic server or PC. The heavy processing occurs in the analytic environment, so the analytic server or PC does the heavy lifting.

In an in-database environment, data from the various databases is pulled and consolidated into one database, and the processing stays in the database where the data has been consolidated. The user's machine (the analytic server or PC) just submits the request; it doesn't do the heavy lifting.
Massively parallel processing (MPP) database systems are the most mature, proven, and widely deployed mechanism for storing and analyzing large amounts of data. An MPP database spreads data out into independent pieces managed by independent storage and central processing unit (CPU) resources. Conceptually, it is like having pieces of data loaded onto multiple network-connected personal computers around a house. The data in an MPP system gets split across a variety of disks managed by a variety of CPUs spread across a number of servers. For example, a one-terabyte table is split into 100-gigabyte chunks held on separate CPU and disk sets. An MPP system breaks the job into pieces and allows the different sets of CPU and disk to run the process concurrently, turning a single-threaded process into a parallel process.
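This divide-and-process idea can be sketched in ordinary Python: split the data into independent chunks and let separate worker processes handle them concurrently. It is a toy illustration of the concept, not how an MPP database is implemented internally.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker handles its own piece independently, like a CPU/disk pair in MPP.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the data into 10 equal pieces.
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool() as pool:
        partial_sums = pool.map(process_chunk, chunks)   # pieces run concurrently

    print(sum(partial_sums))   # combine the partial results into the final answer
```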
Types of Computing
Parallel computing
This is typically performed on a single computer.
It uses multiple processors at the same time to execute a single program.
This type of computing exhibits its full potential only when the program can be split into many pieces so that each processor can execute a portion.
It is useful for solving computing-intensive problems.
Diagram: users submit work to a front-end computer, which dispatches it to a back-end parallel system where processors P1, P2, P3 execute tasks T1, T2, T3.
Distributed computing
It is performed among many computers connected via a network, but they run as a single system.
The computers in a distributed system can be physically close together and connected by a local network, or they can be geographically distant and connected by a wide area network.
It makes a computer network appear as a powerful single computer that provides large-scale resources to deal with complex challenges.
Diagram: users at different locations connect over a network to machines hosting processors P1, P2, P3, each executing tasks T1, T2, T3.
1. Step 1 — Discovery: In this step, the team learns the business domain,
including relevant history such as whether the organization or business unit
has attempted similar projects in the past from which they can learn. The
team assesses the resources available to support the project in terms of
people, technology, time, and data. Important activities in this step include
framing the business problem as an analytics challenge that can be
addressed in subsequent phases and formulating initial hypotheses (IHs) to
test and begin learning the data.
2. Step 2— Data preparation: It requires the presence of an analytic sandbox,
in which the team can work with data and perform analytics for the
duration of the project. The team needs to execute extract, load, and
transform (ELT) or extract, transform and load (ETL) to get data into the
sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should
be transformed in the ETLT process so the team can work with it and
analyze it. In this step, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.
ETL vs. ELT
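A minimal sketch of the difference between the two orderings mentioned in Step 2. The rows and the clean-up rule are invented for illustration, and a real sandbox would be a database rather than an in-memory list.

```python
# Invented sample of rows extracted from a source system.
raw_rows = [{"amount": " 120 "}, {"amount": "45"}, {"amount": ""}]

def transform(rows):
    # Clean-up step: strip whitespace, drop empty values, convert to numbers.
    return [{"amount": float(r["amount"])} for r in rows if r["amount"].strip()]

# ETL: Extract -> Transform -> Load. The data is cleaned *before* it reaches
# the analytic sandbox, so only transformed rows are loaded.
etl_sandbox = transform(raw_rows)

# ELT: Extract -> Load -> Transform. The raw rows are loaded first and the
# transformation runs later, inside the sandbox/target environment.
elt_sandbox = list(raw_rows)          # load as-is
elt_sandbox = transform(elt_sandbox)  # transform where the data now lives

print(etl_sandbox, elt_sandbox)       # both end with the same cleaned rows
```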
Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage.
It provides massive storage for any kind of data, enormous processing power and
the ability to handle virtually limitless concurrent tasks or jobs.
Importance:
Ability to store and process huge amounts of any kind of data, quickly.
Computing power: Its distributed computing model processes big data fast.
Fault tolerance: Data and application processing are protected against
hardware failure.
Flexibility: Unlike traditional relational databases, data does not need to be preprocessed before storing it.
Low cost: The open-source framework is free and uses commodity hardware to
store large quantities of data.
Scalability: System can easily grow to handle more data simply by adding
nodes. Little administration is required.
Hadoop Ecosystem
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.
Example:
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different
servers.
Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records
to be processed, 100 mappers can run together to process one record each.
Or maybe 50 mappers can run together to process two records each. The
Hadoop framework decides how many mappers to use, based on the size of
the data to be processed and the memory block available on each mapper
server.
Working of MapReduce cont’d
Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
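A word count makes the two signatures above concrete. The sketch below imitates the whole flow in plain Python on an invented input; in Hadoop the map and reduce functions would run on separate mapper and reducer servers, and the shuffle-and-sort step would be performed by the framework rather than by a dictionary.

```python
from collections import defaultdict

def map_fn(key, value):
    # <k1, v1> -> list(<k2, v2>): emit (word, 1) for every word in one line.
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    # <k2, list(v2)> -> list(<k3, v3>): aggregate all the counts for one word.
    return [(key, sum(values))]

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]  # invented input

# Map phase: each line goes to a mapper.
intermediate = []
for offset, line in enumerate(lines):
    intermediate.extend(map_fn(offset, line))

# Shuffle and sort: group all intermediate values that share the same key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: one reducer call per key.
result = []
for word in sorted(grouped):
    result.extend(reduce_fn(word, grouped[word]))

print(result)  # [('brown', 1), ('dog', 2), ('fox', 1), ('lazy', 1), ('quick', 2), ('the', 3)]
```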
Class Exercise 1
Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat

Class Exercise 2
Draw the MapReduce process to find the maximum electrical consumption for each year (input: a table of Year and consumption readings).
Data Mining: Data mining is the process of looking for hidden, valid, and
potentially useful patterns in huge data sets. Data Mining is all about
discovering unsuspected/previously unknown relationships amongst the
data. It is a multi-disciplinary skill that uses machine learning, statistics,
AI and database technology.
Natural Language Processing (NLP): NLP gives the machines the ability
to read, understand and derive meaning from human languages.
Text Analytics (TA): TA is the process of extracting meaning out of text.
For example, this can be analyzing text written by customers in a
customer survey, with the focus on finding common themes and trends.
The idea is to be able to examine the customer feedback to inform the
business on taking strategic action, in order to improve customer
experience.
Noisy text analytics: It is a process of information extraction whose goal
is to automatically extract structured or semi-structured information from
noisy unstructured text data.
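As a very small taste of the theme-finding described under Text Analytics, the sketch below counts which words recur across invented customer comments; real text analytics would add stop-word removal, stemming, and similar steps.

```python
from collections import Counter
import re

# Invented customer comments standing in for survey free-text responses.
comments = [
    "Delivery was late and the packaging was damaged.",
    "Great product but delivery took too long.",
    "Packaging could be better; the product itself is great.",
]

words = []
for comment in comments:
    words.extend(re.findall(r"[a-z']+", comment.lower()))  # crude tokenisation

# The most frequent words hint at recurring themes in the feedback.
print(Counter(words).most_common(5))
```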
Appendix cont'd: Example of Data Volumes
1. Outline the main tasks and activities to be performed for each stage of the data analytics life cycle for the following:
A small stock trading organization wants to build a Stock Performance System. You have been tasked to create a data model to predict good and bad stocks based on their history. You also have to build a customized product to handle complex queries, such as calculating the covariance between the stocks for each month.
A mobile health organization captures patients' physical activities by attaching various sensors to different body parts. These sensors measure properties of the motion of diverse body parts, such as acceleration, the rate of turn, magnetic field orientation, etc. You have to build a data model for effectively deriving information about the motion of different body parts like the chest, ankle, etc.
A new airline company wants to start their business efficiently. They are trying to figure out the possible market and their competitors. You have been tasked to analyse and find the most active airports with the maximum number of flyers. You also have to analyse the most popular sources and destinations, along with the airline companies operating between them.
A finance company wants to evaluate their users on the basis of the loans they have taken. They have hired you to find the number of cases per location and categorize the count with respect to the reason for taking a loan. Next, they have also tasked you to display their average risk score.
A new company in the Media and Entertainment domain wants to outsource movie ratings and reviews. They want to know the frequent users who give reviews and ratings consistently for most of the movies. You have to analyze different users based on which user has rated the most movies, their occupations, and their age group.
Analyze the Aadhaar card data set against different research queries, for example: total number of Aadhaar cards approved by state, rejected by state, total number of Aadhaar card applicants by gender, and total number of Aadhaar card applicants by age type, with visual depiction.
A salesperson may manage many other salespeople. A salesperson is managed by only one salesperson. A salesperson can be an agent for many customers. A customer is managed by one salesperson. A customer can place many orders. An order can be placed by one customer. An order lists many inventory items. An inventory item may be listed on many orders. An inventory item is assembled from many parts. A part may be assembled into many inventory items. Many employees assemble an inventory item from many parts. A supplier supplies many parts. A part may be supplied by many suppliers.
2. Consider the following sample data. Draw the MapReduce process to find
the number of customers from each city.
3. Consider the following sample data. Draw the MapReduce process to find
the number of employees from each category of marital status.