Chapter 2 - Introduction to Data Science
(EmTe1012)
Unit objectives
Differentiate between data and information
Describe the essence of data science and the role of a data scientist
Describe the data processing life cycle
Understand different data types from diverse perspectives
Describe the data value chain in the emerging era of big data
Understand the basics of Big Data
Describe the purpose of the Hadoop ecosystem components
An Overview of Data Science
Data science is a multi-disciplinary field that involves extracting insights from vast amounts of data using scientific methods, algorithms, and processes.
It helps to extract knowledge and insights from structured, semi-structured, and unstructured data.
More importantly, it enables you to translate a business problem into a research project and then translate it back into a practical solution.
It offers a range of roles and requires a range of skills.
As an academic discipline and profession, data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
Methodology of data science
[Figure: the data science methodology]
Applications of Data Science
Data science is much more than simply analyzing data; it plays a wide range of roles, for example:
Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage
It can help you detect fraud using advanced machine learning algorithms
It can also help you prevent significant monetary losses
It allows building intelligent capabilities into machines
You can perform sentiment analysis to gauge customer brand loyalty
It enables you to make better and faster decisions
It helps you recommend the right product to the right customer to enhance your business
Data and Information
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, using characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Those facts are suitable for communication, interpretation, or processing by humans or electronic machines.
Cont'd…

Data                      Information
raw facts                 data with context
no context                processed data
just numbers and text     value added to data
                          summarized, organized, analyzed
Cont'd…
Data: 51012
Information:
5/10/12: the date of your final exam
51,012 Birr: the average starting salary of an accounting major
51012: the zip code of Jimma
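A minimal Python sketch of this idea, showing the same raw digits turned into information once context is attached (the variable names and formatting are illustrative, not part of the slide):

```python
# Data + context = information: the same five digits, interpreted three ways.
raw = "51012"  # raw data: just digits, no context

as_date = f"{raw[0]}/{raw[1:3]}/{raw[3:]}"   # "5/10/12": the date of your final exam
as_salary = f"{int(raw):,} Birr"             # "51,012 Birr": an average starting salary
as_zip = raw                                 # "51012": a zip code

print(as_date, as_salary, as_zip, sep="\n")
```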
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
The following are the basic steps of data processing:
Input - in this step, the input data is prepared in some convenient form for processing.
Example: when electronic computers are used, the input data can be recorded on any of several types of storage media, such as a hard disk, CD, flash disk, and so on.
Processing - in this step, the input data is changed to produce data in a more useful form.
Example: a summary of sales for the month can be calculated from the sales orders.
Output - in this step, the result of the preceding processing step is collected.
Example: the output data may be the payroll for employees.
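As a rough illustration, here is a minimal Python sketch of the input-processing-output cycle using the sales-summary example above; the order records are made up:

```python
# A minimal sketch of the input -> processing -> output cycle using the
# sales-summary example; the order records below are illustrative.
sales_orders = [                    # Input: data prepared in a convenient form
    {"item": "pen",  "amount": 120.0},
    {"item": "book", "amount": 340.5},
    {"item": "pen",  "amount": 80.0},
]

monthly_total = sum(order["amount"] for order in sales_orders)   # Processing

print(f"Sales summary for the month: {monthly_total:.2f} Birr")  # Output
```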
Data types and their representation

Data types from a computer programming perspective
In programming, a data type tells the compiler or interpreter how the programmer intends to use the data; common types include integers, Booleans, characters, floating-point numbers, and alphanumeric strings.
Data types from a data analytics perspective
From a data analytics point of view, it is important to understand that there are three common data types or structures:
Structured,
Semi-structured, and
Unstructured data
Cont'd…
Structured data:
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze. It conforms to a tabular format with relationships between the different rows and columns.
E.g., Excel files, SQL databases
Semi-structured data:
Semi-structured data is a form of structured data that does not conform to the formal structure of data models. However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
E.g., XML, JSON
Cont'd…
Unstructured data:
Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner. It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
E.g., audio files, video files, or NoSQL databases
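The contrast between the three types can be sketched in a few lines of Python; the records below are illustrative assumptions:

```python
import json

# Contrasting the three data types; all values are illustrative.

# Structured: fixed schema of rows and columns, like a SQL table or Excel sheet
structured = [("Abebe", 25), ("Sara", 30)]                 # (name, age) rows

# Semi-structured: tags/keys mark semantic elements, but records may vary
semi_structured = json.loads('{"name": "Abebe", "skills": ["Python", "SQL"]}')

# Unstructured: free text with no predefined data model
unstructured = "Abebe joined in 2019 and now leads the analytics team."

print(structured[0], semi_structured["skills"], unstructured[:20])
```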
Metadata
Technically, metadata is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
It provides additional information about a specific set of data; it can conveniently be described as "data about data."
In a set of photographs, for example, metadata could describe when and where the photos were taken, i.e., the date and location.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by big data solutions for initial analysis.
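A minimal sketch of photo metadata as structured "data about data"; the field names and values are assumptions for illustration:

```python
# Metadata as "data about data": the photo file is the data, the fields
# describing it are metadata (the field names and values are assumptions).
photo_metadata = {
    "filename": "IMG_0042.jpg",
    "date_taken": "2023-05-10",
    "location": "Jimma, Ethiopia",
}

# The metadata fields are structured, so they can be filtered directly;
# this is one reason big data solutions often start analysis from metadata.
if photo_metadata["location"].startswith("Jimma"):
    print("Taken in Jimma on", photo_metadata["date_taken"])
```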
Data Value Chain
The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Acquisition
Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
The infrastructure required for big data acquisition must deliver low, predictable latency both in capturing data and in executing queries.
Moreover, the infrastructure must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
Data acquisition is one of the major challenges in big data because of its high-end infrastructure requirements.
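As a rough sketch of acquisition-time filtering and cleaning in Python (the records and cleaning rules are illustrative, not a real pipeline):

```python
# Acquisition-time filtering and cleaning before storage; the records and
# rules are illustrative assumptions, not a production pipeline.
raw_records = [
    {"id": 1, "value": " 42 "},
    {"id": 2, "value": ""},        # empty value: filtered out below
    {"id": 3, "value": "17"},
]

cleaned = [
    {"id": r["id"], "value": int(r["value"].strip())}  # clean: strip and cast
    for r in raw_records
    if r["value"].strip()                              # filter: drop empties
]

print(cleaned)  # ready to load into a warehouse or other storage solution
```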
Data Analysis
Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.
It also deals with making the acquired raw data amenable to use in the decision-making process.
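A minimal pandas sketch of this step, assuming pandas is available (it ships with the Anaconda distribution recommended in the lab tools); the data is made up:

```python
import pandas as pd

# Exploring and transforming a small made-up dataset with pandas.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales":  [120.0, 90.0, 150.0, 60.0],
})

# Transform raw rows into a summary that supports a business decision.
summary = df.groupby("region")["sales"].mean()
print(summary)  # average sales per region
```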
Data Curation
Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
The curation process can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators or annotators who are responsible for improving the accessibility and quality of data.
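One curation activity, validation, can be sketched as a simple quality check in Python; the rules and records are illustrative assumptions:

```python
# Validation, one curation activity: accept only records that meet the
# quality rules (the rules and records are illustrative assumptions).
def validate(record: dict) -> bool:
    """Return True if the record meets the minimal quality requirements."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120 and bool(record.get("name"))

records = [{"name": "Abebe", "age": 25}, {"name": "", "age": -3}]
curated = [r for r in records if validate(r)]
print(curated)  # only the valid record survives curation
```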
Data Storage
Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
The relational database system has been the dominant storage paradigm for over 40 years.
However, given the volume and complexity of today's data, highly scalable NoSQL technologies are increasingly applied as the big data storage model.
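A minimal sketch of the relational paradigm using Python's built-in sqlite3 module; the table and rows are illustrative:

```python
import sqlite3

# The relational storage paradigm in miniature, using Python's built-in
# sqlite3 module; the table and rows are illustrative.
conn = sqlite3.connect(":memory:")                  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 90.0)])

for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)                                      # e.g. ('North', 120.0)

conn.close()
```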
Data Usage
Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
It enhances business decision-making competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
2.5. Basic concepts of big data
Big Data Everywhere!
Lots of data is being collected and warehoused:
Web data and e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Cont'd…
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Big data is characterized by the 4Vs and more:
Volume: machine-generated data is produced in larger quantities than non-traditional data
Large amounts of data: zettabytes/massive datasets
Variety: a large variety of input data, which in turn generates a large amount of data as output
Data comes in many different forms from diverse sources
Velocity: data is generated and must often be processed at high speed, in near real time
Veracity: data may be uncertain or inconsistent, so its quality and trustworthiness must be assessed
Cont'd…
[Figure: the five major use cases of big data]
Cont'd…
[Figure: an example of a big data platform in practice]
How much data?
Google processes more than 20 PB a day (2018)
The Wayback Machine holds 3 PB, growing at 100 TB/month (3/2009)
Facebook processes 500+ TB/day (2019)
eBay has 6.5 PB of user data, growing at 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
NASA's climate simulation data holds 32 PB
Clustered Computing and Hadoop Ecosystem
• Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. To better address the high storage and computational needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits such as:
Resource Pooling: combining storage space and CPU power to process large datasets
High Availability: clusters can provide varying levels of fault tolerance and availability
Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group
Cont'd…
Employing clustered resources requires managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes or computers.
Cluster membership and resource allocation are handled by Apache open-source framework software such as Hadoop's YARN (Yet Another Resource Negotiator).
The assembled cluster machines act seamlessly and enable other software interfaces to process the data.
Hadoop and its Ecosystem
What is Hadoop?
It is an Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
It hides underlying system details and complexities from the user
Developed in Java
Flexible, enterprise-class support for processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo to address the scalability problems of an open-source web technology (Nutch)
Supports a wide variety of data
Cont'd…
Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
CPU + disks = "node"
Nodes can be combined into clusters
New nodes can be added as needed without changing:
Data formats
How data is loaded
How jobs are written
Cont'd…
Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
Hadoop is supplemented by an ecosystem of open-source projects such as those below (a plain-Python sketch of the MapReduce model follows the list):
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
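To make the MapReduce idea concrete, here is a plain-Python simulation of a word count; real Hadoop jobs run distributed across the cluster (e.g., as Java programs or via Hadoop Streaming), so this is only a sketch of the programming model:

```python
from collections import Counter

# A plain-Python simulation of MapReduce word count; real jobs run
# distributed across the cluster, so this only illustrates the model.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

lines = ["big data needs big clusters", "Hadoop processes big data"]
print(reduce_phase(map_phase(lines)))  # Counter({'big': 3, 'data': 2, ...})
```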
Cont'd…
[Figure: the Hadoop ecosystem]
Life cycle of big data with Hadoop
Ingesting data into the system
First, the data is ingested into, or transferred to, Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data
Processing the data in storage
The second stage is processing. In this stage, the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing (see the PySpark sketch below)
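A minimal PySpark sketch of this processing stage; it assumes a working Spark installation, and the HDFS paths below are placeholders:

```python
from pyspark.sql import SparkSession

# Word count over data stored in HDFS; assumes a working Spark installation,
# and the HDFS paths below are placeholders.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # read from HDFS
         .flatMap(lambda line: line.split())               # split lines into words
         .map(lambda word: (word, 1))                      # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)                  # sum counts per word
)

counts.saveAsTextFile("hdfs:///data/output")               # write results to HDFS
spark.stop()
```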
Cont'd…
Computing and analyzing data
The third stage is analyzing and processing the data using open-source frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it.
Hive is also based on the map-and-reduce programming model and is most suitable for structured data
Visualizing the results
The fourth stage is access, which is performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
Laboratory Tools
Python with Jupyter Notebook [Python version > 2.7, Anaconda] (recommended)
IBM SPSS Statistics