Chapter 2 - Introduction to Data Science
(EmTe1012)
Unit objectives
Differentiate between data and information
Describe the essence of data science and the role of a data scientist
Describe the data processing life cycle
Understand different data types from diverse perspectives
Describe the data value chain in the emerging era of big data
Understand the basics of Big Data
Describe the purpose of the Hadoop ecosystem components
An Overview of Data Science
Data science is a multi-disciplinary field that involves extracting insights from vast amounts of data using scientific methods, algorithms, and processes.
It helps to extract knowledge and insights from structured, semi-structured, and unstructured data.
More importantly, it enables you to translate a business problem into a research project and then translate it back into a practical solution.
It offers a range of roles and requires a range of skills.
As an academic discipline and profession, data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
Methodology of data science
[Figure: the data science methodology]
Applications of Data Science
Data science is much more than simply analyzing data; it plays a wide range of roles, for example:
Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage
It can help you detect fraud using advanced machine learning algorithms
It can also help you prevent significant monetary losses
It allows building intelligent capabilities into machines
You can perform sentiment analysis to gauge customer brand loyalty
It enables you to make better and faster decisions
It helps you recommend the right product to the right customer to enhance your business
Data and Information
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, using characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Those facts are suitable for communication, interpretation, or processing by humans or electronic machines.
Cont'd…

Data                      Information
raw facts                 data with context
no context                processed data
just numbers and text     value added to data
                          summarized, organized, analyzed
Cont'd…
Data: 51012
Information:
5/10/12: the date of your final exam
51,012 Birr: the average starting salary of an accounting major
51012: the zip code of Jimma
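A minimal Python sketch of this idea, showing the same raw digits turned into information once context is attached (the variable names and formatting are illustrative, not part of the slide):

```python
# Data + context = information: the same five digits, interpreted three ways.
raw = "51012"  # raw data: just digits, no context

as_date = f"{raw[0]}/{raw[1:3]}/{raw[3:]}"   # "5/10/12": the date of your final exam
as_salary = f"{int(raw):,} Birr"             # "51,012 Birr": an average starting salary
as_zip = raw                                 # "51012": a zip code

print(as_date, as_salary, as_zip, sep="\n")
```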
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
The following are the basic steps of data processing:
Input - in this step, the input data is prepared in some convenient form for processing.
Example: when electronic computers are used, the input data can be recorded on any of several types of storage media, such as a hard disk, CD, flash disk, and so on.
Processing - in this step, the input data is changed to produce data in a more useful form.
Example: a summary of sales for the month can be calculated from the sales orders.
Output - in this step, the result of the preceding processing step is collected.
Example: the output data may be the payroll for employees.
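As a rough illustration, here is a minimal Python sketch of the input-processing-output cycle using the sales-summary example above; the order records are made up:

```python
# A minimal sketch of the input -> processing -> output cycle using the
# sales-summary example; the order records below are illustrative.
sales_orders = [                    # Input: data prepared in a convenient form
    {"item": "pen",  "amount": 120.0},
    {"item": "book", "amount": 340.5},
    {"item": "pen",  "amount": 80.0},
]

monthly_total = sum(order["amount"] for order in sales_orders)   # Processing

print(f"Sales summary for the month: {monthly_total:.2f} Birr")  # Output
```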
Data types and their representation

Data types from a computer programming perspective
In programming, a data type tells the compiler or interpreter how the programmer intends to use the data; common types include integers, Booleans, characters, floating-point numbers, and alphanumeric strings.
Data types from a data analytics perspective
From a data analytics point of view, it is important to understand that there are three common data types or structures:
Structured,
Semi-structured, and
Unstructured data
Cont'd…
Structured data:
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze. It conforms to a tabular format with relationships between the different rows and columns.
E.g., Excel files, SQL databases
Semi-structured data:
Semi-structured data is a form of structured data that does not conform to the formal structure of data models. However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
E.g., XML, JSON
Cont'd…
Unstructured data:
Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner. It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
E.g., audio files, video files, or NoSQL databases
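The contrast between the three types can be sketched in a few lines of Python; the records below are illustrative assumptions:

```python
import json

# Contrasting the three data types; all values are illustrative.

# Structured: fixed schema of rows and columns, like a SQL table or Excel sheet
structured = [("Abebe", 25), ("Sara", 30)]                 # (name, age) rows

# Semi-structured: tags/keys mark semantic elements, but records may vary
semi_structured = json.loads('{"name": "Abebe", "skills": ["Python", "SQL"]}')

# Unstructured: free text with no predefined data model
unstructured = "Abebe joined in 2019 and now leads the analytics team."

print(structured[0], semi_structured["skills"], unstructured[:20])
```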
Metadata
Technically, metadata is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
It provides additional information about a specific set of data; it can conveniently be described as "data about data."
In a set of photographs, for example, metadata could describe when and where the photos were taken, i.e., the date and location.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by big data solutions for initial analysis.
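A minimal sketch of photo metadata as structured "data about data"; the field names and values are assumptions for illustration:

```python
# Metadata as "data about data": the photo file is the data, the fields
# describing it are metadata (the field names and values are assumptions).
photo_metadata = {
    "filename": "IMG_0042.jpg",
    "date_taken": "2023-05-10",
    "location": "Jimma, Ethiopia",
}

# The metadata fields are structured, so they can be filtered directly;
# this is one reason big data solutions often start analysis from metadata.
if photo_metadata["location"].startswith("Jimma"):
    print("Taken in Jimma on", photo_metadata["date_taken"])
```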
Data Value Chain
The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Acquisition
Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
The infrastructure required for big data acquisition must deliver low, predictable latency both in capturing data and in executing queries.
Moreover, the infrastructure must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
Data acquisition is one of the major challenges in big data because of its high-end infrastructure requirements.
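As a rough sketch of acquisition-time filtering and cleaning in Python (the records and cleaning rules are illustrative, not a real pipeline):

```python
# Acquisition-time filtering and cleaning before storage; the records and
# rules are illustrative assumptions, not a production pipeline.
raw_records = [
    {"id": 1, "value": " 42 "},
    {"id": 2, "value": ""},        # empty value: filtered out below
    {"id": 3, "value": "17"},
]

cleaned = [
    {"id": r["id"], "value": int(r["value"].strip())}  # clean: strip and cast
    for r in raw_records
    if r["value"].strip()                              # filter: drop empties
]

print(cleaned)  # ready to load into a warehouse or other storage solution
```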
Data Analysis
Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.
It also deals with making the acquired raw data amenable to use in the decision-making process.
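A minimal pandas sketch of this step, assuming pandas is available (it ships with the Anaconda distribution recommended in the lab tools); the data is made up:

```python
import pandas as pd

# Exploring and transforming a small made-up dataset with pandas.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales":  [120.0, 90.0, 150.0, 60.0],
})

# Transform raw rows into a summary that supports a business decision.
summary = df.groupby("region")["sales"].mean()
print(summary)  # average sales per region
```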
Data Curation
Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
The curation process can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data curation is performed by expert curators or annotators who are responsible for improving the accessibility and quality of data.
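One curation activity, validation, can be sketched as a simple quality check in Python; the rules and records are illustrative assumptions:

```python
# Validation, one curation activity: accept only records that meet the
# quality rules (the rules and records are illustrative assumptions).
def validate(record: dict) -> bool:
    """Return True if the record meets the minimal quality requirements."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120 and bool(record.get("name"))

records = [{"name": "Abebe", "age": 25}, {"name": "", "age": -3}]
curated = [r for r in records if validate(r)]
print(curated)  # only the valid record survives curation
```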
Data Storage
Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
The relational database system has been the dominant storage paradigm for over 40 years.
However, given the volume and complexity of today's data, highly scalable NoSQL technologies are increasingly applied as the big data storage model.
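A minimal sketch of the relational paradigm using Python's built-in sqlite3 module; the table and rows are illustrative:

```python
import sqlite3

# The relational storage paradigm in miniature, using Python's built-in
# sqlite3 module; the table and rows are illustrative.
conn = sqlite3.connect(":memory:")                  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 90.0)])

for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)                                      # e.g. ('North', 120.0)

conn.close()
```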
Data Usage
Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
It enhances business decision-making competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
2.5. Basic concepts of big data
Big Data Everywhere!
Lots of data is being collected and warehoused:
Web data and e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Social networks
Cont'd…
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
The common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Big data is characterized by the 4Vs and more:
Volume: machine-generated data is produced in larger quantities than non-traditional data
Large amounts of data: zettabytes/massive datasets
Variety: a large variety of input data, which in turn generates a large amount of data as output
Data comes in many different forms from diverse sources
Velocity: data is generated and must often be processed at high speed, in near real time
Veracity: data may be uncertain or inconsistent, so its quality and trustworthiness must be assessed
Cont'd…
[Figure: the five major use cases of big data]
Cont'd…
[Figure: an example of a big data platform in practice]
How much data?
Google processes more than 20 PB a day (2018)
The Wayback Machine holds 3 PB, growing at 100 TB/month (3/2009)
Facebook processes 500+ TB/day (2019)
eBay has 6.5 PB of user data, growing at 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
NASA's climate simulation data holds 32 PB
Clustered Computing and Hadoop Ecosystem
• Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. To better address the high storage and computational needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits such as:
Resource Pooling: combining storage space and CPU power to process large datasets
High Availability: clusters can provide varying levels of fault tolerance and availability
Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group
Cont'd…
Employing clustered resources requires managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes or computers.
Cluster membership and resource allocation are handled by Apache open-source framework software such as Hadoop's YARN (Yet Another Resource Negotiator).
The assembled cluster machines act seamlessly and enable other software interfaces to process the data.
Hadoop and its Ecosystem
What is Hadoop?
It is an Apache open-source software framework for reliable, scalable, distributed computing over massive amounts of data
It hides underlying system details and complexities from the user
Developed in Java
Flexible, enterprise-class support for processing large volumes of data
Inspired by Google technologies (MapReduce, GFS, BigTable, …)
Initiated at Yahoo to address the scalability problems of an open-source web technology (Nutch)
Supports a wide variety of data
Cont'd…
Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
CPU + disks = "node"
Nodes can be combined into clusters
New nodes can be added as needed without changing:
Data formats
How data is loaded
How jobs are written
Cont'd…
Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
Hadoop is supplemented by an ecosystem of open-source projects such as those below (a plain-Python sketch of the MapReduce model follows the list):
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
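To make the MapReduce idea concrete, here is a plain-Python simulation of a word count; real Hadoop jobs run distributed across the cluster (e.g., as Java programs or via Hadoop Streaming), so this is only a sketch of the programming model:

```python
from collections import Counter

# A plain-Python simulation of MapReduce word count; real jobs run
# distributed across the cluster, so this only illustrates the model.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

lines = ["big data needs big clusters", "Hadoop processes big data"]
print(reduce_phase(map_phase(lines)))  # Counter({'big': 3, 'data': 2, ...})
```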
Cont'd…
[Figure: the Hadoop ecosystem]
Life cycle of big data with Hadoop
Ingesting data into the system
First, the data is ingested into, or transferred to, Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data
Processing the data in storage
The second stage is processing. In this stage, the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing (see the PySpark sketch below)
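A minimal PySpark sketch of this processing stage; it assumes a working Spark installation, and the HDFS paths below are placeholders:

```python
from pyspark.sql import SparkSession

# Word count over data stored in HDFS; assumes a working Spark installation,
# and the HDFS paths below are placeholders.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # read from HDFS
         .flatMap(lambda line: line.split())               # split lines into words
         .map(lambda word: (word, 1))                      # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)                  # sum counts per word
)

counts.saveAsTextFile("hdfs:///data/output")               # write results to HDFS
spark.stop()
```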
Cont'd…
Computing and analyzing data
The third stage is analyzing and processing the data using open-source frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it.
Hive is also based on the map-and-reduce programming model and is most suitable for structured data
Visualizing the results
The fourth stage is access, which is performed by tools such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.
Laboratory Tools
Python with Jupyter Notebook [Python version > 2.7, Anaconda] (recommended)
IBM SPSS Statistics