
Introduction to Emerging Technologies

(EmTe1012)

Chapter 2
Data Science
Unit objectives
 Differentiate between data and information.
 Describe the essence of data science and the role of a data scientist.
 Describe the data processing life cycle.
 Understand different data types from diverse perspectives.
 Describe the data value chain in the emerging era of big data.
 Understand the basics of Big Data.
 Describe the purpose of the Hadoop ecosystem components.

An Overview of Data Science
 Data science is a multi-disciplinary field which involves
extracting insights from vast amounts of data using
scientific methods, algorithms, and processes.
 It helps to extract knowledge and insights from structured, semi-structured, and unstructured data.
 More importantly, it enables you to translate a business problem into a research project and then translate it back into a practical solution.
 It offers a range of roles and requires a range of skills.
 As an academic discipline and profession, data science
continues to evolve as one of the most promising and in-
demand career paths for skilled professionals.

Cont'd…

 Data science is much more than simply analyzing data.

E.g., supermarket data: on Thursday nights, people who buy milk also tend to buy beer.
 Catalog the order of items,
 Know the profile of your best customer,
 Predict the most wanted items for the next 2 months,
 Estimate the probability that a customer who buys a computer also buys a pen drive.

Methodology of data science

[Figure: Data science methodology]
Application of Data science
 Data science is much more than simply analyzing data;
 It plays a wide range of roles, as follows:
 Data is the oil of today's world. With the right tools, technologies, and algorithms, we can convert data into a distinctive business advantage.
 It can help you to detect fraud using advanced machine learning algorithms.
 It can also help you to prevent significant monetary losses.
 It allows you to build intelligent capabilities into machines.
 You can perform sentiment analysis to gauge customer brand loyalty.
 It enables you to make better and faster decisions.
 It helps you to recommend the right product to the right customer to enhance your business.

Data and Information
 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Those facts are suitable for communication, interpretation, or processing by humans or electronic machines.

 Information is interpreted data: it is created from organized, structured, and processed data in a particular context, on which decisions and actions are based.

Cont'd…

Data:
 raw facts
 no context
 just numbers and text

Information:
 data with context
 processed data
 value added to data (summarized, organized, analyzed)

Cont'd…

 Data: 51012
 Information:
 5/10/12: the date of your final exam.
 51,012 Birr: the average starting salary of an accounting major.
 51012: the zip code of Jimma.

Data Processing Cycle
 Data processing is the re-structuring or re-ordering of data, by people or machines, to increase its usefulness and add value for a particular purpose.
 The following are the basic steps of data processing:

 Input: in this step, the input data is prepared in some convenient form for processing.
 For example, when electronic computers are used, the input data can be recorded on any of several types of storage media, such as a hard disk, CD, or flash disk.

 Processing: in this step, the input data is changed to produce data in a more useful form.
 For example, a summary of sales for the month can be calculated from the sales orders.

 Output: at this stage, the result of the preceding processing step is collected.
 For example, the output data may be the payroll for employees.
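The three steps can be illustrated with a minimal Python sketch, using the monthly sales summary from the Processing step above (the sales figures are hypothetical):

```python
# Input: raw sales orders for the month (hypothetical amounts, in Birr)
sales_orders = [1200, 450, 980, 2310, 760]

# Processing: change the input data into a more useful form
monthly_total = sum(sales_orders)
average_order = monthly_total / len(sales_orders)

# Output: collect the result of the processing step
print(f"Monthly sales total: {monthly_total} Birr")
print(f"Average order size: {average_order:.2f} Birr")
```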

Data types and their representation

 In computer science and computer programming, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 A data type constrains the values that an expression, such as a variable or a function, might take.
 A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.

Data types from computer programming perspective

 Almost all programming languages explicitly include the notion of a data type, with varying terminology. Common data types include the following (see the sketch below):
 Integer (int): used to store whole numbers, mathematically known as integers
 Boolean (bool): used to represent values restricted to one of two values: true or false
 Character (char): used to store a single character
 Floating-point number (float): used to store real numbers
 Alphanumeric string (string): used to store a combination of characters and numbers
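A minimal sketch of these types in Python (note that Python has no separate char type; a single character is just a one-character string):

```python
count = 42        # integer (int): whole numbers
is_valid = True   # boolean (bool): restricted to True or False
grade = "A"       # character: a one-character string in Python
price = 19.99     # floating-point number (float): real numbers
label = "Item42"  # alphanumeric string (str): characters and digits

# The type determines which operations are allowed on each value
for value in (count, is_valid, grade, price, label):
    print(type(value).__name__, value)
```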

Data types from Data Analytics perspective
 From a data analytics point of view, it is important to understand that there are three common data types or structures:
 Structured,
 Semi-structured, and
 Unstructured data types

Cont'd…
 Structured data:-
 Structured data is data that adheres to a pre-defined data model
and is therefore straightforward to analyze. Structured data
conforms to a tabular format with a relationship between the
different rows and columns.
 E.g. Excel files, SQL databases
 Semi structured data :-
 Semi-structured data is a form of structured data that does not conform to a formal tabular structure. However, such files contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
 E.g. XML, JSON
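A minimal sketch contrasting the two, using hypothetical customer records (sqlite3 and json are in Python's standard library):

```python
import json
import sqlite3

# Structured: rows and columns in a pre-defined (SQL) schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Abebe')")
print(conn.execute("SELECT * FROM customers").fetchall())

# Semi-structured: JSON uses tags/markers rather than a fixed schema,
# so individual records may carry different fields
record = json.loads('{"id": 2, "name": "Sara", "tags": ["vip"]}')
print(record["name"], record.get("tags", []))
```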

Cont'd…
 Unstructured data:-
 Unstructured data is information that either does not have a
predefined data model or is not organized in a pre-defined manner.
 It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
 E.g. audio, video files or NoSQL databases

Metadata
 Technically, metadata is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions.
 It provides additional information about a specific set of data; it can conveniently be described as "data about data".
 In a set of photographs, for example, metadata could describe when and where the photos were taken, i.e., the date and location.
 The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by Big Data solutions for initial analysis.
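A minimal sketch of the photograph example (the metadata fields and values are hypothetical):

```python
# The photo itself is unstructured, but its metadata fields
# ("data about data") are structured and easy to analyze
photo_metadata = {
    "filename": "IMG_0042.jpg",   # hypothetical example file
    "date_taken": "2020-05-10",
    "location": "Jimma, Ethiopia",
}

# Initial analysis can filter on metadata without opening the image
if photo_metadata["location"].startswith("Jimma"):
    print("Taken in Jimma on", photo_metadata["date_taken"])
```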

Data Value Chain
 Data Value Chain describes the information flow within a big data
system as a series of steps needed to generate value and useful
insights from data.
 The Big Data Value Chain identifies the following key high-level activities: Data Acquisition, Data Analysis, Data Curation, Data Storage, and Data Usage.

Data Acquisition
 Data acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage
solution on which data analysis can be carried out.
 The infrastructure required for big data acquisition must deliver
low, predictable latency in both capturing data and in executing
queries.
 Moreover, the infrastructure must handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
 Data acquisition is one of the major challenges in big data because of its high-end infrastructure requirements.

Data Analysis
 Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a
business point of view.
 It also deals with making the acquired raw data amenable to use in the decision-making process.
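As a minimal sketch, the exploring/transforming step might look like this in Python with pandas (bundled with the Anaconda distribution listed under the lab tools; the sales data is hypothetical):

```python
import pandas as pd

# Hypothetical raw sales records acquired in the previous stage
raw = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales": [1200, 450, 980, 2310],
})

# Transform and model: aggregate to highlight relevant patterns
summary = raw.groupby("region")["sales"].agg(["sum", "mean"])
print(summary)  # a decision-ready view of the raw data
```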

Data Curation
 Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
 Its processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation (see the validation sketch below).
 Data curation is performed by expert curators or annotators who are responsible for improving the accessibility and quality of data.
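A minimal sketch of one curation activity, validation, with hypothetical quality rules:

```python
# Hypothetical records awaiting curation
records = [
    {"id": 1, "email": "user@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": -5},  # fails both rules
]

def is_valid(record):
    """Basic data-quality rules applied before a record is preserved."""
    return "@" in record["email"] and 0 <= record["age"] <= 120

clean = [r for r in records if is_valid(r)]
flagged = [r for r in records if not is_valid(r)]
print(f"{len(clean)} valid, {len(flagged)} flagged for the curator")
```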

Data Storage
 Data storage is the persistence and management of data in a
scalable way that satisfies the needs of applications that require
fast access to the data.
 Relational database systems have been used as the storage paradigm for over 40 years.
 Given the recent growth in the volume and complexity of data, highly scalable NoSQL technologies are now applied as the big data storage model.

Data Usage
 Data usage covers the data-driven business activities that need
access to data, its analysis, and the tools needed to integrate
the data analysis within the business activity.
 It enhances competitiveness in business decision making through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.

2.5. Basic concepts of big data
Big Data Everywhere!
 Lots of data is being collected and warehoused:
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Social networks

Cont'd…
 Big data is the term for a collection of data sets so large and complex that
it becomes difficult to process using on-hand database management tools or
traditional data processing applications.
 The common scale of big datasets is constantly shifting and may vary
significantly from organization to organization.
 Big data is characterized by the 4 Vs and more:
 Volume: machine-generated data is produced in larger quantities than non-traditional data.
 Large amounts of data: zettabytes / massive datasets.
 Velocity: the speed of data processing.
 Data is live, streaming, or in motion.
 Variety: a large variety of input data, which in turn generates a large amount of data as output.
 Data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it? etc.

Cont'd…
 [Figure: The 5 major use cases of big data]
Cont'd…
 [Figure: An example of a big data platform in practice]
How much data?
 Google processes more than 20 PB a day (2018)
 The Wayback Machine holds 3 PB + 100 TB/month (3/2009)
 Facebook processes 500+ TB/day (2019)
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)
 CERN's Large Hadron Collider (LHC) generates 15 PB a year
 NASA Climate Simulation: 32 petabytes

Clustered Computing and Hadoop Ecosystem
• Clustered Computing
 Because of the qualities of big data, individual computers are
often inadequate for handling the data at most stages. To better
address the high storage and computational needs of big data,
computer clusters are a better fit.
 Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits such as:
 Resource Pooling: combining storage space and CPUs to process large datasets.
 High Availability: clusters can provide varying levels of fault tolerance and availability.
 Easy Scalability: clusters make it easy to scale horizontally by adding additional machines to the group.

Cont'd…
 Employing clustered resources may require managing cluster
membership, coordinating resource sharing, and scheduling actual
work on individual nodes or computers.
 The cluster membership and resource allocation tasks are handled by Apache open-source framework software such as Hadoop's YARN (which stands for Yet Another Resource Negotiator).
 The assembled cluster machines act seamlessly, helping other software interfaces to process the data.

Hadoop and its Ecosystem
 What is Hadoop?
 It is an Apache open-source software framework for the reliable, scalable, distributed computing of massive amounts of data
 Hides underlying system details and complexities from user
 Developed in Java
 Flexible, enterprise-class support for processing large
volumes of data
 Inspired by Google technologies (MapReduce, GFS,
BigTable, …)
 Initiated at Yahoo! to address the scalability problems of an open-source web technology (Nutch)
 Supports wide variety of data

Cont'd…
 Hadoop enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
 CPU + disks = “node”
 Nodes can be combined into clusters
 New nodes can be added as needed without changing
 Data formats
 How data is loaded
 How jobs are written

Cont'd…
 Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
 Hadoop is supplemented by an ecosystem of open source projects
such as:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: programming-based data processing (see the sketch after this list)
 Spark: in-memory data processing
 Pig, Hive: query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: machine learning algorithm libraries
 Solr, Lucene: searching and indexing
 ZooKeeper: cluster management
 Oozie: job scheduling
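A minimal pure-Python simulation of the MapReduce word-count pattern (the example documents are hypothetical; a real Hadoop job distributes these same phases across cluster nodes):

```python
from collections import defaultdict

documents = ["big data needs big clusters", "data drives decisions"]

# Map phase: emit (word, 1) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 2, ...}
```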

Cont'd…
 [Figure: The Hadoop ecosystem]
Life cycle of big data with Hadoop
 Ingesting data into the system
 First, the data is ingested into (transferred to) Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data.
 Processing the data in storage
 The second stage is processing. In this stage, the data is stored and processed.
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.

Cont'd…
 Computing and analyzing data
 The third stage is analyzing and processing the data using open-source frameworks such as Pig, Hive, and Impala.
 Pig converts the data using map and reduce steps and then analyzes it.
 Hive is also based on map and reduce programming and is most suitable for structured data.
 Visualizing the results
 The fourth stage is Access, which is performed by tools such
as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed by users.

Laboratory Tools
 Python with Jupyter Notebook [Python version > 2.7, Anaconda] (recommended)
 IBM SPSS Statistics
