Chapter Two
Data Science
Overview of Data Science
• Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from structured, semi-structured and
unstructured data.
Data and Information
• Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner suitable for communication,
interpretation, or processing.
• Information is processed data on which decisions and actions are
based.
Data Processing Cycle
• Input − in this step, the input data is prepared in some convenient form
for processing. The form will depend on the processing machine.
• Processing − in this step, the input data is changed to produce data in
a more useful form.
• Output − in this step, the result of the preceding processing step is
collected.
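• As a minimal sketch of the cycle, assuming a plain-text input file with
one numeric reading per line (the file name readings.txt is illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ProcessingCycle {
  public static void main(String[] args) throws Exception {
    // Input: read raw data into a form convenient for processing.
    List<String> lines = Files.readAllLines(Paths.get("readings.txt"));
    // Processing: transform the input into a more useful form.
    double sum = 0;
    for (String line : lines) sum += Double.parseDouble(line.trim());
    double average = sum / lines.size();
    // Output: present the result in a form decisions can be based on.
    System.out.printf("Average of %d readings: %.2f%n", lines.size(), average);
  }
}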
Data types and their representation
• Unstructured data: data that either does not have a predefined data
model or is not organized in a pre-defined manner.
Data types and their representation
Metadata
• Metadata is data about data; it provides additional information about a
specific set of data.
• From a technical point of view, metadata is not a separate data structure,
but it is one of the most important elements for big data analysis and
big data solutions.
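• As a minimal sketch, metadata can be pictured as key-value pairs that
describe a data file; the field names and values below are illustrative,
not a fixed standard:

import java.util.Map;

public class MetadataExample {
  public static void main(String[] args) {
    // "Data about data": the file's contents are the data; these
    // descriptive fields are the metadata that big data tools index.
    Map<String, String> metadata = Map.of(
        "filename", "sales_2019.csv",
        "format", "text/csv",
        "size_bytes", "1048576",
        "created", "2019-10-20",
        "owner", "analytics-team");
    metadata.forEach((k, v) -> System.out.println(k + " = " + v));
  }
}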
Big Data
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value
and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
activities:
• Data Acquisition
• Data Analysis
• Data Curation
• Data Storage
• Data Usage
Data Value Chain
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before
it is put in a data warehouse or any other storage solution on
which data analysis can be carried out.
• RDBMSs have been the main, and almost unique, solution to the storage
paradigm for nearly 40 years. However, the ACID (Atomicity, Consistency,
Isolation, and Durability) properties that guarantee database transactions
come at the cost of flexibility with regard to schema changes, and of
performance and fault tolerance when data volumes and complexity grow,
making RDBMSs unsuitable for big data scenarios.
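• A minimal sketch of an ACID transaction through JDBC, assuming a
relational database reachable at the illustrative URL below: both updates
commit together, or on failure neither is applied.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AcidTransfer {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost/bank", "user", "secret")) {
      conn.setAutoCommit(false);  // start an explicit transaction
      try (PreparedStatement debit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance - 100 WHERE id = 1");
           PreparedStatement credit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance + 100 WHERE id = 2")) {
        debit.executeUpdate();
        credit.executeUpdate();
        conn.commit();            // durability: both changes persist
      } catch (Exception e) {
        conn.rollback();          // atomicity: undo the partial work
        throw e;
      }
    }
  }
}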
Data Value Chain
[Figure: Big Data Value Chain]
Clustered Computing
• High Availability: Clusters can provide varying levels of fault tolerance
and availability guarantees so that hardware or software failures do not
affect access to data and processing. This becomes increasingly important
as real-time analytics gains emphasis.
Clustered Computing
• Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.
Hadoop and its Ecosystem
The four key characteristics of Hadoop are:
• Economical: ordinary computers can be used for data processing, so the
system is highly economical.
• Reliable: it stores copies of the data on different machines and can
tolerate hardware failure.
• Scalable: it is easily scalable both horizontally and vertically; new
nodes can be added to extend the system.
• Flexible: it can store as much data as needed, structured or
unstructured, and decide how to use it later.
Hadoop and its Ecosystem
• MapReduce: a programming model and processing engine for computing over
huge datasets in parallel. A job runs in two phases: a Map task that
filters and sorts the input into key-value pairs, and a Reduce task that
aggregates the mapped results, as the sketch below shows.
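• The classic word-count job below sketches both phases, assuming Hadoop's
MapReduce Java API (org.apache.hadoop.mapreduce); input and output paths
are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}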
Hadoop and its Ecosystem
• YARN: Yet Another Resource Negotiator
• YARN is one of the most important components of the Hadoop ecosystem;
it provides resource management.
• YARN is called the operating system of Hadoop because it is responsible
for managing and monitoring workloads.
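• As a minimal sketch, assuming Hadoop's YARN client API
(org.apache.hadoop.yarn.client.api.YarnClient), a program can ask the
ResourceManager for the workloads it is tracking:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());  // reads yarn-site.xml from the classpath
    yarn.start();
    // Print one line per application the ResourceManager knows about.
    for (ApplicationReport app : yarn.getApplications()) {
      System.out.printf("%s %s %s%n",
          app.getApplicationId(), app.getName(), app.getYarnApplicationState());
    }
    yarn.stop();
  }
}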
Hadoop and its Ecosystem
• HIVE:
• Apache Hive is an open-source data warehouse system for querying and
analyzing large datasets stored in Hadoop files.
• Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive
automatically translates the SQL-like queries into MapReduce jobs that
execute on Hadoop.
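• A minimal sketch, assuming a HiveServer2 instance at localhost:10000 with
the Hive JDBC driver on the classpath; the logs table and its columns are
hypothetical. Hive turns the HiveQL query into jobs on the cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement();
         // An ordinary SQL-like aggregation over a hypothetical logs table.
         ResultSet rs = stmt.executeQuery(
             "SELECT level, COUNT(*) AS n FROM logs GROUP BY level")) {
      while (rs.next()) {
        System.out.println(rs.getString("level") + "\t" + rs.getLong("n"));
      }
    }
  }
}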
Hadoop and its Ecosystem
• PIG:
• Apache Pig is a high-level platform for analyzing large datasets in
Hadoop. Pig scripts are written in a language called Pig Latin, which Pig
compiles into MapReduce jobs, so developers do not have to write Java
MapReduce code directly.
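• A minimal sketch, assuming Pig's embedded Java API
(org.apache.pig.PigServer) and an illustrative input file: the Pig Latin
statements below count words in local mode, with no cluster required.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);  // local mode, no cluster
    // Each registerQuery call adds one Pig Latin statement to the plan.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery(
        "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    pig.store("counts", "wordcounts");  // triggers execution, writes output
  }
}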
Big Data Life Cycle with Hadoop
Ingesting data into the system
• The first stage of big data processing is Ingest. The data is ingested,
or transferred, to Hadoop from various sources such as relational
databases, systems, or local files. Sqoop transfers data from RDBMSs to
HDFS, whereas Flume transfers event data; a minimal ingestion sketch
follows this list.
• The second stage is Processing. In this stage, the data is stored and
processed. The data is stored in the distributed file system, HDFS, and
in the NoSQL distributed database, HBase.
• Pig converts the data using map and reduce operations and then
analyzes it.
• Hive is also based on map-and-reduce programming and is most suitable
for structured data.
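• As a minimal ingestion sketch, assuming Hadoop's HDFS Java API
(org.apache.hadoop.fs.FileSystem) and illustrative paths: copying a local
file into HDFS is the kind of transfer that Sqoop and Flume automate at
scale.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml/hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      fs.copyFromLocalFile(new Path("/tmp/events.log"),        // local source
                           new Path("/data/raw/events.log"));  // HDFS destination
      System.out.println("Ingested: " + fs.getFileStatus(
          new Path("/data/raw/events.log")).getLen() + " bytes");
    }
  }
}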