Chapter 2 - Data Science
What is Data Science?
• Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and
insights from structured and unstructured data.
• Data Science is about data gathering, analysis and decision-making.
• Data scientists
– Must master the full spectrum of the data science life cycle
• Data type is simply an attribute of data which tells the compiler or interpreter how the programmer intends to use the data.
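As a small illustration of this point, the Python sketch below (with made-up variable names) shows the same characters being treated differently depending on their type:

# Minimal sketch: the declared type tells the interpreter how a value may be used.
price_as_text = "25"          # str: a sequence of characters
price_as_number = int("25")   # int: a whole number that supports arithmetic

print(price_as_text * 2)      # "2525" -> string repetition
print(price_as_number * 2)    # 50     -> numeric multiplication
print(type(price_as_text), type(price_as_number))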
Data types and their representation – Data Analytics Perspective
– Structured data,
– Unstructured data, and
– Semi-structured data.
Structured Data …
• Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyse; it typically conforms to a tabular format with relationships between rows and columns, as in SQL databases or spreadsheets.
Unstructured Data …
The ability to store and process unstructured data has grown greatly in recent years, with many new technologies and tools coming to the market.
For example:
– MongoDB is optimized to store documents.
– Apache Giraph is optimized for storing relationships between nodes.
• The ability to analyze unstructured data is especially relevant in the
context of Big Data, since a large part of data in organizations is
unstructured.
• Think about pictures, videos or PDF documents.
• The ability to extract value from unstructured data is one of the main drivers behind the rapid growth of Big Data.
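As a concrete sketch of document storage, the snippet below stores and queries metadata about an unstructured asset with the pymongo client; the connection string, database, and collection names are assumptions for illustration.

# Minimal sketch: storing metadata about an unstructured asset (a PDF) in
# MongoDB. The connection string, database and collection names are
# illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["media_db"]["documents"]

# Documents in the same collection do not need to share a fixed schema.
collection.insert_one({
    "title": "Quarterly report",
    "format": "pdf",
    "tags": ["finance", "2024"],
})

for doc in collection.find({"format": "pdf"}):
    print(doc["title"])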
Semi-structured Data …
• A form of structured data that does not conform to the formal structure of data models associated with relational databases or other data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. JSON and XML are common examples.
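For instance, JSON records are self-describing: the short sketch below (with invented field names) parses two records whose fields differ but which still carry tags that separate their semantic elements.

# Minimal sketch: JSON as semi-structured data. Each record carries its own
# tags (field names), and different records may have different fields.
import json

records = [
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
    '{"id": 2, "name": "Bob", "skills": ["python", "sql"]}',
]

for raw in records:
    record = json.loads(raw)          # parse the tagged structure
    print(record["name"], "->", sorted(record.keys()))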
Data Acquisition ...
It is the process of gathering, filtering, and cleaning data before it is put in a
data warehouse or any other storage solution on which data analysis can be
carried out.
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
The infrastructure required to support the acquisition of big data must:
― deliver low, predictable latency both in capturing data and in executing queries;
― be able to handle very high transaction volumes, often in a distributed environment; and
― support flexible and dynamic data structures.
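As a rough sketch of the gather-filter-clean step (record fields, rules, and thresholds below are invented for illustration), a tiny acquisition routine might look like this before the data is loaded into a warehouse:

# Minimal sketch of data acquisition: gather raw records, filter out invalid
# ones, and clean the rest before loading them into storage. Field names and
# validation rules are illustrative assumptions.

raw_records = [
    {"sensor_id": "s1", "temp_c": "21.5"},
    {"sensor_id": "",   "temp_c": "19.0"},   # missing id  -> filtered out
    {"sensor_id": "s2", "temp_c": "bad"},    # unparseable -> filtered out
]

def clean(record):
    """Return a typed, cleaned record, or None if it cannot be repaired."""
    try:
        return {"sensor_id": record["sensor_id"], "temp_c": float(record["temp_c"])}
    except (KeyError, ValueError):
        return None

cleaned = [clean(r) for r in raw_records if r.get("sensor_id")]
cleaned = [r for r in cleaned if r is not None]
print(cleaned)   # ready to be written to the warehouse or data lake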
Data Analysis …
― Data analysis is concerned with making the acquired raw data amenable to use in decision-making as well as domain-specific usage.
― Related areas include data mining, business intelligence, and machine learning.
Data Curation…
― The active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage, through activities such as selection, classification, transformation, validation, and preservation.
― It can be performed by expert curators who are responsible for improving the accessibility and quality of data.
― Data curators (also known as scientific curators or data annotators) are responsible for ensuring that data are trustworthy, discoverable, accessible, and reusable.
Data Storage…
― It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the main, and
almost unique, solution to the storage paradigm for nearly 40 years.
However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes, and their performance and fault tolerance suffer as data volumes and complexity grow, making RDBMSs unsuitable for many big data scenarios.
NoSQL technologies have been designed with the scalability goal in mind and
present a wide range of solutions based on alternative data models.
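The flexibility point can be sketched without reference to any particular product: in a relational table every row must fit the declared columns, while a document-style model lets records of different shapes sit side by side. The plain-Python illustration below is only a mental model, not a specific NoSQL API.

# Plain-Python mental model of schema flexibility (not a real NoSQL API).

# Relational mindset: every row must match the declared columns.
columns = ("id", "name", "email")
rows = [(1, "Alice", "alice@example.com")]

# Document mindset: each record describes itself; shapes may differ,
# and adding a new field requires no schema migration.
collection = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "devices": ["phone", "laptop"]},
]

for doc in collection:
    print(doc["id"], doc.get("email", "<no email>"))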
Data Usage…
― It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
― Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Basic concepts of big data …
This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
What is big data …
A large and diverse set of information that grows at an ever-increasing rate.
A collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
It encompasses:
― The volume of information,
― The velocity, or speed, at which it is created and collected, and
― The scope of the data points being covered.
Characteristics of big data:
Volume:-
― Volume refers to the huge amount of data that's generated and stored.
― Traditional data is measured in familiar sizes like megabytes, gigabytes and terabytes, whereas big data is measured in petabytes and zettabytes (see the sizing sketch after this slide).
Variety:-
― Variety refers to the different types of data being collected from various sources, including text, video, images and audio.
― Most data is unstructured, meaning it's unorganized and difficult for conventional data tools to analyse.
― Everything from emails and videos to scientific and meteorological data can constitute a big data stream, each with its own unique attributes.
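To make the volume units concrete, here is a quick back-of-the-envelope calculation (decimal SI units assumed):

# Back-of-the-envelope sizing with decimal (SI) units as an assumption:
# 1 TB = 10**12 bytes, 1 PB = 10**15 bytes, 1 ZB = 10**21 bytes.
TB, PB, ZB = 10**12, 10**15, 10**21

print(PB // TB)        # 1000    -> a petabyte is a thousand terabytes
print(ZB // PB)        # 1000000 -> a zettabyte is a million petabytes

# e.g. storing one petabyte of raw data on 4 TB commodity drives:
print(PB / (4 * TB))   # 250.0 drives, before any replication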
Characteristics of big data:
Velocity:-
― Big data is generated, processed and analysed at high speeds.
― Companies and organizations must have the capabilities to harness this data and
generate insights from it in real-time, otherwise it’s not very useful. Real-time
processing allows decision makers to act quickly.
Veracity:- the degree of accuracy in data sets and how trustworthy they are.
― Raw data collected from various sources can cause data quality issues that might
be difficult to pinpoint.
― If they aren't fixed through data cleansing processes, bad data leads to analysis
errors that can undermine the value of business analytics initiatives.
― Data management and analytics teams also need to ensure that they have
enough accurate data available to produce valid results.
Characteristics of Big Data – Variety
Big data problems are often unique because of the wide range of both the
sources being processed and their relative quality.
Data can be ingested from
Internal systems like application and server logs,
From social media feeds and other external APIs,
From physical device sensors, and from other providers.
The formats and types of media can vary significantly as well.
Rich media like images, video files, and audio recordings are ingested alongside
text files, structured logs, etc.
Traditional data processing systems expect data entering the pipeline to be labeled, formatted, and organized, whereas big data systems usually accept and store data closer to its raw state.
Clustered Computing …
• Because of the quantities of big data, individual computers are often inadequate for
handling the data at most stages.
• Therefore, to address the high storage and computational needs of big data,
computer clusters are a better fit.
• Big data clustering software combines the resources of many smaller machines, to
provide a number of benefits:
1. Resource Pooling:
• Combining the available storage space to hold data is an obvious benefit, but CPU and memory pooling are also extremely important.
• Processing large datasets requires large amounts of all three of these resources.
Clustered Computing …
2. High Availability:
• Clusters can provide varying levels of fault tolerance and availability guarantees, so that hardware or software failures do not interrupt access to data and processing.
Clustered Computing …
Using clusters requires a solution for managing:
1. Cluster membership,
2. Resource sharing, and
3. Work scheduling on individual nodes.
Cluster membership and resource allocation can be handled by software such as:
– Hadoop's YARN (Yet Another Resource Negotiator), or
– Apache Mesos.
Hadoop and its Ecosystem …
― Fault tolerant: data is replicated across multiple nodes, so processing can recover from hardware failure.
― Flexible: it is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
Hadoop has an ecosystem that has evolved from its four core components:
1) Data management,
2) Data access,
3) Data processing, and
4) Data storage.
HDFS
In the traditional approach, all data was stored in a single central database.
― With the rise of big data, a single database was not enough to handle the task.
― The solution was to use a distributed approach to store the massive volume of
information.
― Data was divided up and allocated to many individual databases.
― HDFS is a file system specially designed for storing huge datasets on commodity hardware, storing information in different formats on various machines.
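As an illustration, the sketch below uses the third-party Python hdfs (WebHDFS) client to write and read a small file; the NameNode address, port, user, and paths are assumptions, and the hdfs dfs shell or PyArrow would work just as well.

# Minimal sketch using the third-party `hdfs` WebHDFS client (an assumption;
# the `hdfs dfs` shell or PyArrow could be used instead). The NameNode URL,
# user and paths below are illustrative.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file; HDFS transparently splits large files into blocks and
# replicates them across DataNodes.
client.write("/data/example/greeting.txt", data="hello hdfs\n", overwrite=True)

# Read it back.
with client.read("/data/example/greeting.txt") as reader:
    print(reader.read().decode("utf-8"))

print(client.list("/data/example"))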
• YARN (Yet Another Resource Negotiator)
― YARN is the resource management layer of Hadoop: it allocates cluster resources (CPU, memory) to applications and schedules their tasks across the nodes.
Sqoop…
― Sqoop is used to transfer data between Hadoop and external datastores such as relational databases and enterprise data warehouses.
― It imports data from external data stores into HDFS, Hive, and HBase.
― Under the hood it launches MapReduce jobs, which process large volumes of data in a parallel, distributed manner.
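A typical import could be launched as sketched below; the JDBC URL, table name, credentials, and target directory are placeholders, not values from this chapter.

# Minimal sketch: launching a Sqoop import from Python via the shell.
# The JDBC URL, table, credentials and target directory are placeholders.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",  # avoid plain-text passwords
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",                         # parallel map tasks
], check=True)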
Flume…
― Flume is another data collection and ingestion tool:
― a distributed service for collecting, aggregating, and moving large amounts of log data.
― It ingests online streaming data from social media, log files, and web servers into HDFS.
Hive
― Hive uses SQL (Structured Query Language) to facilitate the reading,
writing, and management of large datasets residing in distributed
storage.
― Hive was developed with a vision of incorporating the concepts of tables and columns with SQL, since users were already comfortable with writing queries in SQL.
― Apache Hive has two major components:
― Hive Command Line
― JDBC/ ODBC driver
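As a sketch of how a client might talk to Hive, the snippet below uses the third-party PyHive library over the HiveServer2 (Thrift) interface; the host, port, user, and table are assumptions, and the Hive command line or a JDBC/ODBC driver would serve the same purpose.

# Minimal sketch: querying Hive from Python with the third-party PyHive
# client over HiveServer2/Thrift (an assumption; the Hive CLI or a JDBC/ODBC
# driver would work too). Host, port, user and table names are illustrative.
from pyhive import hive

conn = hive.Connection(host="hive-server-host", port=10000, username="analyst")
cursor = conn.cursor()

# Hive turns this SQL into distributed jobs over data stored in HDFS.
cursor.execute("SELECT country, COUNT(*) AS n FROM web_logs GROUP BY country")
for country, n in cursor.fetchall():
    print(country, n)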
Spark
― Spark is a large framework in its own right: an open-source distributed computing engine for processing and analysing vast volumes of real-time data.
― For in-memory workloads it can run up to 100 times faster than MapReduce.
― Spark provides an in-memory computation of data, used to process and
analyse real-time streaming data such as stock market and banking
data, among other things.
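A minimal PySpark sketch of an in-memory DataFrame aggregation is shown below; the input path and column names are assumptions for illustration.

# Minimal PySpark sketch: an in-memory DataFrame aggregation. The input path
# and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trades-demo").getOrCreate()

# Read a CSV of trades and compute the average price per symbol in memory.
df = spark.read.csv("/data/trades.csv", header=True, inferSchema=True)

(df.groupBy("symbol")
   .agg(F.avg("price").alias("avg_price"))
   .show())

spark.stop()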
Mahout
― Mahout is used to create scalable and distributed machine
learning algorithms such as clustering, linear regression,
classification, and so on.
― It has a library that contains built-in algorithms for
collaborative filtering, classification, and clustering.
Big Data Life Cycle …
Step 1: Ingesting Data into the System
• Data is ingested or transferred into Hadoop from various sources such as relational databases, log files, and other systems (for example, with Sqoop and Flume as described above).
Step 2: Processing the Data in Storage
• The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
Step 3: Computing and analysing data
• Pig converts the data using map and reduce operations and then analyses it.
• Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving
real-time or near real-time processing.
• In general, real-time processing is best suited for analyzing smaller chunks of data
that are changing or being added to the system rapidly.
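To see what "map and reduce" means concretely, here is a tiny single-machine sketch of the model that Pig and MapReduce apply at cluster scale; the input lines are made up.

# Tiny single-machine sketch of the map-and-reduce model: map each record to
# (key, value) pairs, then reduce the values that share a key. At cluster
# scale the pairs are shuffled between machines; here everything is local.
from collections import defaultdict

lines = ["big data big value", "data moves fast"]

# Map phase: emit (word, 1) for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle + reduce phase: sum the counts for each word.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'big': 2, 'data': 2, 'value': 1, 'moves': 1, 'fast': 1}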
Step 4: Visualizing the results
• The analysed results are made accessible to end users through reporting, search, and visualization tools.
End of Chapter Two