ADET - Lesson 2
Emerging Technologies
Clifford Togonon
Today's Topics
1. An Overview of Data Science
2. History of Data Science
3. Data Mining
4. Data Science Hierarchy of Needs
5. Differences between a Data Engineer and a Data Scientist
6. Data Processing Cycle
7. Data Science Value Chain
8. Basic Concepts of Big Data
9. Big Data Life Cycle with Hadoop
Data Science
01
Data science combines multiple fields including statistics, scientific methods, and data analysis to extract value from data.
Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources.
02
Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.
03
Data science is a subset of AI, and it refers more to the overlapping areas of statistics, scientific methods, and data analysis, all of which are used to extract meaning and insights from data.
04
Turning Data into Information
Identifying Trends, Patterns, and Correlations
Analyzing Data to get Insights
Contextualizing, Applying and Understanding them
History of Data Science
From Data Mining to Knowledge Discovery in Databases
He combined:
DATA COLLECTION
To prepare machine learning models, we need to collect data for the required purpose. Each type of problem statement needs a corresponding dataset: the data collected for a particular problem, arranged in a proper format, is known as the dataset.
There are many ways to collect data, such as: online surveys, observation, interviews, case study methods, questionnaires, and Google Forms.
The collected data is not necessarily in the desired format, so we need data preprocessing techniques.
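As a minimal sketch of what "a proper format" can mean, the snippet below gathers a few survey answers into one CSV-formatted dataset. The field names and values are hypothetical and only for illustration; note the empty value in the last row, which is the kind of gap preprocessing has to deal with.

```python
import csv
import io

# Hypothetical survey responses; field names and values are illustrative only.
responses = [
    {"respondent": "r1", "method": "online survey", "hours_online": "5"},
    {"respondent": "r2", "method": "interview", "hours_online": "3"},
    {"respondent": "r3", "method": "questionnaire", "hours_online": ""},
]

FIELDS = ["respondent", "method", "hours_online"]


def to_dataset(rows):
    """Write the collected rows into one CSV-formatted string (the dataset)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dataset_csv = to_dataset(responses)
print(dataset_csv)
```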
DATA PREPROCESSING
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
Data preprocessing techniques include: formatting of data, cleaning of data, sampling of data, and data transformation.
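The techniques named above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the records, the rule of dropping rows with missing values, and the choice of min-max scaling as the transformation are all hypothetical, and a real project would pick each step to suit its data.

```python
import random

# Hypothetical raw records, as they might arrive from a form.
raw = [
    {"name": " Ana ", "age": "21", "score": "88"},
    {"name": "Ben", "age": "", "score": "75"},  # missing age
    {"name": "Cara", "age": "23", "score": "92"},
    {"name": "Dan", "age": "22", "score": "60"},
]


def format_record(rec):
    """Formatting: strip whitespace and convert numeric strings to numbers."""
    return {
        "name": rec["name"].strip(),
        "age": int(rec["age"]) if rec["age"] else None,
        "score": float(rec["score"]),
    }


def clean(records):
    """Cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: add a min-max scaled copy of the score in [0, 1]."""
    scores = [r["score"] for r in records]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid dividing by zero when all scores match
    return [{**r, "score_scaled": (r["score"] - low) / span} for r in records]


def sample(records, k, seed=0):
    """Sampling: pick k records at random, reproducibly for a given seed."""
    return random.Random(seed).sample(records, k)


formatted = [format_record(r) for r in raw]
cleaned = clean(formatted)
scaled = transform(cleaned)
subset = sample(scaled, k=2)
```

Here the record with the missing age is dropped during cleaning, so three records reach the transformation step.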
Big Data
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:
1. Volume
2. Velocity
3. Variety
Volume
Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem, but cheaper storage on platforms like data lakes and Hadoop has eased the burden.
Velocity
With the growth in the Internet of Things, data streams into businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
Variety
Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.
Big Data with Hadoop
Tools to store and analyze data in Data Processing
Apache Hadoop
Apache Hadoop is an open-source, Java-based software framework capable of storing a great amount of data in a cluster. It can process large sets of data in parallel across clusters of computers. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself detects and handles failures at the application layer.
Apache Hadoop offers the following modules:
Hadoop Common: The common utilities that support the other modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for cluster resource management and job scheduling.
Hadoop MapReduce: A system for parallel processing of large data sets (big data).
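To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API: the function names are hypothetical and everything runs in a single process, whereas Hadoop would run the map and reduce tasks on many machines in the cluster. It only shows the three logical steps: map, shuffle (group by key), and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data science"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread across machines, which is what makes the model suitable for large data sets.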
Hadoop Distributed File System (HDFS) is the main storage system of Hadoop. HDFS splits large data sets across several machines so that they can be processed in parallel. It also replicates data within the cluster, which enables high availability of data.
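The splitting and replication described above can be pictured with a toy model. This is a sketch under assumptions, not how HDFS is implemented: the node names, the tiny block size, and the round-robin placement are hypothetical, and real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Split a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """Check that every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)
print(placement)
print(readable_after_failure(placement, "node1"))
```

With three copies of every block on different nodes, losing any single node still leaves each block readable, which is the sense in which replication enables high availability.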
Several Hadoop-related projects at Apache cater to data storage, including:
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse that provides data summarization and ad hoc querying.
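As a rough illustration of "data summarization and ad hoc querying", the sketch below runs an SQL summary query from Python. It uses the standard-library sqlite3 module purely as a stand-in, since Hive itself needs a Hadoop cluster; the table and its rows are hypothetical. Hive exposes a similar SQL-like query language over data stored in Hadoop.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 50.0), ("north", 25.0)],  # hypothetical rows
)

# Ad hoc summarization: number of sales and total amount per region.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(summary)
```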
Thank you!