ADET - Lesson 2

The document discusses key concepts in data science including data mining, data analysis, data preprocessing, and the data processing cycle. It explains the differences between a data engineer and data scientist and outlines the basic concepts of big data, including the three V's of volume, velocity, and variety. The data science value chain and big data lifecycle with Hadoop are also briefly covered.

Uploaded by

Angelo Bersaba
Application Development and Emerging Technologies
Clifford Togonon
Today's Topic
1. An Overview of Data Science
2. History of Data Science
3. Data Mining
4. Data Science Hierarchy of Needs
5. Differences between a Data Engineer and a Data Scientist
6. Data Processing Cycle
7. Data Science Value Chain
8. Basic Concepts of Big Data
9. Big Data Life Cycle with Hadoop
Data Science
01. Data science combines multiple fields, including statistics, scientific methods, and data analysis, to extract value from data. Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources.
02. Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.
03. Data science is a subset of AI; it refers to the overlapping areas of statistics, scientific methods, and data analysis, all of which are used to extract meaning and insights from data.
04. Turning data into information: identifying trends, patterns, and correlations; analyzing data to get insights; and contextualizing, applying, and understanding them.
History of Data Science
From Data Mining to Knowledge Discovery in Databases

In 2001, William S. Cleveland wanted to bring data mining to another level. He combined:

COMPUTER SCIENCE + DATA MINING = DATA SCIENCE


WHAT IS DATA MINING?
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, including the discovery of previously unknown interesting patterns, unusual records, or dependencies.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process.
DATA ANALYSIS
Data analysis refers to the human activities aimed at gaining insight from a dataset. An analyst can use data analytics tools to obtain the desired results, but in principle, data analysis can be performed without special data processing. For example, a Forex trader can rely on his or her experience to decide when to open or close a trading position.

There are three types of data analysis:

Predictive (forecasting): Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable future outcome of an event or the likelihood of a situation occurring.

Descriptive (business intelligence and data mining): Descriptive analytics looks at data and analyzes past events for insight into how to approach the future. It examines past performance and explains it by mining historical data for the reasons behind past success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of post-mortem analysis.

Prescriptive (optimization and simulation): Prescriptive analytics automatically synthesizes big data, mathematical sciences, business rules, and machine learning to make predictions, and then suggests decision options to take advantage of those predictions.
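To make the three types concrete, here is a minimal Python sketch on a hypothetical monthly sales series: descriptive statistics summarize the past, a simple linear trend forecasts the next month (predictive), and a stocking decision is chosen from the forecast (prescriptive). The sales figures and stocking options are invented for illustration.

```python
# Hypothetical monthly sales figures, used to contrast the three analysis types.
sales = [100, 110, 125, 135, 150, 160]

# Descriptive: summarize past performance.
average = sum(sales) / len(sales)
growth = sales[-1] - sales[0]

# Predictive: fit a simple linear trend (least squares) and forecast next month.
n = len(sales)
xs = range(n)
mean_x = sum(xs) / n
slope = (sum((x - mean_x) * (y - average) for x, y in zip(xs, sales))
         / sum((x - mean_x) ** 2 for x in xs))
forecast = sales[-1] + slope  # next-month estimate

# Prescriptive: choose the smallest stocking level that covers the forecast.
stock_options = [140, 160, 180]
decision = min(o for o in stock_options if o >= forecast)
```

The point is the progression, not the model: descriptive answers "what happened", predictive answers "what is likely next", and prescriptive turns that prediction into a suggested action.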
How Data Mining Can Help a Business Improve Competitiveness

Sales forecasting: analyzing when customers bought to predict when they will buy again

Database marketing: examining customer purchasing patterns and looking at the demographics and psychographics of customers to build predictive profiles

Market segmentation: a classic use of data mining, using data to break a market down into meaningful segments like age, income, occupation, or gender

E-commerce basket analysis: using mined data to predict future customer behavior from past performance, including purchases and preferences
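Basket analysis of the kind described above can be sketched with plain co-occurrence counting. The transaction log below is invented for illustration; production systems use association-rule algorithms such as Apriori over far larger logs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log; each inner list is one customer's basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair suggests a cross-selling opportunity.
top_pair, count = pair_counts.most_common(1)[0]
```

A retailer might use `top_pair` to place the two items near each other or to trigger a "frequently bought together" recommendation.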
DATA PREPROCESSING
Data preprocessing is part of the data preparation process and a data mining technique used to transform raw data into a useful and efficient format.

DATA COLLECTION
To build machine learning models, we need to collect data for the required purpose. For a particular type of problem statement, we should have a corresponding dataset. The data collected for a particular problem, arranged in a proper format, is known as the dataset.

There are many ways to collect data, such as: ONLINE SURVEYS, OBSERVATION, INTERVIEWS, CASE STUDY METHODS, QUESTIONNAIRES, GOOGLE FORMS

The collected data is not necessarily in the desired format, which is why we need data preprocessing techniques.

DATA PREPROCESSING
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.

Data preprocessing techniques include: formatting of data, cleaning of data, and sampling of data.
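A minimal sketch of those three techniques (formatting, cleaning, sampling) on a small invented survey dataset; real pipelines would typically use a library such as pandas, but the steps are the same.

```python
import random

# Hypothetical raw survey rows: inconsistent formatting, a missing age, a duplicate.
raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": None},
    {"name": "  alice ", "age": "34"},
    {"name": "carol", "age": "29"},
]

# Formatting: normalize names and convert age strings to numbers.
formatted = [
    {"name": r["name"].strip().title(),
     "age": int(r["age"]) if r["age"] is not None else None}
    for r in raw
]

# Cleaning: drop duplicates, then fill the missing age with the mean of known ages.
seen, cleaned = set(), []
for r in formatted:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        cleaned.append(r)
known = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in cleaned:
    if r["age"] is None:
        r["age"] = mean_age

# Sampling: take a reproducible random subset for quick experimentation.
random.seed(0)
sample = random.sample(cleaned, 2)
```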
Data Transformation

Data aggregation is a type of data and information mining process in which data is searched, gathered, and presented in a report-based, summarized format to achieve specific business objectives or processes and/or to support human analysis. Data aggregation may be performed manually or through specialized software.

DATA CLEANING TASKS
a. Data acquisition and metadata
b. Fill in missing values
c. Unify date formats
d. Convert nominal values to numeric
e. Identify outliers and smooth out noisy data
f. Correct inconsistent data
Simple ways of cleaning the data
• TRIM: Remove extra spaces
• PROPER: Makes first letter in each word uppercase
• CLEAN: Removes all non-printable characters from text
• VALUE: Converts text to number
• TEXT: Converts number or text to new format
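The spreadsheet-style helpers listed above have rough Python equivalents, sketched below for readers working outside a spreadsheet; these are approximations, not exact re-implementations of the spreadsheet functions' behavior.

```python
def trim(s):
    # TRIM: collapse runs of whitespace and strip the ends.
    return " ".join(s.split())

def proper(s):
    # PROPER: capitalize the first letter of each word.
    return s.title()

def clean(s):
    # CLEAN: drop non-printable characters from text.
    return "".join(ch for ch in s if ch.isprintable())

def value(s):
    # VALUE: convert text to a number.
    return float(s)

def text(n, decimals=2):
    # TEXT: format a number as a string with a fixed number of decimals.
    return f"{n:.{decimals}f}"
```

For example, `trim("  hello   world ")` yields `"hello world"` and `proper("data science")` yields `"Data Science"`.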

Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models. Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

After cleaning and properly formatting the data, we need to scale it. Scaling bounds all features of a dataset within a fixed range, and this range is the same for every feature; this can improve the accuracy of a machine learning model by a great margin.
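Scaling as described above can be sketched with min-max normalization, which maps every feature into the same fixed range; the feature values below are invented for illustration.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Rescale a feature so every value falls in [low, high]."""
    lo, hi = min(values), max(values)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# Two features on wildly different scales...
ages = [18, 30, 45, 60]
incomes = [20_000, 55_000, 120_000, 250_000]

# ...both end up in the same [0, 1] range after scaling.
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```

Without this step, a distance-based model would let income dominate age simply because its raw numbers are larger.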
Data Science Hierarchy of Needs
Differences between a Data Engineer and a Data Scientist
What is the Data Processing Cycle?

The data processing cycle is the process of converting data into meaningful information. Information is processed, organized, or classified data that is useful to the receiver. It may be used "as it is" or may be put to use along with more data or information. The receiver of information takes actions and makes decisions based on the information received. Collected data must be processed to get meaning out of it, and this meaning is obtained in the form of information. Information processing appears as part of other topics such as information processing theory, information science, information technology, data science, and statistics.
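The cycle can be sketched end to end in a few lines: raw collected data goes in, processing organizes and summarizes it, and meaningful information comes out for the receiver. The readings are invented for illustration.

```python
# Input: raw data as collected (here, sensor readings captured as strings).
raw_data = ["12", "15", "11", "14"]

# Processing: convert, organize, and derive meaning from the data.
readings = [int(x) for x in raw_data]
average = sum(readings) / len(readings)

# Output: information the receiver can act on.
information = f"Average reading: {average}"
```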
Data Science Value Chain

Big Data
The term "big data" refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s, when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V's:
1. Volume
2. Velocity
3. Variety
Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media, and more. In the past, storing it would have been a problem – but cheaper storage on platforms like data lakes and Hadoop has eased the burden.

Velocity: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors, and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data, and financial transactions.
Big Data with Hadoop
Tools to store and analyze data in the data processing cycle

Apache Hadoop
Apache Hadoop is an open-source, Java-based software framework capable of storing a great amount of data in a cluster. It can process large sets of data in parallel across clusters of computers. The concept is to scale out from a single server to several thousand machines, each with the capability to perform local computation and provide storage. It eliminates the dependency on hardware for delivering high availability: the detection and handling of failures happen in the library at the application layer.

Apache Hadoop offers the following modules:
Hadoop Common: This module consists of the utilities that support the other modules.
Hadoop Distributed File System (HDFS): High-throughput access to application data is provided by Hadoop's distributed file system.
Hadoop YARN: Cluster resource management and job scheduling are achieved by this framework.
Big Data with Hadoop
Tools to store and analyze data in the data processing cycle

Apache Hadoop
Hadoop MapReduce: This module provides parallel processing of large sets of data, or big data.

The Hadoop Distributed File System (HDFS) is the main storage system of Hadoop. HDFS splits large data sets across several machines so they can be processed in parallel. HDFS also replicates data within a cluster, enabling high availability of the data.

Several Hadoop-related projects at Apache cater to data storage, including:
Cassandra™: This scalable multi-master database has no single point of failure.
Chukwa™: Large distributed systems require management, which is achieved by a data collection system called Chukwa.
HBase™: HBase provides structured data storage through a distributed and scalable database for large tables.
Hive™: Hive is a data warehouse that provides data summarization and ad hoc querying.
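The MapReduce model mentioned above can be illustrated in pure Python: mappers emit key-value pairs, a shuffle phase groups them by key, and reducers aggregate each group. This is a single-machine simulation of what Hadoop distributes across a cluster, using an invented word-count job; it is not the Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: aggregate all counts emitted for one key.
    return word, sum(counts)

lines = ["Big data with Hadoop", "Hadoop splits big data"]

# Shuffle phase: group mapper output by key (Hadoop does this across machines).
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

# Reduce phase: one reducer call per key.
word_counts = dict(reducer(w, c) for w, c in groups.items())
```

Because each mapper works on its own lines and each reducer on its own key, both phases parallelize naturally, which is exactly what lets Hadoop scale the same pattern to clusters of thousands of machines.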

Thank you!
