L2 - Machine Learning Process

The document provides an overview of the machine learning process, including gathering data, data preparation such as exploration and preprocessing, data wrangling to clean the data, data analysis to select techniques and build models, training the model, testing the model, and deploying the trained model. Key steps include collecting and integrating data from various sources, understanding the data characteristics, cleaning missing and invalid values, selecting machine learning algorithms like classification and regression, and evaluating the model's performance on test data.

Uploaded by

Kinya Kageni

Machine Learning Process

Introduction to Python
• Python was developed by Guido van Rossum at Stichting Mathematisch
Centrum in the Netherlands.
• It was written as the successor to a programming language named ‘ABC’.
• Its first version was released in 1991.
• The name Python was picked by Guido van Rossum from a TV show
named Monty Python’s Flying Circus.
• It is an open source programming language, which means that we can
freely download it and use it to develop programs. It can be downloaded
from www.python.org.
• Python combines features of both C and Java: it allows concise, elegant
code in the style of C and, like Java, provides classes and objects for
object-oriented programming.
• It is an interpreted language: the source code of a Python program is
first compiled into bytecode, which is then executed by the Python
virtual machine.
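The bytecode step described above can be observed with the standard library. This is a minimal sketch: the one-line program and the variable names are made up for illustration, and the exact bytecode printed depends on the Python version.

```python
import dis

# A tiny, made-up program used only for illustration.
source = "total = 2 + 3"

# Step 1: the source text is compiled into a code object holding bytecode.
code_obj = compile(source, "<example>", "exec")

# Show the bytecode instructions (the output varies between Python versions).
dis.dis(code_obj)

# Step 2: the Python virtual machine executes the bytecode.
namespace = {}
exec(code_obj, namespace)
print(namespace["total"])
```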
Why use Python in Machine Learning?
1. Extensive set of packages
Python has an extensive and powerful set of packages that are ready to be used
in various domains. These include NumPy, SciPy, pandas, and scikit-learn,
which are needed for machine learning and data science.
2. Easy prototyping
Another important feature that makes Python the language of choice for data
science is easy and fast prototyping. This is useful when developing new
algorithms.
3. Collaboration feature
The field of data science needs good collaboration, and Python provides
many useful tools that make this extremely easy.
4. One language for many domains
A typical data science project covers various domains such as data extraction, data
manipulation, data analysis, feature extraction, modeling, evaluation, deployment,
and updating the solution. As Python is a multi-purpose language, it allows the
data scientist to address all these domains from a common platform.
Components of Python ML Ecosystem

These are some of the core data science libraries that form the
components of the Python machine learning ecosystem.
Jupyter Notebook

Jupyter notebooks provide an interactive computational
environment for developing Python-based data science
applications.
They were formerly known as IPython notebooks.
The following are some of the features that make Jupyter notebooks
one of the best components of the Python ML ecosystem:
1. Jupyter notebooks can illustrate the analysis process step by step
by arranging code, images, text, output, etc. in sequence.
2. They help a data scientist document the thought process while
developing the analysis.
3. The results can be captured as part of the notebook.
4. With the help of Jupyter notebooks, we can also share our work
with peers.
NumPy

It is another useful component that makes Python one
of the favorite languages for Data Science.
Its name stands for Numerical Python, and it is built around
multidimensional array objects.
By using NumPy, we can perform the following important
operations:
1. Mathematical and logical operations on arrays.
2. Fourier transforms.
3. Operations associated with linear algebra.
NumPy is also often seen as a replacement for MATLAB,
because it is mostly used along with SciPy
(Scientific Python) and Matplotlib (a plotting library).
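The three kinds of operations listed above can be sketched in a few lines. The arrays here are made-up values chosen only for illustration.

```python
import numpy as np

# Two small 2-D arrays (made-up values).
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.5, 0.0], [0.0, 0.5]])

# 1. Mathematical and logical operations work element by element.
elementwise_sum = a + b
mask = a > 2.0          # boolean array marking entries greater than 2

# 2. Discrete Fourier transform of a short signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# 3. Linear algebra: matrix product and determinant.
product = a @ b
determinant = np.linalg.det(a)

print(elementwise_sum, mask, spectrum, product, determinant, sep="\n")
```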
Pandas
It is another useful Python library that makes Python one
of the favorite languages for Data Science.
Pandas is mainly used for data manipulation, wrangling,
and analysis.
It was developed by Wes McKinney in 2008.
With the help of Pandas, we can accomplish the following
five steps of data processing:
1. Load
2. Prepare
3. Manipulate
4. Model
5. Analyze
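A minimal sketch of how these steps might look in practice. The CSV text, column names, and the choice of a median fill are assumptions made for illustration; the "Model" step usually hands the prepared table to another library, so here it is reduced to a simple group summary.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file on disk.
csv_text = """name,department,salary
Ann,Sales,3000
Bob,Sales,
Cara,IT,4200
Dan,IT,3800
"""

# 1. Load
df = pd.read_csv(io.StringIO(csv_text))

# 2. Prepare: fill the missing salary with the column median.
df["salary"] = df["salary"].fillna(df["salary"].median())

# 3. Manipulate: add a derived column.
df["salary_k"] = df["salary"] / 1000

# 4./5. Model and analyze: here only a per-department average.
summary = df.groupby("department")["salary"].mean()
print(summary)
```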
Scikit-learn

• Another useful, and arguably the most important, Python library for
data science and machine learning is Scikit-learn.
• The following are some features of Scikit-learn that make it so useful:
1. It is built on NumPy, SciPy, and Matplotlib.
2. It is open source and can be reused under the BSD license.
3. It is accessible to everybody and can be reused in various contexts.
4. A wide range of machine learning algorithms covering the major areas of
ML, such as classification, clustering, regression, dimensionality reduction,
and model selection, can be implemented with its help.
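As a brief illustration of two of the areas named above, the sketch below applies dimensionality reduction (PCA) and clustering (k-means) to the Iris dataset bundled with scikit-learn. The dataset and the parameter values are choices made for this example, not something the slides prescribe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris ships with scikit-learn: 150 flowers, 4 measurements each.
X, _ = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 features onto 2 components.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the projected points into 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape, sorted(set(labels)))
```

Both estimators follow the same pattern of constructing an object and then calling a fit method, which is what lets the library cover many algorithm families in a uniform way.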
Machine Learning Cycle
Gathering Data:
The most important thing in machine learning is to first understand the problem and
its purpose. A good understanding of the problem leads to good results.
In the complete life cycle, to solve a problem we create a machine learning
system called a "model", and this model is created by "training" it. But to train a
model, we need data.
Therefore, data gathering is the first step of the machine learning life cycle. The goal
of this step is to identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the
most important steps of the life cycle.
The quantity and quality of the collected data determine the quality of the
output. In general, the more good-quality data we have, the more accurate the prediction.
This step includes the following tasks:
1. Identify various data sources
2. Collect data
3. Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in the following steps.
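The integration task can be sketched with pandas. Both sources below are simulated in memory, and the table and column names (customers, orders, customer_id) are hypothetical.

```python
import io
import pandas as pd

# Source 1: a CSV file (simulated here with an in-memory string).
customers_csv = io.StringIO("customer_id,name\n1,Ann\n2,Bob\n3,Cara\n")
customers = pd.read_csv(customers_csv)

# Source 2: records as they might arrive from a database or web API.
orders = pd.DataFrame([
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
    {"customer_id": 3, "amount": 12.0},
])

# Integrate: a left join keeps every customer, even those without orders.
dataset = customers.merge(orders, on="customer_id", how="left")
print(dataset)
```

The customer without orders ends up with a missing amount, which is exactly the kind of issue the later cleaning steps have to handle.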
Data preparation
After collecting the data, we need to prepare it for the following steps.
Data preparation is the step where we put our data into a suitable
place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its
ordering.
This step can be further divided into two processes:
1. Data exploration:
It is used to understand the nature of the data we have to work
with. We need to understand the characteristics, format, and
quality of the data.
A better understanding of the data leads to a more effective outcome.
Here we look for correlations, general trends, and outliers.
2. Data pre-processing:
The next step is preprocessing the data for analysis.
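A minimal sketch of randomizing the order and exploring a dataset with pandas. The data is made up, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example, not the only option.

```python
import pandas as pd

# Made-up data; the last score is a deliberate outlier.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score": [52, 55, 61, 64, 70, 74, 79, 150],
})

# Randomize the ordering of the rows (fixed seed for repeatability).
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Characteristics, format, and general trends.
print(shuffled.dtypes)
print(shuffled.describe())

# Correlations between the columns.
print(shuffled.corr())

# Outliers: values further than 1.5 * IQR from the quartiles.
q1 = shuffled["exam_score"].quantile(0.25)
q3 = shuffled["exam_score"].quantile(0.75)
iqr = q3 - q1
is_outlier = (shuffled["exam_score"] < q1 - 1.5 * iqr) | (
    shuffled["exam_score"] > q3 + 1.5 * iqr
)
outliers = shuffled[is_outlier]
print(outliers)
```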
Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a
usable format. It involves cleaning the data, selecting the variables
to use, and transforming the data into a proper format to make it more suitable
for analysis in the next step. It is one of the most important steps of the
complete process. Cleaning the data is required to address quality issues.
Not all of the data we have collected is necessarily useful to us.
In real-world applications, collected data may
have various issues, including:
1. Missing values
2. Duplicate data
3. Invalid data
4. Noise
So, we use various filtering techniques to clean the data.
It is essential to detect and remove the above issues because they can
negatively affect the quality of the outcome.
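The first three issues can be handled with a few pandas operations, sketched below. The data, the rule that an age must be non-negative, and the median fill are assumptions made for illustration; noise usually needs domain-specific treatment and is not covered here.

```python
import numpy as np
import pandas as pd

# Made-up raw data with a duplicate row, a missing age, and an invalid age.
raw = pd.DataFrame({
    "age": [25, 25, np.nan, -4, 41],
    "city": ["Nairobi", "Nairobi", "Mombasa", "Kisumu", "Nakuru"],
})

# Duplicate data: drop rows that repeat an earlier row exactly.
clean = raw.drop_duplicates()

# Invalid data: keep missing ages for now, drop impossible negative ones.
clean = clean[clean["age"].isna() | (clean["age"] >= 0)]

# Missing values: fill the remaining gaps with the median age.
clean = clean.assign(age=clean["age"].fillna(clean["age"].median()))

print(clean)
```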
Data Analysis

Now the cleaned and prepared data is passed on to the
analysis step. This step involves:
1. Selecting analytical techniques
2. Building models
3. Reviewing the result
The aim of this step is to build a machine learning model
that analyzes the data using various analytical techniques, and
to review the outcome. It starts with determining the
type of problem, where we select a machine
learning technique such
as classification, regression, cluster
analysis, or association. We then build the model using the
prepared data and evaluate it.
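One way to carry out the select, build, and review loop is to compare a few candidate models with cross-validation. The sketch below assumes a classification problem and uses the Iris dataset bundled with scikit-learn; the two candidate models are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# The target is a category, so this is treated as a classification problem.
X, y = load_iris(return_X_y=True)

# 1. Select candidate techniques.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# 2. Build each model and 3. review its mean 5-fold accuracy.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

best_name = max(scores, key=scores.get)
print(scores, best_name)
```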
Train Model

The next step is to train the model. In
this step we train our model to improve its
performance and get a better outcome for the
problem.
We use datasets to train the model using
various machine learning algorithms.
Training is required so that the model can
learn the various patterns, rules,
and features in the data.
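A minimal training sketch with scikit-learn. The Iris dataset, the 80/20 split, and the decision tree with a depth limit of 3 are all assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold back 20% of the data so it stays unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Training: the model learns patterns from the training portion only.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

print(model.get_depth(), model.get_n_leaves())
```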
Test Model

Once our machine learning model has
been trained on a given dataset, we
test it. In this step, we check
the accuracy of our model by providing a
test dataset to it.
Testing determines the
percentage accuracy of the model, which is judged against
the requirements of the project or problem.
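The accuracy check can be sketched as follows. The dataset, the split, and the k-nearest-neighbours classifier are assumptions for illustration; the exact accuracy depends on the split, so no specific figure is claimed here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train on the training portion only.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Test: predict labels for data the model has never seen.
y_pred = model.predict(X_test)

# Accuracy is the fraction of test samples predicted correctly.
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.1%}")
```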
Deployment

The last step of the machine learning life cycle is
deployment, where we deploy the model in a
real-world system.
If the model prepared above produces
accurate results that meet our requirements at an
acceptable speed, we deploy it in
the real system. Before deploying,
we check whether the model's performance is
improving on the available data. The
deployment phase is similar to making the final
report for a project.
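One simple form of deployment is to save the trained model to a file and load it again inside the system that serves predictions. The sketch below uses Python's pickle module for this; the file name model.pkl, the temporary directory, and the sample measurements are hypothetical, and a pickled model should only be loaded from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the trained model (hypothetical file name in a temporary directory).
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# In the real system: load the model and predict on incoming data.
with open(model_path, "rb") as f:
    served_model = pickle.load(f)

new_sample = [[5.1, 3.5, 1.4, 0.2]]  # one made-up flower measurement
prediction = served_model.predict(new_sample)
print(prediction)
```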
