L2 - Machine Learning Process

The document provides an overview of the machine learning process, including gathering data, data preparation such as exploration and preprocessing, data wrangling to clean the data, data analysis to select techniques and build models, training the model, testing the model, and deploying the trained model. Key steps include collecting and integrating data from various sources, understanding the data characteristics, cleaning missing and invalid values, selecting machine learning algorithms like classification and regression, and evaluating the model's performance on test data.

Uploaded by

Kinya Kageni
Copyright
© All Rights Reserved

Machine Learning Process

Introduction to Python
• Python was developed by Guido van Rossum at Stichting Mathematisch
Centrum in the Netherlands.
• It was written as the successor to a programming language named ‘ABC’.
• Its first version was released in 1991.
• Guido van Rossum took the name Python from a TV show
named Monty Python’s Flying Circus.
• It is an open source programming language, which means that we can
freely download it and use it to develop programs. It can be downloaded
from www.python.org.
• Python combines features of both C and Java. It allows concise,
C-style procedural code, and it also provides classes and objects,
as Java does, for object-oriented programming.
• It is an interpreted language: the source code of a Python
program is first compiled into bytecode, which is then executed by the
Python virtual machine.
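The bytecode step can be observed directly. The following is a minimal sketch, assuming the standard CPython interpreter; the function name add and the sample source string are made up for illustration, and the exact instructions printed differ between Python versions.

```python
import dis


def add(a, b):
    return a + b


# compile() turns source text into a code object that holds bytecode.
code = compile("x = 1 + 2", "<example>", "exec")
print(type(code.co_code))  # the raw bytecode is a bytes object

# The virtual machine executes the code object; here it binds x in namespace.
namespace = {}
exec(code, namespace)
print(namespace["x"])

# dis.dis() prints a human-readable listing of a function's bytecode.
dis.dis(add)
```

The listing printed by dis.dis is what the Python virtual machine actually runs, not the original source text.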
Why use python in Machine Learning?
1. Extensive set of packages
Python has an extensive and powerful set of packages that are ready to be used
in various domains. These include NumPy, SciPy, pandas, and scikit-learn,
which are widely used in machine learning and data science.
2. Easy prototyping
Another important feature of Python that makes it the language of choice for data
science is easy and fast prototyping. This feature is useful for developing new
algorithms.
3. Collaboration feature
Data science needs good collaboration, and Python provides
many useful tools that make this extremely easy.
4. One language for many domains
A typical data science project includes various domains like data extraction, data
manipulation, data analysis, feature extraction, modeling, evaluation, deployment
and updating the solution. As Python is a multi-purpose language, it allows the
data scientist to address all these domains from a common platform.
Components of Python ML Ecosystem

These are some of the core Data Science libraries that form the
components of Python Machine learning ecosystem.
Jupyter Notebook
Jupyter notebooks provide an interactive computational
environment for developing Python-based Data Science
applications.
They were formerly known as IPython notebooks.
The following features make Jupyter notebooks one of the most
useful components of the Python ML ecosystem:
1. Jupyter notebooks can illustrate the analysis process step by step
by arranging code, images, text, output, etc. in the order in which
they are used.
2. They help a data scientist document the thought process while
developing the analysis.
3. The results can be captured as part of the notebook.
4. With Jupyter notebooks, we can also share our work with
peers.
NumPy

It is another useful component that makes Python one
of the favorite languages for Data Science.
The name stands for Numerical Python, and the library is built around
multidimensional array objects.
By using NumPy, we can perform the following important
operations:
1. Mathematical and logical operations on arrays.
2. Fourier transforms.
3. Operations associated with linear algebra.
NumPy can also be seen as an alternative to MATLAB,
because it is mostly used along with SciPy
(Scientific Python) and Matplotlib (a plotting library).
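The three kinds of operation listed above can be sketched in a few lines. This is only an illustration with made-up arrays; the variable names are assumptions, not part of the slides.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[0.5, 0.5], [0.5, 0.5]])

# 1. Mathematical and logical operations work element by element.
total = a + b
mask = a > 2.0  # boolean array with the same shape as a

# 2. Fourier transform of a short, made-up signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# 3. Linear algebra: solve the system a @ x = y for x.
y = np.array([5.0, 11.0])
x = np.linalg.solve(a, y)

print(total)
print(mask)
print(spectrum)
print(x)
```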
Pandas
It is another useful library that makes Python one
of the favorite languages for Data Science.
Pandas is used for data manipulation, wrangling,
and analysis.
It was developed by Wes McKinney in 2008.
With the help of Pandas, we can accomplish the following
five steps of data processing:
1. Load
2. Prepare
3. Manipulate
4. Model
5. Analyze
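A rough sketch of the Load, Prepare, Manipulate, and Analyze steps is given below. The CSV content is invented toy data, and the column names city, temp_c, and temp_f are assumptions made for illustration. The Model step is left out here, on the assumption that it is usually handed to a modeling library such as scikit-learn.

```python
import io

import pandas as pd

# Toy data standing in for a CSV file on disk (values are made up).
csv_text = """city,temp_c
Nairobi,24
Mombasa,30
Nairobi,26
Mombasa,
"""

# Load: read tabular data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Prepare: drop the row whose temperature is missing.
df = df.dropna(subset=["temp_c"])

# Manipulate: derive a new column from an existing one.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Analyze: average temperature per city.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)
```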
Scikit-learn

• Scikit-learn is another useful, and arguably the most important, Python
library for Data Science and machine learning.
• The following features make Scikit-learn so useful:
1. It is built on NumPy, SciPy, and Matplotlib.
2. It is open source and can be reused under the BSD license.
3. It is accessible to everybody and can be reused in various contexts.
4. It implements a wide range of machine learning algorithms covering major
areas of ML, such as classification, clustering, regression, dimensionality
reduction, and model selection.
Machine Learning Cycle
Gathering Data
The most important thing in machine learning is to first understand the problem and
its purpose. A well-understood problem leads to better results.
To solve a problem, we create a machine learning system called a "model", and this
model is created by "training" it. But to train a model, we need data.
Therefore, data gathering is the first step of the machine learning life cycle. The goal
of this step is to identify and obtain all the data relevant to the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the
most important steps of the life cycle.
The quantity and quality of the collected data determine the quality of the
output. In general, the more good-quality data we have, the more accurate the prediction will be.
This step includes the tasks below:
1. Identify various data sources
2. Collect data
3. Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It
will be used in further steps.
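Integrating data from two sources into one dataset might look like the sketch below. It assumes pandas is used for the integration; the two sources, the customer_id key, and all values are hypothetical.

```python
import io

import pandas as pd

# Source 1: a CSV file (simulated here with an in-memory string).
customers = pd.read_csv(io.StringIO("customer_id,name\n1,Ann\n2,Ben\n3,Cy\n"))

# Source 2: records as they might come back from a database or web API.
orders = pd.DataFrame(
    [
        {"customer_id": 1, "amount": 20.0},
        {"customer_id": 1, "amount": 5.0},
        {"customer_id": 3, "amount": 12.5},
    ]
)

# Integrate: join on the shared key to form a single dataset.
# how="left" keeps customers that have no orders (their amount is NaN).
dataset = customers.merge(orders, on="customer_id", how="left")
print(dataset)
```

A left join is used so that no customer is silently dropped; the missing amounts it produces are the kind of issue the later cleaning steps deal with.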
Data preparation
After collecting the data, we need to prepare it for further steps.
Data preparation is the step where we put our data into a suitable
place and prepare it for use in machine learning training.
In this step, we first put all the data together, and then randomize the
ordering of the data.
This step can be further divided into two processes:
1. Data exploration:
It is used to understand the nature of the data that we have to work
with. We need to understand the characteristics, format, and
quality of the data.
A better understanding of the data leads to a more effective outcome. In
this process, we look for correlations, general trends, and outliers.
2. Data pre-processing:
The next step is preprocessing the data for its analysis.
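Randomizing the row order and a first round of exploration can be sketched as follows. The data frame holds made-up values, and the interquartile-range rule used to flag outliers is one common choice assumed here, not a method prescribed by the slides.

```python
import pandas as pd

# Toy data: the last score looks suspicious on purpose.
df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5, 6],
        "score": [52, 55, 61, 64, 70, 20],
    }
)

# Randomize the ordering; random_state makes the shuffle repeatable.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Exploration: summary statistics and pairwise correlations.
summary = df.describe()
corr = df.corr()

# Simple outlier check based on the interquartile range (IQR).
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(corr)
print(outliers)
```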
Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a
usable format. It involves cleaning the data, selecting the variables
to use, and transforming the data into a proper format to make it more suitable
for analysis in the next step. It is one of the most important steps of the
complete process. Cleaning the data is required to address quality issues.
Not all of the data we have collected is necessarily useful to us.
In real-world applications, collected data may
have various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can
negatively affect the quality of the outcome.
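Three of the issues listed above (duplicates, invalid values, and missing values) can be handled as in the sketch below. It assumes pandas, toy values, and two particular policies: negative ages are treated as invalid, and missing values are filled with the column median. Other policies are equally possible, and noise is not addressed here.

```python
import numpy as np
import pandas as pd

# Toy raw data with a duplicate row, a negative age, and missing values.
raw = pd.DataFrame(
    {
        "age": [25, 25, np.nan, -4, 40],
        "income": [3000.0, 3000.0, 4200.0, 3900.0, np.nan],
    }
)

# Duplicate data: keep only the first copy of identical rows.
clean = raw.drop_duplicates()

# Invalid data: drop rows with a negative age (missing ages are kept for now).
clean = clean[clean["age"].isna() | (clean["age"] >= 0)]

# Missing values: fill each column with its median.
clean = clean.fillna(clean.median())

clean = clean.reset_index(drop=True)
print(clean)
```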
Data Analysis

The cleaned and prepared data is now passed on to the
analysis step. This step involves:
1. Selecting analytical techniques
2. Building models
3. Reviewing the result
The aim of this step is to build a machine learning model
to analyze the data using various analytical techniques and
review the outcome. It starts with determining the
type of problem, for which we select a machine
learning technique such as classification, regression, cluster
analysis, or association. We then build the model using the
prepared data and evaluate it.
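Choosing a technique from the problem type and then building a model can be sketched as below. The helper pick_model is hypothetical, and the particular scikit-learn estimators it returns are assumptions chosen for illustration. Association analysis is omitted because scikit-learn does not provide an estimator for it.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression


def pick_model(problem_type):
    """Return an untrained estimator for a problem type (hypothetical helper)."""
    if problem_type == "classification":  # predict a category
        return LogisticRegression()
    if problem_type == "regression":  # predict a number
        return LinearRegression()
    if problem_type == "clustering":  # group unlabeled data
        return KMeans(n_clusters=2, n_init=10, random_state=0)
    raise ValueError(f"unsupported problem type: {problem_type}")


# Toy regression problem: y is exactly 2 * x.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

# Build the model on the prepared data, then review its output.
model = pick_model("regression").fit(X, y)
prediction = model.predict([[5.0]])[0]
print(prediction)
```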
Train Model

The next step is to train the model. In
this step, we train our model to improve its
performance and so get a better outcome for the
problem.
We use datasets to train the model with
various machine learning algorithms.
Training a model is required so that it can
learn the various patterns, rules,
and features in the data.
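As an illustration of a model learning rules from data, the sketch below trains a decision tree with scikit-learn. The decision tree is only one possible algorithm, and the training set and the feature names hours_studied and hours_slept are invented.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set: [hours_studied, hours_slept] -> failed (0) or passed (1).
X_train = [[1, 4], [2, 8], [3, 5], [6, 7], [7, 5], [8, 8]]
y_train = [0, 0, 0, 1, 1, 1]

# Training: fit() lets the model derive decision rules from the examples.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Show the rules the tree learned as plain text.
rules = export_text(model, feature_names=["hours_studied", "hours_slept"])
print(rules)
```

In this toy data only the hours studied separate the two classes cleanly, so that is the feature the printed rule relies on.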
Test Model

Once our machine learning model has
been trained on a given dataset, we
test the model. In this step, we check
the accuracy of our model by providing a
test dataset to it.
Testing the model determines its
percentage accuracy, which is compared against
the requirements of the project or problem.
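Measuring accuracy on a held-out test dataset can be sketched as follows. It assumes scikit-learn and its bundled Iris dataset; the k-nearest-neighbors classifier and the 70/30 split are illustrative choices, not something the slides specify.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy is the fraction of test rows that are predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.1%}")
```

Keeping the test rows out of training is what makes the reported accuracy an estimate of performance on unseen data.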
Deployment

The last step of the machine learning life cycle is
deployment, where we deploy the model in a
real-world system.
If the model prepared above produces
accurate results that meet our requirements at an
acceptable speed, then we deploy the model in
the real system. But before deploying the project,
we check whether its performance is still improving
with the available data. The
deployment phase is similar to making the final
report for a project.
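One simple way to hand a trained model to a real system is to serialize it and load it again inside the serving application. The sketch below assumes Python's pickle module and a toy scikit-learn model; real deployments often add versioning, monitoring, and an API layer that are not shown.

```python
import pickle

from sklearn.linear_model import LinearRegression

# Train a toy model on made-up data where y = 2 * x + 1.
model = LinearRegression().fit([[1.0], [2.0], [3.0]], [3.0, 5.0, 7.0])

# Serialize the trained model to bytes; in practice these bytes would be
# written to a file or an artifact store.
blob = pickle.dumps(model)

# Later, inside the serving application: restore the model and predict.
served_model = pickle.loads(blob)
prediction = served_model.predict([[4.0]])[0]
print(prediction)
```

Pickled data should only be loaded from trusted sources, because unpickling can execute arbitrary code.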
