L2 - Machine Learning Process
Introduction to Python
Python was developed by Guido van Rossum at Stichting Mathematisch
Centrum in the Netherlands.
It was written as the successor to a programming language named ‘ABC’.
Its first version was released in 1991.
The name Python was picked by Guido van Rossum from a TV show
named Monty Python’s Flying Circus.
It is an open source programming language which means that we can
freely download it and use it to develop programs. It can be downloaded
from www.python.org.
The Python programming language combines features of both C and Java.
It offers the elegant coding style of ‘C’ and, like Java, it supports
classes and objects for object-oriented programming.
It is an interpreted language, which means the source code of a Python
program is first converted into bytecode and then executed by the
Python virtual machine.
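The two stages described above can be observed from inside Python itself. The following is a minimal sketch, assuming the standard CPython interpreter; the variable names (source, code_obj, namespace) are chosen only for illustration.

```python
import dis

# A one-line program, held as plain source text.
source = "total = 2 + 3"

# Stage 1: compile() converts the source text into a code object
# that carries the bytecode.
code_obj = compile(source, "<example>", "exec")
raw_bytecode = code_obj.co_code  # the bytecode as a bytes object

# Stage 2: exec() passes the code object to the interpreter
# (the Python virtual machine), which executes it.
namespace = {}
exec(code_obj, namespace)

# dis.dis() prints a readable listing of the bytecode instructions.
# The exact instructions differ between Python versions.
dis.dis(code_obj)
```

After this runs, namespace["total"] holds the result of the program. The instruction listing is best treated as something to explore rather than memorize, since it changes from one Python release to another.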
Why use Python in Machine Learning?
1. Extensive set of packages
Python has an extensive and powerful set of packages which are ready to be used
in various domains. These include packages such as NumPy, SciPy, pandas and
scikit-learn, which are widely used for machine learning and data science.
2. Easy prototyping
Another important feature of Python that makes it the language of choice for data
science is easy and fast prototyping. This feature is useful for developing new
algorithms.
3. Collaboration feature
The field of data science needs good collaboration, and Python provides
many useful tools that make this extremely easy.
4. One language for many domains
A typical data science project includes various domains like data extraction, data
manipulation, data analysis, feature extraction, modeling, evaluation, deployment
and updating the solution. As Python is a multi-purpose language, it allows the
data scientist to address all these domains from a common platform.
Components of Python ML Ecosystem
These are some of the core data science libraries that form the
components of the Python machine learning ecosystem.
1. Jupyter Notebook
Another useful and important Python library for data science and
machine learning is Scikit-learn.
The following are some features of Scikit-learn that make it so useful −
Data wrangling is the process of cleaning and converting raw data into a
usable format. It involves cleaning the data, selecting the variables
to use, and transforming the data into a proper format to make it more suitable
for analysis in the next step. It is one of the most important steps of the
complete process. Cleaning of data is required to address quality issues.
Not all of the data we have collected is necessarily useful to us.
In real-world applications, collected data may
have various issues, including:
Missing Values
Duplicate data
Invalid data
Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can
negatively affect the quality of the outcome.
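As an illustration of the cleaning step, here is a minimal sketch using only the Python standard library. The records, the clean() function and the age limits are hypothetical and chosen for this example. Filling missing values with the median is just one possible strategy, and noise handling is left out.

```python
from statistics import median

# Hypothetical raw records of the form (id, age).
raw_records = [
    (1, 34),
    (2, None),  # missing value
    (3, 29),
    (3, 29),    # duplicate of the previous record
    (4, -5),    # invalid data: an age cannot be negative
    (5, 41),
]


def clean(records, min_age=0, max_age=120):
    """Drop duplicates and invalid ages, then fill missing ages."""
    seen = set()
    kept = []
    for record in records:
        if record in seen:
            continue  # duplicate data
        seen.add(record)
        age = record[1]
        if age is not None and not (min_age <= age <= max_age):
            continue  # invalid data (assumed valid range)
        kept.append(record)

    # Fill missing values with the median of the known ages.
    known = [age for _, age in kept if age is not None]
    fill = median(known) if known else None
    return [(rid, fill if age is None else age) for rid, age in kept]


cleaned = clean(raw_records)
print(cleaned)
```

In practice this kind of work is usually done with a library such as pandas, but the logic is the same: decide what counts as a duplicate, an invalid value and a missing value, then apply one rule for each.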
Data Analysis