A Good Beginner Project With Logistic Regression by Jacob Toftgaard Rasmussen - Fragment

Uploaded by

Somya Sharma

A Good Beginner Project With Logistic Regression
An easy-to-understand guide for the “hello world” project in machine learning
and data science

Jacob Toftgaard Rasmussen

Published in The Startup
8 min read · Aug 7, 2020


Welcome to this friendly beginner’s guide to creating a logistic regression model for
classification in Python!

With this guide I want to give you an easy way to complete your first data
science project by creating a logistic regression machine learning model
used for classification. I will explain all the steps thoroughly to make sure
that you know what is going on. We will be working with the very famous
iris dataset.

The article consists of the following steps:

Introduction to the data

Binary- vs multiclass classification (an explanation)

Importing the data

Splitting the data into training and test data

Scaling the training and test data

Building and training the logistic regression for classification model

Evaluating the results

Conclusion and bye byes

I will be showing all the code, and I will also provide a thorough explanation
at each step. Sounds good?
- Let’s do it!

Introduction to the data

The data can be found here. You will be directed to a Kaggle site (see photo
below; I have highlighted the download button in yellow). After downloading
the file, I renamed it to dataset_iris.csv.

https://medium.com/swlh/a-good-beginner-project-with-logistic-regression-60b78e38484e 10/25/23, 2 02 AM
Page 3 of 24
:
The dataset consists of 150 entries of data about iris plants. There are exactly
50 of each type of iris described in the data. The data points for each flower
are:

1. SepalLengthCm

2. SepalWidthCm

3. PetalLengthCm

4. PetalWidthCm

5. Species

In total we have 4 variables (1 to 4) that have an influence on which species of
plant a given plant might be. In machine learning terms these variables are
called features. The final variable (5) is called a label, and it is the
label/species that we would like to predict.

These are the first 20 rows of the data, so you can familiarize yourself with it.

The first 20 rows of data

Binary- vs multiclass classification

For this project you will be creating a logistic regression model for
classification. To make the logistic regression model work as a classification
model we have to apply a small trick (don’t worry, it won’t be difficult).
Classification problems can be divided into two categories: binary and
multiclass. The first is a classification problem where the outcome is limited
to two different classes, such as a yes-or-no question: will the customer buy
the product or not? The latter is a classification problem where the outcome
can be one of more than 2 classes, as in this project, where we have 3
different classes of plants.

Out of the box, logistic regression is a binary classifier, and there are
other models that handle multiclass problems naturally. However, the logistic
model is very common and a good model to get familiar with, which is why I
have chosen to implement it here anyway. And it does indeed work; we just
have to provide the model with a small power-up.
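
One way to picture that power-up is to split the single three-class question into three yes/no questions, one per species. Below is a minimal sketch in plain Python. It assumes the species names used in the Kaggle file, and the helper name to_binary_labels is made up for this illustration.

```python
# Hypothetical helper: turn multiclass labels into one binary problem per class.
def to_binary_labels(labels, positive_class):
    """Return 1 where the label equals positive_class, otherwise 0."""
    return [1 if label == positive_class else 0 for label in labels]


# A few illustrative labels (not the real dataset).
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# One "is it this species or not?" problem for each distinct class.
binary_problems = {
    cls: to_binary_labels(species, cls) for cls in sorted(set(species))
}

for cls, targets in binary_problems.items():
    print(cls, targets)
```

A binary model can then be trained on each of the three lists, and the species whose model is most confident wins.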

Alright, enough talking already, let’s code!

Each code snippet will be followed by an explanation of what just
happened.

Importing the data

import pandas as pd
dataset = pd.read_csv('dataset_iris.csv')
dataset.head()

Output

We need the pandas library to get access to the data, and we import it under
the alias pd.

pd (pandas) has a function called read_csv() which allows us to read data
from a comma-separated file. It returns a DataFrame, which we save
to a variable that we call dataset.

A pandas DataFrame has a method called head() which displays the first
5 entries in the DataFrame (see the output photo).
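
If you want to try read_csv() and head() before downloading anything, the sketch below reads a tiny CSV from an in-memory string. The six rows are illustrative stand-ins, and the column layout (an Id column, the four measurements, then Species) is an assumption about the Kaggle file.

```python
import io

import pandas as pd

# Illustrative stand-in for dataset_iris.csv (assumed column layout).
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,7.0,3.2,4.7,1.4,Iris-versicolor
4,6.4,3.2,4.5,1.5,Iris-versicolor
5,6.3,3.3,6.0,2.5,Iris-virginica
6,5.8,2.7,5.1,1.9,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())                     # first 5 rows
print(sample.shape)                      # (rows, columns)
print(sample["Species"].value_counts())  # number of rows per species
```

On the real file, the same shape and value_counts() calls are a quick way to confirm the 150 rows and the 50 flowers per species.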

Now we have all the data saved in a DataFrame. Next we will divide the
data into features and labels (remember, the features are the independent
variables that influence the dependent variable, called the label).

x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
print(x[:10])
print(y[:10])

Output

We can access different rows and columns of a DataFrame by using iloc
followed by the index of the rows and columns we want, and finally
appending .values to get the actual values. We save this to variables x and
y.

x now contains 150 rows of features. We see the first 10 rows in the output
photo.

y now contains the 150 corresponding species/labels. We see the first 10
labels in the output photo.
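
The slice 1:-1 deserves a closer look: it starts at column 1 and stops before the last column, so it skips both the first column and the label. Here is a small sketch, assuming (as the slice suggests) that the first column of the file is a row Id. The two rows are illustrative.

```python
import pandas as pd

# Two illustrative rows with the assumed column layout.
frame = pd.DataFrame(
    {
        "Id": [1, 2],
        "SepalLengthCm": [5.1, 7.0],
        "SepalWidthCm": [3.5, 3.2],
        "PetalLengthCm": [1.4, 4.7],
        "PetalWidthCm": [0.2, 1.4],
        "Species": ["Iris-setosa", "Iris-versicolor"],
    }
)

# All rows, columns 1 up to (but not including) the last: the 4 features.
features = frame.iloc[:, 1:-1].values
# All rows, last column only: the label.
labels = frame.iloc[:, -1].values

print(features.shape)  # 2 rows, 4 feature columns
print(labels)
```

Leaving the Id out matters: it is just a row counter, and a model could otherwise learn to rely on it instead of the measurements.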

We have two more steps to complete before we are ready to create and train
our model. Let’s do those now!

Splitting the data into training and test data

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size = 0.2, random_state = 0)

From the scikit-learn library (imported as sklearn; you will meet it many
times!) and its model_selection module we import train_test_split.

train_test_split allows us to split our data into training data and test data.
Here we pass it 4 parameters: the features (x), the labels (y), test_size (what
fraction of the data should become test data) and finally random_state
(put any number here, but use 0 to get the same results as me).
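
To see what test_size and random_state do in practice, here is a minimal sketch on placeholder data with the same shape as the iris features and labels. The arrays x_demo and y_demo are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the iris features and labels.
x_demo = np.arange(150 * 4).reshape(150, 4)
y_demo = np.repeat(["a", "b", "c"], 50)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(len(x_tr), len(x_te))  # 120 training rows, 30 test rows

# Same random_state: the exact same split every time.
x_tr2, x_te2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(x_te, x_te2))
```

With 150 rows, test_size = 0.2 sets aside 30 rows for testing and leaves 120 for training.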

Why do we split the data?

When we train a model with data it will get familiar with that data and be
able to make very good, often near-perfect, predictions when it is introduced
to the same data again. It won’t be a lot of fun to only use the model on data
it has already seen, as we do not want to predict things we already know!
Therefore, we split the data and only introduce the test data to the model
when it is time to test our model’s performance.
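
An extreme case makes the point. The sketch below uses a toy "model" that does nothing but memorize its training examples. The measurements and the fallback guess are illustrative assumptions, not part of the project code.

```python
# A toy "model" that simply memorizes every training example.
train = {(5.1, 3.5): "Iris-setosa", (7.0, 3.2): "Iris-versicolor"}
unseen = {(4.9, 3.0): "Iris-setosa", (6.4, 3.2): "Iris-versicolor"}


def memorizing_model(features):
    """Look the features up; fall back to a fixed guess if never seen."""
    return train.get(features, "Iris-setosa")


def accuracy(examples):
    """Fraction of examples for which the model returns the right label."""
    hits = sum(memorizing_model(f) == label for f, label in examples.items())
    return hits / len(examples)


train_accuracy = accuracy(train)    # scored on data it has memorized
unseen_accuracy = accuracy(unseen)  # scored on data it has never seen
print(train_accuracy, unseen_accuracy)
```

The score on the memorized data says nothing about how the model copes with new flowers, which is exactly what the held-out test data measures.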

Now to the final step before making the model…

Scaling the data

We need to scale the data through a method called standardization (don’t
worry about how it works right now, that is a topic for another day).
Standardization rescales each feature so that it has a mean of 0 and a
standard deviation of 1, which puts most of the values roughly in the range of
-1 to 1. Some machine learning models are sensitive to whether or not data has
been scaled, and logistic regression is one such model. As an example: if we
do not scale the data, the model might consider 2000 m larger than 3 km,
because it only sees the numbers 2000 and 3. Scaling will help us get rid of
this problem.
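
For the curious, standardization boils down to one formula per feature: z = (value - mean) / standard deviation. Here is a minimal sketch using only the standard library. The six petal lengths are illustrative, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

# Illustrative petal lengths in cm (not the full dataset).
petal_lengths = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]

mean = fmean(petal_lengths)
std = pstdev(petal_lengths)  # population standard deviation

# Standardization: subtract the mean, divide by the standard deviation.
scaled = [(value - mean) / std for value in petal_lengths]

print([round(v, 2) for v in scaled])
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

After scaling, the values have a mean of 0 and a standard deviation of 1, whatever unit they started in.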

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Show the first 10 rows of the scaled training data.
print(x_train[:10])

We import the StandardScaler class from sklearn.preprocessing.

We create an instance of StandardScaler and call it sc.

We call the fit_transform() method from sc, which both fits the scaler to
the data and simultaneously scales the training data returning it to the
x_train variable.

Lastly we also scale the test data. At this point the scaler is already fitted
to the training data and should therefore not be fitted again. We simply
call the transform() method which returns the transformed test data.

In the output photo after the print statement we can see the first 10
entries of the scaled data. Indeed, most of the values lie between
-1 and 1.
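
The difference between fit_transform() and transform() is easiest to see on tiny numbers. In this sketch the arrays train_demo and test_demo are made up. The point is that the test data is scaled with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data.
train_demo = np.array([[1.0], [2.0], [3.0]])
test_demo = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learns mean and std, then scales
test_scaled = scaler.transform(test_demo)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)  # statistics learned from train_demo only
print(test_scaled.ravel())
```

Fitting the scaler again on the test data would leak information about the test set into the preprocessing, which is why only transform() is called there.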

Creating and training the logistic regression model

We are now ready to create the logistic regression model for a multiclass
problem. Up until now we have: imported the data, split it into training and
testing data and lastly scaled the data to make it more suitable for our
machine learning model.

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(multi_class='ovr',
random_state = 0)
classifier.fit(x_train, y_train)

We import LogisticRegression from sklearn.linear_model.

We create an instance of the LogisticRegression class called classifier by
calling LogisticRegression(). As parameters we input multi_class='ovr'
and random_state=0.

multi_class='ovr' is the trick I mentioned previously which makes the
logistic regression model work in a multiclass scenario: 'ovr' stands for
one-vs-rest, meaning that one binary model is trained per class. If our
problem had simply been binary we would have left this parameter out.
random_state accepts any integer as input; you can enter 0 to get the
same results as me.
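
Newer scikit-learn releases have deprecated the multi_class argument, so on a recent installation the same one-vs-rest idea may need to be spelled out explicitly. The sketch below is one way to do that with OneVsRestClassifier. It is an alternative to the snippet above, not the original code. It assumes scikit-learn's bundled copy of the iris data in place of the CSV, skips scaling for brevity, and sets max_iter=1000 only to help convergence on the unscaled data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# scikit-learn's bundled copy of the iris data (assumed equivalent to the CSV).
iris_x, iris_y = load_iris(return_X_y=True)

# Wrap a plain binary logistic regression in an explicit one-vs-rest scheme.
ovr_classifier = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_classifier.fit(iris_x, iris_y)

# One fitted binary model per species.
print(len(ovr_classifier.estimators_))
print(ovr_classifier.predict(iris_x[:3]))
```

The estimators_ attribute holds the three binary models, one for each species.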

Now let’s see how well our model performs.

Predicting results

from sklearn.metrics import confusion_matrix, accuracy_score

predictions = classifier.predict(x_test)
# Both metrics expect the true labels first, then the predictions.
cm = confusion_matrix(y_test, predictions)
print(cm)
accuracy_score(y_test, predictions)

We import confusion_matrix and accuracy_score. They are fairly
self-explanatory, but I will comment a bit on them.

We use the fitted classifier to get the predictions and save them to the
variable predictions.

The confusion_matrix takes two parameters: the expected values and our
predicted values, in this case y_test and predictions. The same applies
for the accuracy_score. With this order, each row of the confusion matrix
is an actual species and each column is a predicted one.

We print the confusion_matrix called cm.

In a notebook the return value of accuracy_score is displayed automatically
because it is the last line of the cell. In a plain script, wrap it in print().
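
To see what the two metrics actually count, here is a minimal by-hand sketch in plain Python. The label lists are illustrative and are not the article's results. Rows are the actual class and columns the predicted class, which is the layout scikit-learn produces when the true labels are passed first.

```python
# Illustrative labels (not the article's actual results).
actual = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "versicolor", "virginica"]

classes = sorted(set(actual))
index = {cls: i for i, cls in enumerate(classes)}

# Rows are the actual class, columns are the predicted class.
matrix = [[0] * len(classes) for _ in classes]
for truth, guess in zip(actual, predicted):
    matrix[index[truth]][index[guess]] += 1

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy_by_hand = correct / len(actual)

for row in matrix:
    print(row)
print(accuracy_by_hand)
```

Every number off the diagonal is a mistake, and its position tells you which species was confused with which.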

Conclusion
We can see that our model has an accuracy score of 0.9, which means it is
correct 90% of the time on the test data!

Congratulations, you have now completed a machine learning and data science
project!

What to do now?
There are many ways to improve this project! I wanted to keep this guide as
simple as possible for everyone to be able to participate. Because of this I
have cut some corners and skipped some of the best practices in data
science. Normally you would do a lot more data analysis before training the
model, like visualizing the data, and also do more work to validate the
accuracy of the model. I recommend that you try to complete a project with
binary classification as well, so that you master both. Try to do a Google
search for LogisticRegression(multi_class='ovr') to completely understand what
is going on in this step. Also, logistic regression might not be the best
machine learning model for this project; I simply chose it because it is so
well known and I consider it important to learn. You could try implementing
other models in this project and maybe you will achieve better results.
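
One possible way to "do more work to validate the accuracy" is cross-validation, which repeats the train/test split several times and averages the scores. The sketch below is a suggestion rather than part of the original project. It assumes scikit-learn's bundled iris data in place of the CSV and uses a default LogisticRegression with max_iter=1000.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled iris data (assumed equivalent to the CSV).
cv_x, cv_y = load_iris(return_X_y=True)

# The pipeline refits the scaler on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds: train on 4/5 of the data, score on the remaining 1/5, 5 times.
scores = cross_val_score(model, cv_x, cv_y, cv=5)
print(scores)
print(scores.mean())
```

A single split on 30 test flowers can be lucky or unlucky. Five scores give a better feeling for how stable the accuracy is.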

To get better at AI, machine learning and data science I recommend that you
keep practicing. This could include reading other guides here on Medium,
taking a course on Udemy or maybe applying to a boot camp. There are many
possibilities!
If you are interested in avoiding my biggest mistake for yourself, have a look
at my article called My biggest mistake learning machine learning.

I hope this article has been helpful to you! If you have any questions or
comments feel free to reach out to me in the response section below.

Keep learning!
— Jacob Toftgaard Rasmussen

