A Good Beginner Project With Logistic
Regression
An easy to understand guide for the “hello world” project in machine learning
and data science
Jacob Toftgaard Rasmussen
Published in The Startup
8 min read · Aug 7, 2020
https://medium.com/swlh/a-good-beginner-project-with-logistic-regression-60b78e38484e 10/25/23, 2 02 AM
Photo by Olga Mandel on Unsplash
Welcome to this friendly beginner’s guide to creating a logistic regression model for
classification in Python!
With this guide I want to give you an easy way to complete your first data
science project by creating a logistic regression machine learning model
used for classification. I will explain all the steps thoroughly to make sure
that you know what is going on. We will be working with a very famous and
known data set originating from 1936. I have used Google Colab to write the
code.
The article consists of the following steps:
Introduction to the data
Binary- vs multiclass classification (an explanation)
Importing the data
Splitting the data into training and test data
Scaling the training and test data
Building and training the logistic regression for classification model
Evaluating the results
Conclusion and bye byes
I will show all the code and explain each step thoroughly. Sounds good?
Let’s do it!
Introduction to the data
The data can be found here. You will be directed to a Kaggle site (see the photo
below; I have highlighted the download button in yellow). After downloading
the file, I renamed it to dataset_iris.csv.
The dataset consists of 150 entries of data about iris plants. There are exactly
50 of each type of iris described in the data. The data points for each flower
are:
1. SepalLengthCm
2. SepalWidthCm
3. PetalLengthCm
4. PetalWidthCm
5. Species
In total we have 4 variables (1 to 4) that have an influence on which species of
plant a given plant might be. In machine learning terms these variables are
called features. The final variable (5) is called a label, and it is the
label/species that we would like to predict.
Here are the first 20 rows of the data so you can familiarize yourself with it.
The first 20 rows of data
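If you want to double-check the numbers above yourself, a quick sketch like the following may help. It assumes scikit-learn's bundled copy of the iris data (load_iris) as a stand-in for the downloaded CSV, and the columns are renamed by hand to match the list above. The Kaggle file itself may differ slightly, for example in how the species names are spelled.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for dataset_iris.csv.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Rename the columns to match the names used in this article.
df.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm",
              "PetalWidthCm", "Species"]

# Replace the numeric class codes (0, 1, 2) with the species names.
df["Species"] = df["Species"].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (rows, columns)
print(df["Species"].value_counts())  # number of flowers per species
```

The shape confirms the number of entries, and value_counts() shows how many flowers belong to each species.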
Binary- vs multiclass classification
For this project you will be creating a logistic regression model for
classification. To make the logistic regression model work as a classification
model we have to apply a small trick (don’t worry it won’t be difficult).
Classification problems can be divided into two categories, binary- and
multiclass. The first is a classification problem where the outcome is limited
to two different classes. This could be yes or no classifications. Will the
customer buy the product or not? Whereas the latter is a classification
problem where the outcome can be more than 2 classes, as in the case of this
project where we have 3 different classes of plants.
Logistic regression on its own is a binary classifier, so it is often not the
first choice for multiclass problems, and there are other models that handle
multiple classes naturally. However, the logistic model is very common and a
good model to get familiar with, which is why I have chosen to implement it
here anyway. And it does indeed work; we just have to provide the model with
a small power-up.
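The power-up used later in this guide is the one-vs-rest strategy ('ovr'). The idea can be sketched in plain Python. This is only an illustration with made-up labels and made-up scores, not how scikit-learn implements it internally: each class gets its own yes/no problem, and the class whose binary model is most confident wins.

```python
# Made-up labels for illustration only.
labels = ["setosa", "versicolor", "virginica", "setosa", "virginica"]
classes = sorted(set(labels))

# One-vs-rest: one binary target list per class
# (1 = "is this class", 0 = "is any other class").
binary_targets = {
    cls: [1 if label == cls else 0 for label in labels]
    for cls in classes
}

for cls, targets in binary_targets.items():
    print(cls, targets)


def pick_class(scores):
    """Return the class whose binary model gave the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores from the three binary models for a single flower.
example_scores = {"setosa": 0.05, "versicolor": 0.70, "virginica": 0.40}
print(pick_class(example_scores))
```

So a three-class problem becomes three binary problems, which is exactly the kind of problem logistic regression is built for.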
Alright, enough talking already, let’s code!
Each code snippet will be followed with an explanation of what just
happened.
Importing the data
import pandas as pd
dataset = pd.read_csv('dataset_iris.csv')
dataset.head()
Output
We need the pandas library to get access to the data, and we import it under
the alias pd.
pd (pandas) has a function called read_csv() which allows us to read data
from a comma-separated file. It returns a DataFrame, which we save
to a variable that we call dataset.
A pandas DataFrame has a method called head() which displays the first
5 entries in the DataFrame. (See output photo).
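If you do not have the file at hand, here is a small sketch of what read_csv() and head() do. It reads a few hand-typed rows from an in-memory string instead of dataset_iris.csv; the column layout (including a leading Id column) is an assumption about the downloaded file.

```python
import io

import pandas as pd

# A few hand-typed rows in the assumed layout of the downloaded file.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # head(n) shows the first n rows instead of the default 5
print(sample.dtypes)   # measurements are parsed as numbers, Species as text
```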
Now that we have all the data saved in a DataFrame, we will divide the
data into features and labels (remember, the features are the independent
variables that influence the dependent variable, called the label).
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
print(x[:10])
print(y[:10])
Output
We can access different rows and columns of a DataFrame by using iloc
followed by the positions of the rows and columns we want, and finally
appending .values to get the actual values as an array. Here, : selects all
rows, 1:-1 selects every column except the first and the last, and -1 selects
only the last column. We save the results to the variables x and y.
x now contains 150 rows of features. We see the first 10 rows in the output
photo.
y now contains the 150 corresponding species/labels. We see the first 10
labels in the output photo.
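To see the slicing in isolation, here is a minimal sketch on a tiny hand-typed table. The column layout (an Id column first, Species last) is an assumption about the downloaded file; the names toy, features and labels are only used for this illustration.

```python
import pandas as pd

# Tiny hand-typed table in the assumed column layout (Id first, Species last).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# [:, 1:-1] -> all rows, columns from position 1 up to (not including) the last.
features = toy.iloc[:, 1:-1].values
# [:, -1] -> all rows, last column only.
labels = toy.iloc[:, -1].values

print(features.shape)  # the Id and Species columns are gone
print(labels)
```

Starting the slice at 1 is what keeps a row counter such as Id out of the features, where it would carry no information about the flower.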
We have two more steps to complete before we are ready to create and train
our model, let’s do those now!
Splitting the data into training and test data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size = 0.2, random_state = 0)
From the scikit-learn library (imported as sklearn; you will meet sklearn many
times!) and its model_selection module we import train_test_split.
train_test_split allows us to split our data into training data and test data.
Here we pass it 4 arguments: the features (x), the labels (y), test_size (what
fraction of the data should become test data) and finally random_state
(put any number here, but use 0 to get the same results as me).
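The effect of test_size and random_state can be checked with a short sketch. It uses placeholder arrays with the same shape as the iris data rather than the real file, and the variable names are made up for this example. The last call uses the optional stratify argument, which is not part of this guide's code, to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the iris features and labels.
x_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat(["a", "b", "c"], 50)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0)
print(x_tr.shape, x_te.shape)  # 80% of the rows for training, 20% for testing

# The same random_state reproduces exactly the same split.
x_tr2, x_te2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(x_te, x_te2))

# Optional: stratify keeps the share of each class the same in both parts.
_, _, _, y_te_strat = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)
print(np.unique(y_te_strat, return_counts=True))
```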
Why do we split the data?
When we train a model with data, it gets familiar with that data and will
usually make very good predictions when it sees the same data again.
It won’t be a lot of fun to only use the model on data it has already seen, as
we do not want to predict things we already know! Therefore, we split the data
and only introduce the test data to the model when it is time to test our
model’s performance.
Now to the final step before making the model…
Scaling the data
We need to scale the data through a method called standardization (don’t
worry about how it works right now, that is a topic for another day).
Standardization rescales each feature so that it has a mean of 0 and a
standard deviation of 1, which puts most of the values in the range of -1 to 1.
Some machine learning models are sensitive to whether or not data has been
scaled, and logistic regression is one such model. As an example: if one
feature is measured in meters and another in kilometers, the model might treat
2000 m as larger than 3 km simply because the number is bigger. Scaling will
help us get rid of this problem.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
Printing the 10 first entries of the scaled training data
We import the StandardScaler class from sklearn.preprocessing.
We create an instance of StandardScaler and call it sc.
We call the fit_transform() method on sc, which both fits the scaler to
the training data and scales it, and we save the result back to the
x_train variable.
Lastly we also scale the test data. At this point the scaler is already fitted
to the training data and should therefore not be fitted again. We simply
call the transform() method which returns the transformed test data.
In the output photo after the print statement we can see the first 10
entries of the scaled data. Indeed we can see that the values are mostly
between -1 and 1.
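For the curious, the calculation behind standardization can be sketched in plain Python for a single feature. The numbers below are made up for illustration. The sketch subtracts the training mean and divides by the training standard deviation, and it reuses those same two numbers for the test values, which mirrors why the scaler is fitted on the training data only.

```python
from statistics import mean, pstdev

# Made-up sepal lengths, for illustration only.
train_lengths = [5.1, 4.9, 6.3, 5.8, 7.1]
test_lengths = [6.0, 5.0]

# "Fit": learn the mean and standard deviation from the training data only.
mu = mean(train_lengths)
sigma = pstdev(train_lengths)  # population standard deviation

# "Transform": subtract the mean, divide by the standard deviation.
scaled_train = [(value - mu) / sigma for value in train_lengths]
# The test data reuses the training mean and standard deviation.
scaled_test = [(value - mu) / sigma for value in test_lengths]

print([round(value, 2) for value in scaled_train])
print([round(value, 2) for value in scaled_test])
```

After scaling, the training values have a mean of 0 and a standard deviation of 1, so a value of 1.5 simply means "one and a half standard deviations above the average".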
And now to the step that we have all been waiting for!
Photo by Dilyara Garifullina on Unsplash
Creating and training the logistic regression model
We are now ready to create the logistic regression model for a multiclass
problem. Up until now we have: imported the data, split it into training and
testing data and lastly scaled the data to make it more suitable for our
machine learning model.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(multi_class='ovr',
random_state = 0)
classifier.fit(x_train, y_train)
We import LogisticRegression from sklearn.linear_model.
We create an instance of the LogisticRegression class called classifier by
calling LogisticRegression(). As parameters we input multi_class='ovr'
and random_state=0.
multi_class='ovr' is the trick I mentioned previously which makes the
logistic regression model work in a multiclass scenario. If our problem
had simply been binary we would have left this parameter out. (scikit-learn
1.5 and later deprecate this parameter and recommend wrapping the model in
OneVsRestClassifier from sklearn.multiclass instead.)
random_state is optional and accepts any integer; you can enter 0 to get the
same results as me.
Finally, we call fit() with the training features and labels to train the
model.
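If your scikit-learn version warns about multi_class, the same one-vs-rest idea can be expressed with the OneVsRestClassifier wrapper. The following is a sketch under assumptions, not the code used for the results in this article: it loads scikit-learn's bundled iris data (load_iris) in place of the CSV file so that it runs on its own, and the name ovr_classifier is made up for this example. The results are not guaranteed to match the ones reported here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for dataset_iris.csv.
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Same scaling steps as earlier in the guide.
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# One binary logistic regression per species, combined by the wrapper.
ovr_classifier = OneVsRestClassifier(LogisticRegression(random_state=0))
ovr_classifier.fit(x_train, y_train)

print(len(ovr_classifier.estimators_))     # one fitted model per class
print(ovr_classifier.predict(x_test[:5]))  # predicted class codes
```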
Wow! That was actually it.
Now let’s see how well our model performs.
Predicting results
from sklearn.metrics import confusion_matrix, accuracy_score
predictions = classifier.predict(x_test)
cm = confusion_matrix(y_test, predictions)
print(cm)
accuracy_score(y_test, predictions)
We import confusion_matrix and accuracy_score. They are fairly self-
explanatory, but I will comment a bit on them.
We use the fitted classifier to get the predictions and save them to the
variable predictions.
confusion_matrix takes two parameters: the expected values and our
predicted values, in this case y_test and predictions (in that order, so
that each row of the matrix is a true species and each column a predicted
species). The same applies for accuracy_score.
We print the confusion matrix, which we saved as cm.
In a notebook such as Google Colab, the return value of accuracy_score is
displayed automatically because it is the last line of the cell.
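To make the two metrics less of a black box, here is a plain-Python sketch of how a confusion matrix and the accuracy can be computed by hand. The labels below are made up for illustration and are not the results of the model in this guide.

```python
classes = ["setosa", "versicolor", "virginica"]

# Made-up true and predicted labels, for illustration only.
y_true = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 4
y_pred = ["setosa", "setosa", "setosa",
          "versicolor", "versicolor", "virginica",
          "virginica", "virginica", "virginica", "versicolor"]

# Rows are the true classes, columns are the predicted classes.
index = {cls: i for i, cls in enumerate(classes)}
matrix = [[0] * len(classes) for _ in classes]
for true_label, predicted_label in zip(y_true, y_pred):
    matrix[index[true_label]][index[predicted_label]] += 1

for row in matrix:
    print(row)

# Correct predictions sit on the diagonal of the matrix.
correct = sum(matrix[i][i] for i in range(len(classes)))
accuracy = correct / len(y_true)
print(accuracy)
```

Every number off the diagonal is a mistake, and its position tells you which species was confused with which.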
Conclusion
We can see that our model has an accuracy score of 0.9, which means it
classified 90% of the flowers in the test set correctly!
Congratulations, you have now completed a machine learning and data science
project!
What to do now?
There are many ways to improve this project! I wanted to keep this guide as
simple as possible for everyone to be able to participate. Because of this I
have cut some corners and skipped some of the best practices in data
science. Normally you would do a lot more data analysis before training the
model, like visualizing the data, and also do more work to validate the
accuracy of the model. I recommend that you try to complete a project with
binary classification as well, so that you master both. Try to do a Google
search for LogisticRegression(multi_class='ovr') to completely understand what
is going on in this step. Also, logistic regression might not be the best
machine learning model for this project, I simply chose it because it is so well
known and I consider it important to learn. You could try implementing
other models in this project and maybe you will achieve better results.
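One possible way to do more work on validating the accuracy is k-fold cross-validation, which repeats the train/test split several times and reports one score per round. The sketch below is an assumption-laden starting point rather than part of the original project: it uses scikit-learn's bundled iris data (load_iris) instead of the CSV file, and it leaves LogisticRegression on its default multiclass handling rather than 'ovr'.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for dataset_iris.csv.
x_all, y_all = load_iris(return_X_y=True)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))

# 5 folds: train on 4/5 of the data, score on the remaining 1/5, five times.
scores = cross_val_score(model, x_all, y_all, cv=5)
print(scores)
print(scores.mean())
```

Looking at the spread of the five scores gives a better feel for how much a single accuracy number depends on which flowers happened to land in the test set.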
To get better at AI, machine learning and data science I recommend that you
keep practicing. This could include reading other guides here on Medium,
taking a course on Udemy or maybe applying to a boot camp. There are many
possibilities!
If you are interested in avoiding my biggest mistake for yourself, have a look
at my article called My biggest mistake learning machine learning.
I hope this article has been helpful to you! If you have any questions or
comments feel free to reach out to me in the response section below.
Keep learning!
— Jacob Toftgaard Rasmussen