Week 4: Mini Project: Part 1: Familiarize Yourself With The Dataset
Abalone
Below is the dataset description from the UCI Machine Learning Repository.
localhost:8888/nbconvert/html/MiniProjectWeek4.ipynb?download=false 1/14
8/23/2020 MiniProjectWeek4
# We define a header ourselves since the dataset contains only the raw numbers.
dataset = []
header = ['Sex', 'Length', 'Diameter', 'Height', 'Whole Weight', 'Shucked Weight', 'Viscera Weight',
          'Shell Weight', 'Rings']
for line in all_lines:
    d = dict(zip(header, line))
    d['Length'] = float(d['Length'])
    d['Diameter'] = float(d['Diameter'])
    d['Height'] = float(d['Height'])
    d['Whole Weight'] = float(d['Whole Weight'])
    d['Shucked Weight'] = float(d['Shucked Weight'])
    d['Viscera Weight'] = float(d['Viscera Weight'])
    d['Shell Weight'] = float(d['Shell Weight'])
    d['Rings'] = int(d['Rings'])
    dataset.append(d)
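The loop above expects `all_lines` to hold one list of strings per record. A minimal sketch of how such a list might be built with Python's `csv` module, assuming a comma-separated file with no header row. The sample rows are illustrative only, and the file name mentioned in the comment is hypothetical:

```python
import csv
import io

# Illustrative rows in the assumed column order (Sex, Length, ..., Rings).
# With a real file, io.StringIO(sample) would be replaced by something like
# open('abalone.data') (hypothetical path).
sample = """M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
"""

# csv.reader yields each record as a list of strings
all_lines = list(csv.reader(io.StringIO(sample)))

print(len(all_lines))  # number of records read
print(all_lines[0])    # first record; every field is still a string
```

Every field comes back as a string, which is why the loop converts the measurements with `float()` and the ring count with `int()`.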
Fill in the following cells with the requested information about the dataset. The answers are given so you can
check the output of your own code. For floating numbers, don't worry too much about the exact numbers as long
as they are quite close -- different systems may have different rounding protocols.
Feel free to import numpy if you want more practice with it, or just use Python's native structures to play
around with the numbers.
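As a starting point, here is a minimal sketch of a few simple statistics computed with native Python structures. The three records in `sample_dataset` are hypothetical stand-ins for the real `dataset`, so the printed values will not match the answers given in this notebook:

```python
from statistics import mean

# Hypothetical mini-dataset using the same dict-per-record layout as `dataset`
sample_dataset = [
    {'Sex': 'M', 'Length': 0.45, 'Rings': 15},
    {'Sex': 'F', 'Length': 0.55, 'Rings': 9},
    {'Sex': 'I', 'Length': 0.35, 'Rings': 7},
]

n_records = len(sample_dataset)                          # how many abalones
avg_length = mean(d['Length'] for d in sample_dataset)   # average length
max_length = max(d['Length'] for d in sample_dataset)    # longest abalone

print(n_records, avg_length, max_length)
```

The same generator-expression pattern works for any numeric field, for example `d['Rings']` or `d['Diameter']`.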
Out[3]: 4177
Out[7]: 0.5239920995930099
Out[10]: 0.65
In [11]: # Q: What is the average number of rings of smaller abalones compared to that of larger abalones? That
# is, do smaller abalones tend to be younger or older than larger abalones?
# We will count small abalones as abalones with lengths less than or equal to the average length of
# an abalone. The average length of an abalone is 0.524.
# A: Small Abalones have on average 8.315645514223196 rings.
#    Large Abalones have on average 11.192848020434228 rings.
ageSmall = 0
ageLarge = 0
c1 = 0  # number of small abalones
c2 = 0  # number of large abalones
for i in range(len(dataset)):
    # "Small" means a length less than or equal to the average, as defined above
    if dataset[i]['Length'] <= 0.524:
        ageSmall += dataset[i]['Rings']
        c1 += 1
    else:
        ageLarge += dataset[i]['Rings']
        c2 += 1
ageSmall /= c1
ageLarge /= c2
# Change variable name if necessary
print('Small Abalones have on average', ageSmall, 'rings.')
print('Large Abalones have on average', ageLarge, 'rings.')
In lectures, we covered the basics of line plots, histograms, scatter plots, bar plots, and box plots. Let's try out a
few below.
Line Plots
Line plots show the change in data over time. The example line plot below plots the number of abalones at each
ring count (i.e. the distribution of rings, which stands in for age). Note that a line plot is not necessarily the
best way to show this data, since it is a distribution rather than a trend! Use a histogram (next step) to better
showcase this data.
# Customize plot
plt.gca().set(xlabel='Rings', ylabel='Number of Abalones',
              title='Abalone Age Distribution')
plt.grid()
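For reference, a minimal end-to-end sketch of such a line plot. The ring values are hypothetical, and the names `rings` and `ring_counts` are assumptions chosen for this example rather than the notebook's own variables:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical ring counts; in the notebook these would come from `dataset`
rings = [7, 9, 9, 10, 10, 10, 11, 11, 15]

# Count how many abalones share each ring value
ring_counts = Counter(rings)
x = sorted(ring_counts)              # distinct ring values, ascending
y = [ring_counts[r] for r in x]      # number of abalones per ring value

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set(xlabel='Rings', ylabel='Number of Abalones',
       title='Abalone Age Distribution')
ax.grid()
```

Sorting the ring values first matters: a line plot connects points in the order given, so unsorted x values would produce a zig-zag.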
Histograms
Histograms show the distribution of a numeric variable, making its central tendency and skewness easy to see.
Using the line plot data from above, plot a histogram showing the distribution of abalone age. Feel free to
explore matplotlib (https://matplotlib.org/gallery/index.html) on your own to customize your histogram and the
following visualizations.
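One possible approach, sketched with hypothetical ring values: because rings are whole numbers, the bin edges are placed at half-integers so that each bar is centred on one ring count. The variable names are assumptions for this example:

```python
import matplotlib.pyplot as plt

# Hypothetical ring counts; in the notebook these would come from `dataset`
rings = [7, 9, 9, 10, 10, 10, 11, 11, 15]

# Edges at 6.5, 7.5, ..., 15.5 give one bin per integer ring value
bins = [r - 0.5 for r in range(min(rings), max(rings) + 2)]

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(rings, bins=bins, edgecolor='black')
ax.set(xlabel='Rings', ylabel='Number of Abalones',
       title='Abalone Age Distribution')
```

`ax.hist` returns the per-bin counts and the bin edges, which is handy for checking the plot against a manual tally.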
[1, 2, 3, 3, 3, 3, 3, 3, 3, 3]
Scatter Plots
Scatter plots show the strength of a relationship between two variables (also known as correlation). From Part
2: Simple Statistics, we see that larger abalones tend to be older (they have more rings on average), at least
from a numbers perspective. Let's see if this is actually true by creating a scatter plot showing the
relationship between Rings and Length.
On Your Own: Read up on SciPy and how you can calculate and graph the correlation as well.
fig, ax = plt.subplots()
ax.scatter(rings, length)
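To put a number on the relationship shown in a scatter plot, the Pearson correlation coefficient can be computed by hand. The sketch below uses only the standard library; the helper name `pearson` and the sample values are hypothetical. SciPy's `scipy.stats.pearsonr` computes the same coefficient and also reports a p-value:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values standing in for the notebook's `rings` and `length` lists
rings = [7, 9, 10, 15]
length = [0.33, 0.53, 0.44, 0.455]

r = pearson(rings, length)
print(r)
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.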
Bar Plots
Bar plots are great for comparing categorical variables. There are a few subtypes of bar plots, such as the
grouped bar chart or stacked bar chart. Since we have the Sex field to play with, we can compare data across
M and F abalones. Below is a simple stacked bar chart comparing the Sex category with the Shucked
Weight data. Create a bar chart of your choice of data.
You may refer to the cell below to parse out fields by sex.
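A minimal sketch of one way to parse a field out by sex and chart it. This draws a plain bar chart of the mean Shucked Weight per sex rather than a stacked chart, and the records in `sample_dataset` are hypothetical:

```python
from collections import defaultdict

import matplotlib.pyplot as plt

# Hypothetical records using the same keys as `dataset`
sample_dataset = [
    {'Sex': 'M', 'Shucked Weight': 0.22},
    {'Sex': 'M', 'Shucked Weight': 0.30},
    {'Sex': 'F', 'Shucked Weight': 0.26},
    {'Sex': 'F', 'Shucked Weight': 0.34},
    {'Sex': 'I', 'Shucked Weight': 0.09},
]

# Group the shucked weights by the Sex field
weights_by_sex = defaultdict(list)
for d in sample_dataset:
    weights_by_sex[d['Sex']].append(d['Shucked Weight'])

# One bar per category, showing the mean weight in that category
sexes = sorted(weights_by_sex)
means = [sum(weights_by_sex[s]) / len(weights_by_sex[s]) for s in sexes]

fig, ax = plt.subplots()
ax.bar(sexes, means)
ax.set(xlabel='Sex', ylabel='Mean Shucked Weight',
       title='Mean Shucked Weight by Sex')
```

The grouping step is independent of the chart type, so the same `weights_by_sex` dictionary could feed a grouped or stacked bar chart instead.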
Box Plots
Box plots are useful for comparing distributions of data and are commonly found in research papers. The box
portion of a box plot represents the middle 50% of the data (the interquartile range), and there are versions
where you can mark outliers and other extremes. We have the distribution of rings already from the line plot
example under the variable name age_freq , assuming you haven't modified it. Find the distribution of another
field of your choice and create one or more box plots with both of these fields.
Hint: You can plot multiple box plots with the command plt.boxplot([plot1, plot2, ..., plotn]) or use
subplots() to draw multiple separate plots at the same time. See this matplotlib example
(https://matplotlib.org/gallery/statistics/boxplot_demo.html#sphx-glr-gallery-statistics-boxplot-demo-py) for more.
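A minimal sketch of the `subplots()` variant, assuming two fields on very different scales (ring counts and lengths), which is why each gets its own axis. All values and variable names are hypothetical:

```python
import matplotlib.pyplot as plt

# Hypothetical values for two fields; in the notebook these come from `dataset`
rings = [7, 9, 9, 10, 10, 10, 11, 11, 15]
length = [0.33, 0.35, 0.44, 0.455, 0.50, 0.52, 0.53, 0.60, 0.62]

# Two side-by-side axes so each field keeps its own y-scale
fig, (ax_rings, ax_length) = plt.subplots(1, 2)

bp_rings = ax_rings.boxplot(rings)
ax_rings.set(title='Rings', ylabel='Number of rings')

bp_length = ax_length.boxplot(length)
ax_length.set(title='Length', ylabel='Length')

fig.tight_layout()
```

When the fields share a scale, passing a list of lists to a single `boxplot` call, as in the hint above, places the boxes on one axis for direct comparison.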
[1, 2, 3, 3, 3]
[1, 1, 15, 57, 115]
Description of visualization
I understood the concepts of line plots, bar plots, scatter plots, and histograms well. I did not fully grasp the
concept of the box plot, but I tried it. For the bar plot, I took the various weights from the data, calculated
their averages, and plotted them. The remaining plots are easy to interpret at a glance. Thanks!
This part of the notebook is not graded, but still contains some valuable tips for web-scraping! You were
introduced to a method of creating your own dataset by parsing a webpage in lecture videos and this week's
notebook. Here is another way to parse a webpage with BeautifulSoup. We will be using a short story from
Project Gutenberg (Little Boy (http://www.gutenberg.org/files/58743/58743-h/58743-h.htm) by Harry Neal, 1954)
as an example.
On Your Own: Read this page on webscraping and try out a project!
https://automatetheboringstuff.com/chapter11/
With a BeautifulSoup object, we can easily search through HTML and create lists and other structures.
We can also extract all the text from a page and use it to create a bag of words or other measures.
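A minimal sketch of both ideas, assuming the page has already been downloaded into a string. The HTML below is a hypothetical stand-in rather than the actual story text, and `'html.parser'` selects Python's built-in parser:

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
html = """<html><body>
<h1>A Short Story</h1>
<p>The boy ran. The boy hid.</p>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# Extract all visible text, separating text nodes with spaces
text = soup.get_text(separator=' ')

# Bag of words: lowercase each token and strip surrounding punctuation
words = [w.strip('.,!?;:').lower() for w in text.split()]
bag = Counter(w for w in words if w)

print(bag.most_common(3))
```

A `Counter` is a convenient bag-of-words container because missing words simply count as zero.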
In [ ]: import string
from collections import defaultdict
letters = defaultdict(int)
punctuation = set(string.punctuation)
letters.items()
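The cell above only sets up the containers. A sketch of how the counting step might look, assuming the page text is available in a string; the sample `text` is hypothetical:

```python
import string
from collections import defaultdict

# Hypothetical text; in the notebook this would be the text extracted from the page
text = "Hello, World!"

letters = defaultdict(int)
punctuation = set(string.punctuation)

for char in text.lower():
    # Skip punctuation and whitespace so only letters and digits are counted
    if char in punctuation or char.isspace():
        continue
    letters[char] += 1

print(sorted(letters.items()))
```

Lowercasing first means that `H` and `h` are counted as the same letter.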
In previous lectures and notebooks, we wrote our own parser method to extract parts of the text. Here is a trivial
example of how you can do the same with BeautifulSoup using a list of Top 10 Chefs by Gazette Review
(https://gazettereview.com/2017/04/top-10-chefs/).
Note that all the names of the chefs are between <h2> and </h2> tags and the descriptions are between
<p> and </p> tags. We can get the names of the chefs quite easily, as seen below.
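A minimal sketch of that lookup, using a hypothetical HTML snippet with placeholder names in place of the real page. The variable name `chefs` is an assumption that matches the cleaning cell further down:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the layout described above
html = """
<h2>1. Chef One</h2>
<p>First paragraph.</p><p>Second paragraph.</p>
<h2>2. Chef Two</h2>
<p>First paragraph.</p><p>Second paragraph.</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag in document order
chefs = soup.find_all('h2')

# get_text drops the surrounding tags; strip=True trims whitespace
names = [h2.get_text(strip=True) for h2 in chefs]
print(names)
```

Calling `get_text()` on each tag is an alternative to cleaning the string form of the tag character by character.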
In [ ]: # Clean and strip spaces and numbers from the bs4 element and turn it into a Python list
import string
letters = set(string.ascii_letters)
chef_name = []
# Grab relevant letters/spaces and remove extra HTML tags and spaces
for chef in chefs:
    # Compare with == rather than `is`: identity checks against string literals are unreliable
    chef = [letter for letter in str(chef) if letter in letters or letter == ' ']
    # Trim the leftover characters from the surrounding tags at both ends
    chef = ''.join(chef[2:len(chef) - 1])
    chef_name.append(chef)
chef_name
Getting the list of chef names is trivial with the find_all() function (and a little Python cleaning), but what
about the descriptions? This is a little trickier since there may be overlapping uses for the <p> and </p> tags,
so let's try navigating the BeautifulSoup tree
(https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree).
This website is simple in that every chef has a two-paragraph description in the same format. We can use this to
our advantage once we know what to look for. Let's say we want to extract just the text from these two
paragraphs. How can we do so? With the .contents attribute, we can access the children of each tag.
In [ ]: descriptions = soup.find_all('p')
del descriptions[-12:]
del descriptions[0]
print("The number of paragraphs is:", len(descriptions))
descriptions[:2]
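To illustrate what `.contents` returns, here is a minimal sketch on a hypothetical paragraph with one nested tag. The children are a mix of plain strings and tags, in document order:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing one nested <b> tag
html = '<div><p>Trained in <b>Paris</b>, now in London.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

paragraph = soup.find('p')

# .contents lists the direct children: text, a <b> tag, then more text
children = paragraph.contents
print(children)

# get_text() flattens the children back into a single string
plain = paragraph.get_text()
print(plain)
```

When only the text is needed, `get_text()` is usually simpler; `.contents` is useful when the nested tags themselves matter, for example to pull out links or images.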
We now have lists with the names, descriptions, and images of the chefs! You can arrange this however you
want; chef_data below is arranged like a JSON object but you can modify this section to make the data look
more like a traditional dataset.
In [ ]: chef_data = {}
chef_data['Name'] = chef_name
chef_data['Description'] = chef_description
chef_data['Image'] = chef_image
chef_data['Description'][0]
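One way to make `chef_data` look more like a traditional dataset is to turn the dictionary of lists into a list of one-record-per-chef dictionaries. A minimal sketch with hypothetical placeholder values:

```python
# Hypothetical placeholder values in the same column-oriented layout
chef_data = {
    'Name': ['Chef One', 'Chef Two'],
    'Description': ['First description.', 'Second description.'],
    'Image': ['one.jpg', 'two.jpg'],
}

fields = list(chef_data)

# zip(*columns) walks the columns in parallel, yielding one row per chef
records = [dict(zip(fields, row)) for row in zip(*chef_data.values())]

for record in records:
    print(record)
```

This row-oriented layout matches the `dataset` list built for the abalone data, so the same looping patterns apply.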
Note: If you run into an HTTP 403 (Forbidden) error, the site probably blocks web-scraping scripts. You can get
around this by modifying the way you request the URL (see StackOverflow
(https://stackoverflow.com/questions/28396036/python-3-4-urllib-request-error-http-403) for some useful tips) or
try another site.
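One commonly suggested change is to send a browser-like User-Agent header with the request. The sketch below uses the standard library's `urllib.request`; the helper names `build_request` and `fetch_html`, the header value, and the example URL are all assumptions, and some sites may still refuse the request:

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0'):
    """Create a request that identifies itself with a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, user_agent='Mozilla/5.0'):
    """Download a page and return its HTML as text (performs a network call)."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read().decode('utf-8', errors='replace')


# Building the request does not touch the network; fetch_html() would
req = build_request('https://example.com/')
print(req.get_header('User-agent'))
```

Before working around a block, check the site's terms of use and robots.txt to confirm that scraping is permitted.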
All Done!
In this notebook, we covered loading a dataset, simple statistics, basic data visualizations, and web-scraping to
round out your toolset. These will be immensely helpful as you move forwards in building your skills in data
science.
By now, you hopefully feel a little more confident with tackling your final project. It is up to you to find your own
data, build your own notebook, and show others what you have achieved. Best of luck!
In [ ]: #DONEEEEE!