Utf-8''libraries Data Management
Utf-8''libraries Data Management
September 7, 2020
1 Python Libraries
Python, like other programming languages, has an abundance of additional modules or libraries
that augument the base framework and functionality of the language.
Think of a library as a collection of functions that can be accessed to complete certain program-
ming tasks without having to write your own algorithm.
For this course, we will focus primarily on the following libraries:
• Pandas provides high-performance, easy-to-use data structures and data analysis tools.
• Seaborn is a higher-level interface to Matplotlib that can be used to simplify many graphing
tasks.
2 Documentation
Reliable and accesible documentation is an absolute necessity when it comes to knowledge trans-
fer of programming languages. Luckily, python provides a significant amount of detailed docu-
mentation that explains the ins and outs of the language syntax, libraries, and more.
Understanding how to read documentation is crucial for any programmer as it will serve as a
fantastic resource when learning the intricacies of python.
Here is the link to the documentation of the python standard library: Python Standard Library
1
2.0.2 Utilizing Library Functions
After importing a library, its functions can then be called from your code by prepending the library
name to the function name. For example, to use the ‘dot’ function from the ‘numpy’ library, you
would enter ‘numpy.dot’. To avoid repeatedly having to type the libary name in your scripts, it
is conventional to define a two or three letter abbreviation for each library, e.g. ‘numpy’ is usually
abbreviated as ‘np’. This allows us to use ‘np.dot’ instead of ‘numpy.dot’. Similarly, the Pandas
library is typically abbreviated as ‘pd’.
The next cell shows how to call functions within an imported library:
In [2]: a = np.array([0,1,2,3,4,5,6,7,8,9,10])
np.mean(a)
Out[2]: 5.0
As you can see, we used the mean() function within the numpy library to calculate the mean
of the numpy 1-dimensional array.
3 Data Management
Data management is a crucial component to statistical analysis and data science work. The fol-
lowing code will show how to import data via the pandas library, view your data, and transform
your data.
The main data structure that Pandas works with is called a Data Frame. This is a two-
dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest
Participants), and the columns represent variables. Pandas also has a one-dimensional data struc-
ture called a Series that we will encounter when accesing a single column of a Data Frame.
Pandas has a variety of functions named ‘read_xxx’ for reading data in different formats. Right
now we will focus on reading ‘csv’ files, which stands for comma-separated values. However the
other file formats include excel, json, and sql just to name a few.
This is a link to the .csv that we will be exploring in this tutorial: Cartwheel Data (Link goes to
the dataset section of the Resources for this course)
There are many other options to ‘read_csv’ that are very useful. For example, you would use
the option sep='\t' instead of the default sep=',' if the fields of your data file are delimited by
tabs instead of commas. See here for the full documentation for ‘read_csv’.
Out[3]: pandas.core.frame.DataFrame
2
3.0.2 Viewing Data
In [4]: # We can view our Data Frame by calling the head() function
df.head()
The head() function simply shows the first 5 rows of our Data Frame. If we wanted to show
the entire Data Frame we would simply write the following:
3
23 24 26 M 2 N 0 69.00 71.0
24 25 23 F 1 Y 1 65.00 63.0
As you can see, we have a 2-Dimensional object where each row is an independent observation
of our cartwheel data.
To gather more information regarding the data, we can view the column names and data types
of each column with the following functions:
In [6]: df.columns
Lets say we would like to splice our data frame and select only specific portions of our data.
There are three different ways of doing so.
1. .loc()
2. .iloc()
3. .ix()
4
3.0.3 .loc()
.loc() takes two single/list/range operator separated by ‘,’. The first one indicates the row and the
second one indicates columns.
Out[7]: 0 79
1 70
2 85
3 87
4 72
5 81
6 107
7 98
8 106
9 65
10 96
11 79
12 92
13 66
14 72
15 115
16 90
17 74
18 64
19 85
20 66
21 101
22 82
23 63
24 67
Name: CWDistance, dtype: int64
In [8]: # Select all rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:,["CWDistance", "Height", "Wingspan"]]
5
10 96 69.50 66.0
11 79 62.75 58.0
12 92 65.00 64.5
13 66 61.50 57.5
14 72 73.00 74.0
15 115 71.00 72.0
16 90 61.50 59.5
17 74 66.00 66.0
18 64 70.00 69.0
19 85 68.00 66.0
20 66 69.00 67.0
21 101 71.00 70.0
22 82 70.00 68.0
23 63 69.00 71.0
24 67 65.00 63.0
In [9]: # Select few rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]
6
The .loc() function requires to arguments, the indices of the rows and the column names you
wish to observe.
In the above case : specifies all rows, and our column is CWDistance. df.loc[:,“CWDistance”]
Now, let’s say we only want to return the first 10 observations:
Out[11]: 0 79
1 70
2 85
3 87
4 72
5 81
6 107
7 98
8 106
9 65
Name: CWDistance, dtype: int64
3.0.4 .iloc()
.iloc() is integer based slicing, whereas .loc() used labels/column names. Here are some examples:
In [12]: df.iloc[:4]
7
TypeErrorTraceback (most recent call last)
<ipython-input-14-38420b6cd49e> in <module>()
----> 1 df.iloc[1:5, ["Gender", "GenderGroup"]]
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexing.pyc in __getit
1470 except (KeyError, IndexError):
1471 pass
-> 1472 return self._getitem_tuple(key)
1473 else:
1474 # we by definition only have the 0th axis
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _getite
2011 def _getitem_tuple(self, tup):
2012
-> 2013 self._has_valid_tuple(tup)
2014 try:
2015 return self._getitem_lowerdim(tup)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _has_va
220 raise IndexingError('Too many indexers')
221 try:
--> 222 self._validate_key(k, i)
223 except ValueError:
224 raise ValueError("Location based indexing can only have "
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexing.pyc in _valida
1965 l = len(self.obj._get_axis(axis))
1966
-> 1967 if len(arr) and (arr.max() >= l or arr.min() < -l):
1968 raise IndexError("positional indexers are out-of-bounds")
1969 else:
/opt/conda/envs/python2/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amax(a,
26 def _amax(a, axis=None, out=None, keepdims=False,
27 initial=_NoValue):
---> 28 return umr_maximum(a, axis, None, out, keepdims, initial)
29
30 def _amin(a, axis=None, out=None, keepdims=False,
8
We can view the data types of our data frame columns with by calling .dtypes on our data
frame:
In [ ]: df.dtypes
The output indicates we have integers, floats, and objects with our Data Frame.
We may also want to observe the different unique values within a specific column, lets do this
for Gender:
It seems that these fields may serve the same purpose, which is to specify male vs. female. Lets
check this quickly by observing only these two columns:
From eyeballing the output, it seems to check out. We can streamline this by utilizing the
groupby() and size() functions.
In [ ]: df.groupby(['Gender','GenderGroup']).size()
This validates our initial assumption that these two fields essentially portray the same infor-
mation.