
Statistics and Machine Learning in Python
Release 0.3 beta

Edouard Duchesnay, Tommy Löfstedt, Feki Younes

Nov 13, 2019


CONTENTS

1 Introduction
  1.1 Python ecosystem for data-science
  1.2 Introduction to Machine Learning
  1.3 Data analysis methodology

2 Python language
  2.1 Import libraries
  2.2 Basic operations
  2.3 Data types
  2.4 Execution control statements
  2.5 Functions
  2.6 List comprehensions, iterators, etc.
  2.7 Regular expression
  2.8 System programming
  2.9 Scripts and argument parsing
  2.10 Networking
  2.11 Modules and packages
  2.12 Object Oriented Programming (OOP)
  2.13 Exercises

3 Scientific Python
  3.1 Numpy: arrays and matrices
  3.2 Pandas: data manipulation
  3.3 Matplotlib: data visualization

4 Statistics
  4.1 Univariate statistics
  4.2 Lab 1: Brain volumes study
  4.3 Multivariate statistics
  4.4 Time Series in python

5 Machine Learning
  5.1 Dimension reduction and feature extraction
  5.2 Clustering
  5.3 Linear methods for regression
  5.4 Linear classification
  5.5 Non linear learning algorithms
  5.6 Resampling Methods
  5.7 Ensemble learning: bagging, boosting and stacking
  5.8 Gradient descent

6 Deep Learning
  6.1 Backpropagation
  6.2 Multilayer Perceptron (MLP)
  6.3 Convolutional neural network
  6.4 Transfer Learning Tutorial

7 Indices and tables
CHAPTER ONE

INTRODUCTION

1.1 Python ecosystem for data-science

1.1.1 Python language

• Interpreted
• Garbage-collected (but garbage collection does not prevent memory leaks)
• Dynamically-typed language (Java is statically typed)
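A minimal sketch of what dynamic typing means in practice (the variable name below is arbitrary):

x = 5            # x is bound to an int object
print(type(x))   # <class 'int'>
x = "five"       # the same name can be rebound to a str at runtime
print(type(x))   # <class 'str'>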

1.1.2 Anaconda

Anaconda is a Python distribution that ships most of the common Python tools and libraries.
Installation
1. Download Anaconda (Python 3.x): http://continuum.io/downloads
2. Install it, on Linux:

bash Anaconda3-2.4.1-Linux-x86_64.sh

3. Add the Anaconda directory to your PATH variable in your .bashrc file:

export PATH="${HOME}/anaconda3/bin:$PATH"

Managing with conda


Update conda package and environment manager to current version

conda update conda

Install additional packages. The following commands install the Qt back-end (fixing a temporary
issue when running Spyder):

conda install pyqt


conda install PyOpenGL
conda update --all

Install seaborn for graphics


conda install seaborn


# install a specific version from the anaconda channel
conda install -c anaconda pyqt=4.11.4

List installed packages

conda list

Search available packages

conda search pyqt


conda search scikit-learn

Environments
• A conda environment is a directory that contains a specific collection of conda packages
that you have installed.
• Control the package environment for a specific purpose: collaborating with someone else,
delivering an application to your client, etc.
• Switch between environments.
List all environments:

conda info --envs
1. Create new environment
2. Activate
3. Install new package

conda create --name test


# Or
conda env create -f environment.yml
source activate test
conda info --envs
conda list
conda search -f numpy
conda install numpy

Miniconda
Anaconda without the collection of (>700) packages. With Miniconda you download only the
packages you want with the conda command: conda install PACKAGENAME
1. Download Miniconda (Python 3.x): https://conda.io/miniconda.html
2. Install it, on Linux

bash Miniconda3-latest-Linux-x86_64.sh

3. Add the Miniconda directory to your PATH variable in your .bashrc file:

export PATH=${HOME}/miniconda3/bin:$PATH

4. Install required packages


conda install -y scipy


conda install -y pandas
conda install -y matplotlib
conda install -y statsmodels
conda install -y scikit-learn
conda install -y sqlite
conda install -y spyder
conda install -y jupyter

1.1.3 Commands

python: the Python interpreter. From the DOS/Unix command line, execute a whole file:

python file.py

Interactive mode:

python

Quit with CTRL-D


ipython: advanced interactive python interpreter:

ipython

Quit with CTRL-D


pip: an alternative for package management (-U updates, --user installs in the user directory):

pip install -U --user seaborn

For neuroimaging:

pip install -U --user nibabel


pip install -U --user nilearn

spyder: IDE (integrated development environment):


• Syntax highlighting.
• Code introspection for code completion (use TAB).
• Support for multiple Python consoles (including IPython).
• Explore and edit variables from a GUI.
• Debugging.
• Navigate in the code (go to a function definition) with CTRL.

3 or 4 panels: text editor, help/variable explorer, IPython interpreter.

Shortcuts: F9 runs the current line/selection.


1.1.4 Libraries

scipy.org: https://www.scipy.org/docs.html
Numpy: basic numerical operations; matrix operations plus some basic solvers:

import numpy as np
X = np.array([[1, 2], [3, 4]])
#v = np.array([1, 2]).reshape((2, 1))
v = np.array([1, 2])
np.dot(X, v) # no broadcasting
X * v # broadcasting
np.dot(v, X)
X - X.mean(axis=0)

Scipy: general scientific library with advanced solvers:

import scipy
import scipy.linalg
scipy.linalg.svd(X, full_matrices=False)

Matplotlib: visualization:

import numpy as np
import matplotlib.pyplot as plt
#%matplotlib qt
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
plt.plot(x, sinus)
plt.show()

Pandas: manipulation of structured data (tables), input/output of Excel files, etc.

Statsmodels: advanced statistics.

Scikit-learn: machine learning.

Library        Covers
-------------  -----------------------------------------------------
Numpy          arrays, numerical computation, I/O; basic solvers
Scipy          basic and advanced solvers; basic statistics
Pandas         structured data, I/O
Statsmodels    basic and advanced statistics
Scikit-learn   machine learning
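
For completeness, a minimal Pandas illustration mirroring the snippets above (the column names and values are purely illustrative):

import pandas as pd
df = pd.DataFrame({'experience': [1, 3, 10], 'salary': [30, 35, 60]})
print(df.describe())   # basic descriptive statistics for each column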


1.2 Introduction to Machine Learning

1.2.1 Machine learning within data science

Machine learning covers two main types of data analysis:


1. Exploratory analysis: Unsupervised learning. Discover the structure within the data.
E.g.: Experience (in years in a company) and salary are correlated.
2. Predictive analysis: Supervised learning. This is sometimes described as “learn from
the past to predict the future”. Scenario: a company wants to detect potential future
clients among a base of prospects. Retrospective data analysis: we go through the data
constituted of previous prospected companies, with their characteristics (size, domain,
localization, etc. . . ). Some of these companies became clients, others did not. The ques-
tion is, can we possibly predict which of the new companies are more likely to become
clients, based on their characteristics and on previous observations? In this example,
the training data consist of a set of n training samples. Each sample, x_i, is a vector of p
input features (company characteristics) and a target feature, y_i ∈ {Yes, No} (whether
they became a client or not).
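
A minimal sketch of this supervised scenario with Scikit-learn (the feature values and labels below are toy data, not the prospect dataset described above):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 2], [200, 15], [50, 4], [300, 30]])   # n=4 samples, p=2 features per company
y = np.array([0, 1, 0, 1])                                # target: became a client (1) or not (0)

clf = LogisticRegression().fit(X, y)                      # learn from the past ...
print(clf.predict(np.array([[120, 10]])))                 # ... to predict a new prospect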

1.2.2 IT/computing science tools

• High Performance Computing (HPC)


• Data flow, data base, file I/O, etc.


• Python: the programming language.


• Numpy: python library particularly useful for handling of raw numerical data (matrices,
mathematical operations).
• Pandas: input/output, manipulation of structured data (tables).

1.2.3 Statistics and applied mathematics

• Linear model.
• Non parametric statistics.
• Linear algebra: matrix operations, inversion, eigenvalues.

1.3 Data analysis methodology

1. Formalize customer’s needs into a learning problem:


• A target variable: supervised problem.
– Target is qualitative: classification.
– Target is quantitative: regression.
• No target variable: unsupervised problem
– Visualization of high-dimensional samples: PCA, manifold learning, etc.
– Finding groups of samples (hidden structure): clustering.
2. Ask question about the datasets
• Number of samples
• Number of variables, types of each variable.
3. Define the sample
• For a prospective study, formalize the experimental design: inclusion/exclusion criteria.
The conditions that define the acquisition of the dataset.
• For a retrospective study, formalize the experimental design: inclusion/exclusion
criteria. The conditions that define the selection of the dataset.
4. In a document formalize (i) the project objectives; (ii) the required learning dataset (more
specifically the input data and the target variables); (iii) The conditions that define the ac-
quisition of the dataset. In this document, warn the customer that the learned algorithms
may not work on new data acquired under different conditions.
5. Read the learning dataset.
6. (i) Sanity check (basic descriptive statistics); (ii) data cleaning (impute missing data,
recoding); (iii) final Quality Control (QC): perform descriptive statistics and think! (remove
possible confounding variables, etc.).
7. Explore data (visualization, PCA) and perform basic univariate statistics for association
between the target and input variables.
8. Perform more complex multivariate-machine learning.


9. Model validation using a left-out-sample strategy (cross-validation, etc.).


10. Apply on new data.
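
For step 9, a minimal sketch of the left-out-sample strategy with Scikit-learn cross-validation (toy data generated on the fly):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=42)
scores = cross_val_score(LogisticRegression(), X, y, cv=5)   # 5-fold cross-validation
print(scores.mean())                                         # average left-out accuracy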

CHAPTER TWO

PYTHON LANGUAGE


Source: Kevin Markham, https://github.com/justmarkham/python-reference

2.1 Import libraries

# 'generic import' of math module


import math
math.sqrt(25)

# import a function
from math import sqrt
sqrt(25) # no longer have to reference the module

# import multiple functions at once


from math import cos, floor

# import all functions in a module (generally discouraged)


# from os import *

# define an alias
import numpy as np

# show all functions in math module


content = dir(math)

2.2 Basic operations

# Numbers
10 + 4 # add (returns 14)
10 - 4 # subtract (returns 6)
10 * 4 # multiply (returns 40)
10 ** 4 # exponent (returns 10000)
10 / 4 # true division (returns 2.5 in Python 3)
10 / float(4) # divide (returns 2.5)
5 % 4 # modulo (returns 1) - also known as the remainder

10 / 4 # true division (returns 2.5)
10 // 4 # floor division (returns 2)

# Boolean operations
# comparisons (these return True)
5 > 3
5 >= 3
5 != 3
5 == 5

# boolean operations (these return True)


5 > 3 and 6 > 3
5 > 3 or 5 < 3
not False
False or not False and True # evaluation order: not, and, or

2.3 Data types

# determine the type of an object


type(2) # returns 'int'
type(2.0) # returns 'float'
type('two') # returns 'str'
type(True) # returns 'bool'
type(None) # returns 'NoneType'

# check if an object is of a given type


isinstance(2.0, int) # returns False
isinstance(2.0, (int, float)) # returns True

# convert an object to a given type


float(2)
int(2.9)
str(2.9)

# zero, None, and empty containers are converted to False


bool(0)
bool(None)
bool('') # empty string
bool([]) # empty list
bool({}) # empty dictionary

# non-empty containers and non-zeros are converted to True


bool(2)
bool('two')
bool([2])

2.3.1 Lists

Lists are ordered sequences of objects: they are ordered, iterable, mutable (adding or removing
objects changes the list size) and can contain multiple data types.


# create an empty list (two ways)


empty_list = []
empty_list = list()

# create a list
simpsons = ['homer', 'marge', 'bart']

# examine a list
simpsons[0] # print element 0 ('homer')
len(simpsons) # returns the length (3)

# modify a list (does not return the list)


simpsons.append('lisa') # append element to end
simpsons.extend(['itchy', 'scratchy']) # append multiple elements to end
simpsons.insert(0, 'maggie') # insert element at index 0 (shifts everything right)
simpsons.remove('bart') # searches for first instance and removes it
simpsons.pop(0) # removes element 0 and returns it
del simpsons[0] # removes element 0 (does not return it)
simpsons[0] = 'krusty' # replace element 0

# concatenate lists (slower than 'extend' method)


neighbors = simpsons + ['ned','rod','todd']

# find elements in a list


simpsons.count('lisa') # counts the number of instances
simpsons.index('itchy') # returns index of first instance

# list slicing [start:end:stride]


weekdays = ['mon','tues','wed','thurs','fri']
weekdays[0] # element 0
weekdays[0:3] # elements 0, 1, 2
weekdays[:3] # elements 0, 1, 2
weekdays[3:] # elements 3, 4
weekdays[-1] # last element (element 4)
weekdays[::2] # every 2nd element (0, 2, 4)
weekdays[::-1] # backwards (4, 3, 2, 1, 0)

# alternative method for returning the list backwards


list(reversed(weekdays))

# sort a list in place (modifies but does not return the list)
simpsons.sort()
simpsons.sort(reverse=True) # sort in reverse
simpsons.sort(key=len) # sort by a key

# return a sorted list (but does not modify the original list)
sorted(simpsons)
sorted(simpsons, reverse=True)
sorted(simpsons, key=len)

# create a second reference to the same list


num = [1, 2, 3]
same_num = num
same_num[0] = 0 # modifies both 'num' and 'same_num'

# copy a list (three ways)




new_num = num.copy()
new_num = num[:]
new_num = list(num)

# examine objects
id(num) == id(same_num) # returns True
id(num) == id(new_num) # returns False
num is same_num # returns True
num is new_num # returns False
num == same_num # returns True
num == new_num # returns True (their contents are equivalent)

# concatenate +, replicate *
[1, 2, 3] + [4, 5, 6]
["a"] * 2 + ["b"] * 3

2.3.2 Tuples

Like lists, but their size cannot change: ordered, iterable, immutable, can contain multiple data
types

# create a tuple
digits = (0, 1, 'two') # create a tuple directly
digits = tuple([0, 1, 'two']) # create a tuple from a list
zero = (0,) # trailing comma is required to indicate it's a tuple

# examine a tuple
digits[2] # returns 'two'
len(digits) # returns 3
digits.count(0) # counts the number of instances of that value (1)
digits.index(1) # returns the index of the first instance of that value (1)

# elements of a tuple cannot be modified


# digits[2] = 2 # throws an error

# concatenate tuples
digits = digits + (3, 4)

# create a single tuple with elements repeated (also works with lists)
(3, 4) * 2 # returns (3, 4, 3, 4)

# tuple unpacking
bart = ('male', 10, 'simpson') # create a tuple
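# unpack the tuple into separate variables (one variable per element)
(sex, age, surname) = bart
sex, age, surname = bart   # the parentheses are optional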

2.3.3 Strings

A sequence of characters, they are iterable, immutable

# create a string
s = str(42) # convert another data type into a string
s = 'I like you'



# examine a string
s[0] # returns 'I'
len(s) # returns 10

# string slicing like lists


s[:6] # returns 'I like'
s[7:] # returns 'you'
s[-1] # returns 'u'

# basic string methods (does not modify the original string)


s.lower() # returns 'i like you'
s.upper() # returns 'I LIKE YOU'
s.startswith('I') # returns True
s.endswith('you') # returns True
s.isdigit() # returns False (returns True if every character in the string is a digit)
s.find('like') # returns index of first occurrence (2), but doesn't support regex
s.find('hate') # returns -1 since not found
s.replace('like','love') # replaces all instances of 'like' with 'love'

# split a string into a list of substrings separated by a delimiter


s.split(' ') # returns ['I','like','you']
s.split() # same thing
s2 = 'a, an, the'
s2.split(',') # returns ['a',' an',' the']

# join a list of strings into one string using a delimiter


stooges = ['larry','curly','moe']
' '.join(stooges) # returns 'larry curly moe'

# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4 # returns 'The meaning of life is 42'
s3 + ' ' + str(42) # same thing

# remove whitespace from start and end of a string


s5 = ' ham and cheese '
s5.strip() # returns 'ham and cheese'

# string substitutions: all of these return 'raining cats and dogs'


'raining %s and %s' % ('cats','dogs') # old way
'raining {} and {}'.format('cats','dogs') # new way
'raining {arg1} and {arg2}'.format(arg1='cats',arg2='dogs') # named arguments

# string formatting
# more examples: http://mkaz.com/2012/10/10/python-string-format/
'pi is {:.2f}'.format(3.14159) # returns 'pi is 3.14'

2.3.4 Strings 2/2

Normal strings allow for escaped characters

print('first line\nsecond line')


Out:
first line
second line

raw strings treat backslashes as literal characters


print(r'first line\nfirst line')

Out:
first line\nfirst line

Sequences of bytes are not strings; they should be decoded before some operations
s = b'first line\nsecond line'
print(s)

print(s.decode('utf-8').split())

Out:
b'first line\nsecond line'
['first', 'line', 'second', 'line']

2.3.5 Dictionaries

Dictionaries are structures which can contain multiple data types, organized as key-value
pairs: for each (unique) key, the dictionary outputs one value. Keys can be strings, numbers, or
tuples, while the corresponding values can be any Python object. Dictionaries are: unordered,
iterable, mutable
# create an empty dictionary (two ways)
empty_dict = {}
empty_dict = dict()

# create a dictionary (two ways)


family = {'dad':'homer', 'mom':'marge', 'size':6}
family = dict(dad='homer', mom='marge', size=6)

# convert a list of tuples into a dictionary


list_of_tuples = [('dad','homer'), ('mom','marge'), ('size', 6)]
family = dict(list_of_tuples)

# examine a dictionary
family['dad'] # returns 'homer'
len(family) # returns 3
family.keys() # returns list: ['dad', 'mom', 'size']
family.values() # returns list: ['homer', 'marge', 6]
family.items() # returns list of tuples:
# [('dad', 'homer'), ('mom', 'marge'), ('size', 6)]
'mom' in family # returns True
'marge' in family # returns False (only checks keys)

# modify a dictionary (does not return the dictionary)




family['cat'] = 'snowball' # add a new entry
family['cat'] = 'snowball ii' # edit an existing entry
del family['cat'] # delete an entry
family['kids'] = ['bart', 'lisa'] # value can be a list
family.pop('dad') # removes an entry and returns the value ('homer')
family.update({'baby':'maggie', 'grandpa':'abe'}) # add multiple entries

# accessing values more safely with 'get'


family['mom'] # returns 'marge'
family.get('mom') # same thing
try:
    family['grandma'] # throws an error
except KeyError as e:
    print("Error", e)

family.get('grandma') # returns None


family.get('grandma', 'not found') # returns 'not found' (the default)

# accessing a list element within a dictionary


family['kids'][0] # returns 'bart'
family['kids'].remove('lisa') # removes 'lisa'

# string substitution using a dictionary


'youngest child is %(baby)s' % family # returns 'youngest child is maggie'

Out:
Error 'grandma'

2.3.6 Sets

Like dictionaries, but with unique keys only (no corresponding values). They are: unordered,
iterable, mutable, and can contain multiple data types, made up of unique elements (strings,
numbers, or tuples).
# create an empty set
empty_set = set()

# create a set
languages = {'python', 'r', 'java'} # create a set directly
snakes = set(['cobra', 'viper', 'python']) # create a set from a list

# examine a set
len(languages) # returns 3
'python' in languages # returns True

# set operations
languages & snakes # returns intersection: {'python'}
languages | snakes # returns union: {'cobra', 'r', 'java', 'viper', 'python'}
languages - snakes # returns set difference: {'r', 'java'}
snakes - languages # returns set difference: {'cobra', 'viper'}

# modify a set (does not return the set)


languages.add('sql') # add a new element


languages.add('r') # try to add an existing element (ignored, no error)
languages.remove('java') # remove an element
try:
    languages.remove('c') # try to remove a non-existing element (throws an error)
except KeyError as e:
    print("Error", e)
languages.discard('c') # removes an element if present, but ignored otherwise
languages.pop() # removes and returns an arbitrary element
languages.clear() # removes all elements
languages.update(['go', 'spark']) # add multiple elements (pass a list or set; a bare string would add its individual characters)

# get a sorted list of unique elements from a list


sorted(set([9, 0, 2, 1, 0])) # returns [0, 1, 2, 9]

Out:

Error 'c'

2.4 Execution control statements

2.4.1 Conditional statements

x = 3
# if statement
if x > 0:
    print('positive')

# if/else statement
if x > 0:
    print('positive')
else:
    print('zero or negative')

# if/elif/else statement
if x > 0:
    print('positive')
elif x == 0:
    print('zero')
else:
    print('negative')

# single-line if statement (sometimes discouraged)


if x > 0: print('positive')

# single-line if/else statement (sometimes discouraged)


# known as a 'ternary operator'
'positive' if x > 0 else 'zero or negative'

'positive' if x > 0 else 'zero or negative'

Out:


positive
positive
positive
positive

2.4.2 Loops

Loops are a set of instructions which repeat until a termination condition is met. This can
include iterating through all values in an object, going through a range of values, etc.
# range returns a list of integers
range(0, 3) # returns [0, 1, 2]: includes first value but excludes second value
range(3) # same thing: starting at zero is the default
range(0, 5, 2) # returns [0, 2, 4]: third argument specifies the 'stride'

# for loop
fruits = ['apple', 'banana', 'cherry']
for i in range(len(fruits)):
    print(fruits[i].upper())

# alternative for loop (recommended style)
for fruit in fruits:
    print(fruit.upper())

# use range when iterating over a large sequence to avoid actually creating the integer list in memory
v = 0
for i in range(10 ** 6):
    v += 1

quote = """
our incomes are like our shoes; if too small they gall and pinch us
but if too large they cause us to stumble and to trip
"""

count = {k: 0 for k in set(quote.split())}

for word in quote.split():
    count[word] += 1

# iterate through two things at once (using tuple unpacking)
family = {'dad': 'homer', 'mom': 'marge', 'size': 6}
for key, value in family.items():
    print(key, value)

# use enumerate if you need to access the index value within the loop
for index, fruit in enumerate(fruits):
    print(index, fruit)

# for/else loop
for fruit in fruits:
    if fruit == 'banana':
        print("Found the banana!")
        break # exit the loop and skip the 'else' block
else:
    # this block executes ONLY if the for loop completes without hitting 'break'
    print("Can't find the banana")

# while loop
count = 0
while count < 5:
    print("This will print 5 times")
    count += 1 # equivalent to 'count = count + 1'

Out:

APPLE
BANANA
CHERRY
APPLE
BANANA
CHERRY
dad homer
mom marge
size 6
0 apple
1 banana
2 cherry
Can't find the banana
Found the banana!
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times

2.4.3 Exceptions handling

dct = dict(a=[1, 2], b=[4, 5])

key = 'c'
try:
    dct[key]
except:
    print("Key %s is missing. Add it with empty value" % key)
    dct['c'] = []

print(dct)

Out:

Key c is missing. Add it with empty value


{'a': [1, 2], 'b': [4, 5], 'c': []}
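
An alternative sketch of the same pattern using dict.setdefault, which inserts the default value only when the key is missing:

dct = dict(a=[1, 2], b=[4, 5])
dct.setdefault('c', [])   # returns dct['c'], creating it as [] if absent
print(dct)                # {'a': [1, 2], 'b': [4, 5], 'c': []}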


2.5 Functions

Functions are sets of instructions launched when called; they can have multiple input
values and a return value.

# define a function with no arguments and no return values


def print_text():
    print('this is text')

# call the function


print_text()

# define a function with one argument and no return values


def print_this(x):
    print(x)

# call the function


print_this(3) # prints 3
n = print_this(3) # prints 3, but doesn't assign 3 to n
# because the function has no return statement

#
def add(a, b):
    return a + b

add(2, 3)

add("deux", "trois")

add(["deux", "trois"], [2, 3])

# define a function with one argument and one return value


def square_this(x):
    return x ** 2

# include an optional docstring to describe the effect of a function


def square_this(x):
    """Return the square of a number."""
    return x ** 2

# call the function


square_this(3) # prints 9
var = square_this(3) # assigns 9 to var, but does not print 9

# default arguments
def power_this(x, power=2):
    return x ** power

power_this(2) # 4
power_this(2, 3) # 8

# use 'pass' as a placeholder if you haven't written the function body


def stub():
    pass

# return two values from a single function




def min_max(nums):
    return min(nums), max(nums)

# return values can be assigned to a single variable as a tuple


nums = [1, 2, 3]
min_max_num = min_max(nums) # min_max_num = (1, 3)

# return values can be assigned into multiple variables using tuple unpacking
min_num, max_num = min_max(nums) # min_num = 1, max_num = 3

Out:
this is text
3
3

2.6 List comprehensions, iterators, etc.

2.6.1 List comprehensions

List comprehensions process whole lists without explicit loops. For more:
http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html
# for loop to create a list of cubes
nums = [1, 2, 3, 4, 5]
cubes = []
for num in nums:
    cubes.append(num**3)

# equivalent list comprehension


cubes = [num**3 for num in nums] # [1, 8, 27, 64, 125]

# for loop to create a list of cubes of even numbers


cubes_of_even = []
for num in nums:
    if num % 2 == 0:
        cubes_of_even.append(num**3)

# equivalent list comprehension


# syntax: [expression for variable in iterable if condition]
cubes_of_even = [num**3 for num in nums if num % 2 == 0] # [8, 64]

# for loop to cube even numbers and square odd numbers


cubes_and_squares = []
for num in nums:
    if num % 2 == 0:
        cubes_and_squares.append(num**3)
    else:
        cubes_and_squares.append(num**2)

# equivalent list comprehension (using a ternary expression)


# syntax: [true_condition if condition else false_condition for variable in iterable]
cubes_and_squares = [num**3 if num % 2 == 0 else num**2 for num in nums] # [1, 8, 9, 64, 25]


# for loop to flatten a 2d-matrix


matrix = [[1, 2], [3, 4]]
items = []
for row in matrix:
    for item in row:
        items.append(item)

# equivalent list comprehension


items = [item for row in matrix
         for item in row] # [1, 2, 3, 4]

# set comprehension
fruits = ['apple', 'banana', 'cherry']
unique_lengths = {len(fruit) for fruit in fruits} # {5, 6}

# dictionary comprehension
fruit_lengths = {fruit: len(fruit) for fruit in fruits} # {'apple': 5, 'banana': 6, 'cherry': 6}
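
The section title also mentions iterators; a minimal sketch of a generator expression, which produces items lazily instead of building the whole list in memory:

# generator expression: like a list comprehension, but with parentheses
nums = [1, 2, 3, 4, 5]
squares = (num**2 for num in nums)    # nothing is computed yet
next(squares)                         # 1, computed on demand
sum(num**2 for num in nums)           # a generator can feed a function directly (returns 55)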

2.7 Regular expression

1. Compile a regular expression with a pattern

import re

# 1. compile a regular expression with a pattern


regex = re.compile("^.+(sub-.+)_(ses-.+)_(mod-.+)")

2. Match compiled RE on string


Capture the pattern `anyprefixsub-<subj id>_ses-<session id>_<modality>`

strings = ["abcsub-033_ses-01_mod-mri", "defsub-044_ses-01_mod-mri", "ghisub-055_ses-02_


˓→mod-ctscan" ]
print([regex.findall(s)[0] for s in strings])

Out:

[('sub-033', 'ses-01', 'mod-mri'), ('sub-044', 'ses-01', 'mod-mri'), ('sub-055', 'ses-02', 'mod-ctscan')]

Match methods on compiled regular expression

Method/Attribute    Purpose
match(string)       Determine if the RE matches at the beginning of the string.
search(string)      Scan through a string, looking for any location where this RE matches.
findall(string)     Find all substrings where the RE matches, and return them as a list.
finditer(string)    Find all substrings where the RE matches, and return them as an iterator.
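
A short sketch of the difference between these methods on a simple pattern (the pattern and strings below are illustrative):

import re

regex = re.compile(r"sub-\d+")
print(regex.match("abcsub-033"))           # None: match only anchors at the beginning
print(regex.search("abcsub-033"))          # a Match object for 'sub-033' (found at index 3)
print(regex.findall("sub-033 sub-044"))    # ['sub-033', 'sub-044']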

3. Replace compiled RE on string


regex = re.compile("(sub-[^_]+)") # match (sub-...)_


print([regex.sub("SUB-", s) for s in strings])

regex.sub("SUB-", "toto")

Out:

['abcSUB-_ses-01_mod-mri', 'defSUB-_ses-01_mod-mri', 'ghiSUB-_ses-02_mod-ctscan']

Replace all non-alphanumeric characters in a string

re.sub('[^0-9a-zA-Z]+', '', 'h^&ell`.,|o w]{+orld')

2.8 System programming

2.8.1 Operating system interfaces (os)

import os

Current working directory

# Get the current working directory


cwd = os.getcwd()
print(cwd)

# Set the current working directory


os.chdir(cwd)

Out:

/home/edouard/git/pystatsml/python_lang

Temporary directory

import tempfile

tmpdir = tempfile.gettempdir()

Join paths

mytmpdir = os.path.join(tmpdir, "foobar")

# list containing the names of the entries in the directory given by path.
os.listdir(tmpdir)

Create a directory

if not os.path.exists(mytmpdir):
    os.mkdir(mytmpdir)

os.makedirs(os.path.join(tmpdir, "foobar", "plop", "toto"), exist_ok=True)


2.8.2 File input/output

filename = os.path.join(mytmpdir, "myfile.txt")


print(filename)

# Write
lines = ["Dans python tout est bon", "Enfin, presque"]

## write line by line


fd = open(filename, "w")
fd.write(lines[0] + "\n")
fd.write(lines[1]+ "\n")
fd.close()

## use a context manager to automatically close your file


with open(filename, 'w') as f:
    for line in lines:
        f.write(line + '\n')

# Read
## read one line at a time (entire file does not have to fit into memory)
f = open(filename, "r")
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()

## read one line at a time (entire file does not have to fit into memory)
f = open(filename, 'r')
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()

## read the whole file at once, return a list of lines


f = open(filename, 'r')
f.readlines() # one list, each line is one string
f.close()

## use list comprehension to duplicate readlines without reading entire file at once
f = open(filename, 'r')
[line for line in f]
f.close()

## use a context manager to automatically close your file


with open(filename, 'r') as f:
    lines = [line for line in f]

Out:

/tmp/foobar/myfile.txt

2.8.3 Explore, list directories

Walk


import os

WD = os.path.join(tmpdir, "foobar")

for dirpath, dirnames, filenames in os.walk(WD):
    print(dirpath, dirnames, filenames)

Out:

/tmp/foobar ['plop'] ['myfile.txt']


/tmp/foobar/plop ['toto'] []
/tmp/foobar/plop/toto [] []

glob, basename and file extension

import tempfile
import glob

tmpdir = tempfile.gettempdir()

filenames = glob.glob(os.path.join(tmpdir, "*", "*.txt"))


print(filenames)

# take basename then remove extension


basenames = [os.path.splitext(os.path.basename(f))[0] for f in filenames]
print(basenames)

Out:

['/tmp/foobar/myfile.txt']
['myfile']

shutil - High-level file operations

import shutil

src = os.path.join(tmpdir, "foobar", "myfile.txt")


dst = os.path.join(tmpdir, "foobar", "plop", "myfile.txt")
print("copy %s to %s" % (src, dst))

shutil.copy(src, dst)

print("File %s exists ?" % dst, os.path.exists(dst))

src = os.path.join(tmpdir, "foobar", "plop")


dst = os.path.join(tmpdir, "plop2")
print("copy tree %s under %s" % (src, dst))

try:
    shutil.copytree(src, dst)
    shutil.rmtree(dst)
    shutil.move(src, dst)
except (FileExistsError, FileNotFoundError) as e:
    pass


Out:

copy /tmp/foobar/myfile.txt to /tmp/foobar/plop/myfile.txt


File /tmp/foobar/plop/myfile.txt exists ? True
copy tree /tmp/foobar/plop under /tmp/plop2

2.8.4 Command execution with subprocess

• For more advanced use cases, the underlying Popen interface can be used directly.
• Run the command described by args.
• Wait for command to complete
• return a CompletedProcess instance.
• Does not capture stdout or stderr by default. To do so, pass PIPE for the stdout and/or
stderr arguments.

import subprocess

# doesn't capture output


p = subprocess.run(["ls", "-l"])
print(p.returncode)

# Run through the shell.


subprocess.run("ls -l", shell=True)

# Capture output
out = subprocess.run(["ls", "-a", "/"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# out.stdout is a sequence of bytes that should be decoded into a utf-8 string
print(out.stdout.decode('utf-8').split("\n")[:5])

Out:

0
['.', '..', 'bin', 'boot', 'cdrom']

2.8.5 Multiprocessing and multithreading

Process
A process is a name given to a program instance that has been loaded into memory
and managed by the operating system.
Process = address space + execution context (thread of control)
Process address space (segments):
• Code.
• Data (static/global).
• Heap (dynamic memory allocation).
• Stack.
Execution context:


• Data registers.
• Stack pointer (SP).
• Program counter (PC).
• Working Registers.
OS Scheduling of processes: context switching (ie. save/load Execution context)
Pros/cons
• Context switching expensive.
• (potentially) complex data sharing (not necessarily true).
• Cooperating processes - no need for memory protection (separate address
spaces).
• Relevant for parallel computation with memory allocation.
Threads
• Threads share the same address space: access to code, heap and (global) data.
• Separate execution stack, PC and Working Registers.
Pros/cons
• Faster context switching only SP, PC and Working Registers.
• Can exploit fine-grain concurrency
• Simple data sharing through the shared address space.
• Precautions have to be taken or two threads will write to the same memory at
the same time. This is what the global interpreter lock (GIL) is for.
• Relevant for GUI, I/O (Network, disk) concurrent operation
In Python
• The threading module uses threads.
• The multiprocessing module uses processes.
Multithreading

import time
import threading

def list_append(count, sign=1, out_list=None):
    if out_list is None:
        out_list = list()
    for i in range(count):
        out_list.append(sign * i)
    sum(out_list)  # do some computation
    return out_list

size = 10000 # Number of numbers to add

out_list = list() # result is a simple list




thread1 = threading.Thread(target=list_append, args=(size, 1, out_list, ))
thread2 = threading.Thread(target=list_append, args=(size, -1, out_list, ))

startime = time.time()
# Will execute both in parallel
thread1.start()
thread2.start()
# Joins threads back to the parent process
thread1.join()
thread2.join()
print("Threading ellapsed time ", time.time() - startime)

print(out_list[:10])

Out:

Threading ellapsed time 1.7868659496307373


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Multiprocessing

import multiprocessing

# Sharing requires a specific mechanism


out_list1 = multiprocessing.Manager().list()
p1 = multiprocessing.Process(target=list_append, args=(size, 1, None))
out_list2 = multiprocessing.Manager().list()
p2 = multiprocessing.Process(target=list_append, args=(size, -1, None))

startime = time.time()
p1.start()
p2.start()
p1.join()
p2.join()
print("Multiprocessing ellapsed time ", time.time() - startime)

# print(out_list[:10]) is not availlable

Out:

Multiprocessing ellapsed time 0.3927607536315918

Sharing object between process with Managers


Managers provide a way to create data which can be shared between different processes, in-
cluding sharing over a network between processes running on different machines. A manager
object controls a server process which manages shared objects.

import multiprocessing
import time

size = int(size / 100) # Number of numbers to add

# Sharing requires a specific mechanism


out_list = multiprocessing.Manager().list()


p1 = multiprocessing.Process(target=list_append, args=(size, 1, out_list))
p2 = multiprocessing.Process(target=list_append, args=(size, -1, out_list))

startime = time.time()

p1.start()
p2.start()

p1.join()
p2.join()

print(out_list[:10])

print("Multiprocessing with shared object ellapsed time ", time.time() - startime)

Out:

[0, 1, 2, 0, 3, -1, 4, -2, 5, -3]


Multiprocessing with shared object ellapsed time 0.7650048732757568

2.9 Scripts and argument parsing

Example, the word count script

import os
import os.path
import argparse
import re
import pandas as pd

if __name__ == "__main__":
# parse command line options
output = "word_count.csv"
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input',
help='list of input files.',
nargs='+', type=str)
parser.add_argument('-o', '--output',
help='output csv file (default %s)' % output,
type=str, default=output)
options = parser.parse_args()

if options.input is None :
parser.print_help()
raise SystemExit("Error: input files are missing")
else:
filenames = [f for f in options.input if os.path.isfile(f)]

# Match words
regex = re.compile("[a-zA-Z]+")

count = dict()
for filename in filenames:
(continues on next page)

28 Chapter 2. Python language


Statistics and Machine Learning in Python, Release 0.3 beta

(continued from previous page)


fd = open(filename, "r")
for line in fd:
for word in regex.findall(line.lower()):
if not word in count:
count[word] = 1
else:
count[word] += 1

fd = open(options.output, "w")

# Pandas
df = pd.DataFrame([[k, count[k]] for k in count], columns=["word", "count"])
df.to_csv(options.output, index=False)
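
Assuming the script above is saved as word_count.py (the file name is illustrative), it could be run from the command line as:

python word_count.py --input file1.txt file2.txt --output word_count.csv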

2.10 Networking

# TODO

2.10.1 FTP

# Full FTP features with ftplib


import ftplib
ftp = ftplib.FTP("ftp.cea.fr")
ftp.login()
ftp.cwd('/pub/unati/people/educhesnay/pystatml')
ftp.retrlines('LIST')

fd = open(os.path.join(tmpdir, "README.md"), "wb")


ftp.retrbinary('RETR README.md', fd.write)
fd.close()
ftp.quit()

# File download urllib


import urllib.request
ftp_url = 'ftp://ftp.cea.fr/pub/unati/people/educhesnay/pystatml/README.md'
urllib.request.urlretrieve(ftp_url, os.path.join(tmpdir, "README2.md"))

Out:

-rw-r--r-- 1 ftp ftp 3019 Oct 16 00:30 README.md


-rw-r--r-- 1 ftp ftp 9588437 Oct 28 19:58 StatisticsMachineLearningPythonDraft.pdf

2.10.2 HTTP

# TODO
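
Left as TODO in this release; a minimal sketch of an HTTP request with the standard library (the URL is only an example):

import urllib.request

with urllib.request.urlopen('https://www.python.org') as response:
    html = response.read().decode('utf-8')    # response body as a string
    print(response.status, len(html))          # HTTP status code and body length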


2.10.3 Sockets

# TODO

2.10.4 xmlrpc

# TODO

2.11 Modules and packages

A module is a Python file. A package is a directory which MUST contain a special file called
__init__.py.
To import, extend the PYTHONPATH variable:

export PYTHONPATH=path_to_parent_python_module:${PYTHONPATH}

Or

import sys
sys.path.append("path_to_parent_python_module")

The __init__.py file can be empty. But you can set which modules the package exports as the
API, while keeping other modules internal, by overriding the __all__ variable, like so:
parentmodule/__init__.py file:

from . import submodule1


from . import submodule2

from .submodule3 import function1


from .submodule3 import function2

__all__ = ["submodule1", "submodule2",


"function1", "function2"]

User can import:

import parentmodule.submodule1
import parentmodule.function1

Python Unit Testing
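
The line above only names the topic; a minimal unittest sketch (module and function names are illustrative):

import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()   # run with: python test_add.py, or python -m unittest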

2.12 Object Oriented Programming (OOP)

Sources
• http://python-textbok.readthedocs.org/en/latest/Object_Oriented_Programming.html
Principles
• Encapsulate data (attributes) and code (methods) into objects.


• Class = template or blueprint that can be used to create objects.


• An object is a specific instance of a class.
• Inheritance: OOP allows classes to inherit commonly used state and behaviour from other
classes. Reduce code duplication
• Polymorphism: calling code is agnostic as to whether an object belongs to a parent class or
one of its descendants (abstraction, modularity). The same method called on 2 objects of
2 different classes will behave differently.

import math

class Shape2D:
    def area(self):
        raise NotImplementedError()

# __init__ is a special method called the constructor

# Inheritance + Encapsulation
class Square(Shape2D):
    def __init__(self, width):
        self.width = width

    def area(self):
        return self.width ** 2

class Disk(Shape2D):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

shapes = [Square(2), Disk(3)]

# Polymorphism
print([s.area() for s in shapes])

s = Shape2D()
try:
    s.area()
except NotImplementedError as e:
    print("NotImplementedError")

Out:

[4, 28.274333882308138]
NotImplementedError


2.13 Exercises

2.13.1 Exercise 1: functions

Create a function that acts as a simple calculator. If the operation is not specified, default to
addition. If the operation is misspecified, return a prompt message. Ex: calc(4, 5, "multiply")
returns 20; calc(3, 5) returns 8; calc(1, 2, "something") returns an error message.

2.13.2 Exercise 2: functions + list + loop

Given a list of numbers, return a list where all adjacent duplicate elements have been reduced
to a single element. Ex: [1, 2, 2, 3, 2] returns [1, 2, 3, 2]. You may create a new list or
modify the passed in list.
Remove all duplicate values (adjacent or not) Ex: [1, 2, 2, 3, 2] returns [1, 2, 3]

2.13.3 Exercise 3: File I/O

1. Copy/paste the BSD 4 clause license (https://en.wikipedia.org/wiki/BSD_licenses) into a


text file. Read the file and count the occurrences of each word within the file. Store the words'
occurrence numbers in a dictionary.
2. Write an executable python command count_words.py that parses a list of input files provided
after the --input parameter. The dictionary of occurrences is saved in a csv file provided by
--output, with default value word_count.csv. Use: open, regular expressions, argparse
(https://docs.python.org/3/howto/argparse.html).

2.13.4 Exercise 4: OOP

1. Create a class Employee with 2 attributes provided in the constructor: name,
years_of_service. Add one method salary which returns 1500 + 100 * years_of_service.
2. Create a subclass Manager which redefines the salary method as 2500 + 120 * years_of_service.
3. Create a small dictionary-based database where the key is the employee's name. Populate
the database with: samples = Employee('lucy', 3), Employee('john', 1), Manager('julie',
10), Manager('paul', 3)
4. Return a table of name, salary rows, i.e. a list of lists [[name, salary]]
5. Compute the average salary
Total running time of the script: ( 0 minutes 3.188 seconds)



CHAPTER THREE

SCIENTIFIC PYTHON


3.1 Numpy: arrays and matrices

NumPy is an extension to the Python programming language, adding support for large, multi-
dimensional (numerical) arrays and matrices, along with a large library of high-level mathe-
matical functions to operate on these arrays.
Sources:
• Kevin Markham: https://github.com/justmarkham

import numpy as np

3.1.1 Create arrays

Create ndarrays from lists. note: every element must be the same type (will be converted if
possible)

data1 = [1, 2, 3, 4, 5] # list


arr1 = np.array(data1) # 1d array
data2 = [range(1, 5), range(5, 9)] # list of lists
arr2 = np.array(data2) # 2d array
arr2.tolist() # convert array back to list

create special arrays

np.zeros(10)
np.zeros((3, 6))
np.ones(10)
np.linspace(0, 1, 5) # 0 to 1 (inclusive) with 5 points
np.logspace(0, 3, 4) # 10^0 to 10^3 (inclusive) with 4 points

arange is like range, except it returns an array (not a list)

int_array = np.arange(5)
float_array = int_array.astype(float)


3.1.2 Examining arrays

arr1.dtype # int64 (inferred from the int data; floats would give float64)
arr2.dtype # int64 (platform dependent; may be int32 on Windows)
arr2.ndim # 2
arr2.shape # (2, 4) - axis 0 is rows, axis 1 is columns
arr2.size # 8 - total number of elements
len(arr2) # 2 - size of first dimension (aka axis)

3.1.3 Reshaping

arr = np.arange(10, dtype=float).reshape((2, 5))


print(arr.shape)
print(arr.reshape(5, 2))

Out:

(2, 5)
[[0. 1.]
[2. 3.]
[4. 5.]
[6. 7.]
[8. 9.]]

Add an axis

a = np.array([0, 1])
a_col = a[:, np.newaxis]
print(a_col)
#or
a_col = a[:, None]

Out:

[[0]
[1]]

Transpose

print(a_col.T)

Out:

[[0 1]]

Flatten: always returns a flat copy of the original array

arr_flt = arr.flatten()
arr_flt[0] = 33
print(arr_flt)
print(arr)

Out:


[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[0. 1. 2. 3. 4.]
[5. 6. 7. 8. 9.]]

Ravel: returns a view of the original array whenever possible.

arr_flt = arr.ravel()
arr_flt[0] = 33
print(arr_flt)
print(arr)

Out:

[33. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[[33. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]

3.1.4 Summary on axis, reshaping/flattening and selection

Numpy internals: by default Numpy uses the C convention, i.e., row-major order: the matrix is
stored by rows. In C, the last index changes most rapidly as one moves through the array as
stored in memory.
For 2D arrays, a sequential move in memory will:
• iterate over rows (axis 0)
  – iterate over columns (axis 1)
For 3D arrays, a sequential move in memory will:
• iterate over planes (axis 0)
  – iterate over rows (axis 1)
    * iterate over columns (axis 2)


x = np.arange(2 * 3 * 4)
print(x)

Out:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

Reshape into 3D (axis 0, axis 1, axis 2)

x = x.reshape(2, 3, 4)
print(x)

Out:

[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]

[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]

Selection: get the first plane

print(x[0, :, :])

Out:

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]

Selection: get the first row

print(x[:, 0, :])

Out:

[[ 0 1 2 3]
[12 13 14 15]]

Selection: get the first column

print(x[:, :, 0])

Out:

[[ 0 4 8]
[12 16 20]]

Ravel

print(x.ravel())

Out:


[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]

3.1.5 Stack arrays

Stack flat arrays in columns

a = np.array([0, 1])
b = np.array([2, 3])

ab = np.stack((a, b)).T
print(ab)

# or
np.hstack((a[:, None], b[:, None]))

Out:

[[0 2]
[1 3]]

3.1.6 Selection

Single item

arr = np.arange(10, dtype=float).reshape((2, 5))

arr[0] # 0th element (slices like a list)


arr[0, 3] # row 0, column 3: returns 3.0
arr[0][3] # alternative syntax

Slicing

Syntax: start:stop:step with start (default 0) stop (default last) step (default 1)

arr[0, :] # row 0: returns 1d array ([0., 1., 2., 3., 4.])
arr[:, 0] # column 0: returns 1d array ([0., 5.])
arr[:, :2] # columns strictly before index 2 (2 first columns)
arr[:, 2:] # columns after index 2 included
arr2 = arr[:, 1:4] # columns between index 1 (included) and 4 (excluded)
print(arr2)

Out:

[[1. 2. 3.]
[6. 7. 8.]]

Slicing returns a view (not a copy)

arr2[0, 0] = 33
print(arr2)
print(arr)


Out:

[[33. 2. 3.]
[ 6. 7. 8.]]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]

Row 0: reverse order

print(arr[0, ::-1])

# The rule of thumb here can be: in the context of lvalue indexing (i.e. the indices are
# placed in the left hand side value of an assignment), no view or copy of the array is
# created (because there is no need to). However, with regular values, the above rules
# for creating views do apply.

Out:

[ 4. 3. 2. 33. 0.]

Fancy indexing: Integer or boolean array indexing

Fancy indexing returns a copy not a view.


Integer array indexing

arr2 = arr[:, [1,2,3]] # return a copy


print(arr2)
arr2[0, 0] = 44
print(arr2)
print(arr)

Out:

[[33. 2. 3.]
[ 6. 7. 8.]]
[[44. 2. 3.]
[ 6. 7. 8.]]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]

Boolean arrays indexing

arr2 = arr[arr > 5] # return a copy

print(arr2)
arr2[0] = 44
print(arr2)
print(arr)

Out:

[33. 6. 7. 8. 9.]
[44. 6. 7. 8. 9.]
[[ 0. 33. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]]


However, in the context of lvalue indexing (left hand side value of an assignment), fancy
indexing authorizes the modification of the original array:

arr[arr > 5] = 0
print(arr)

Out:

[[0. 0. 2. 3. 4.]
[5. 0. 0. 0. 0.]]

Boolean arrays indexing continues

names = np.array(['Bob', 'Joe', 'Will', 'Bob'])


names == 'Bob' # returns a boolean array
names[names != 'Bob'] # logical selection
(names == 'Bob') | (names == 'Will') # keywords "and/or" don't work with boolean arrays
names[names != 'Bob'] = 'Joe' # assign based on a logical selection
np.unique(names) # set function

3.1.7 Vectorized operations

nums = np.arange(5)
nums * 10 # multiply each element by 10
nums = np.sqrt(nums) # square root of each element
np.ceil(nums) # also floor, rint (round to nearest int)
np.isnan(nums) # checks for NaN
nums + np.arange(5) # add element-wise
np.maximum(nums, np.array([1, -2, 3, -4, 5])) # compare element-wise

# Compute Euclidean distance between 2 vectors


vec1 = np.random.randn(10)
vec2 = np.random.randn(10)
dist = np.sqrt(np.sum((vec1 - vec2) ** 2))

# math and stats


rnd = np.random.randn(4, 2) # random normals in 4x2 array
rnd.mean()
rnd.std()
rnd.argmin() # index of minimum element
rnd.sum()
rnd.sum(axis=0) # sum of columns
rnd.sum(axis=1) # sum of rows

# methods for boolean arrays


(rnd > 0).sum() # counts number of positive values
(rnd > 0).any() # checks if any value is True
(rnd > 0).all() # checks if all values are True

# random numbers
np.random.seed(12234) # Set the seed
np.random.rand(2, 3) # 2 x 3 matrix in [0, 1]
np.random.randn(10) # random normals (mean 0, sd 1)
np.random.randint(0, 2, 10) # 10 randomly picked 0 or 1


3.1.8 Broadcasting

Sources: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html
Implicit conversion to allow operations on arrays of different sizes.
• The smaller array is stretched or "broadcasted" across the larger array so that they have compatible shapes.
• Fast vectorized operation in C instead of Python.
• No needless copies.

Rules

Starting with the trailing axis and working backward, Numpy compares array dimensions:
• If the two dimensions are equal, it continues.
• If one of the operands has dimension 1, it is stretched to match the larger one.
• When one of the shapes runs out of dimensions (because it has fewer dimensions than
the other shape), Numpy will use 1 in the comparison process until the other shape's
dimensions run out as well.

Fig. 1: Source: http://www.scipy-lectures.org

a = np.array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])

b = np.array([0, 1, 2])

print(a + b)

Out:

[[ 0 1 2]
[10 11 12]
[20 21 22]
[30 31 32]]

Examples
Shapes of operands A, B and result:

A (2d array): 5 x 4
B (1d array): 1
Result (2d array): 5 x 4

A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4

A (3d array): 15 x 3 x 5
B (3d array): 15 x 1 x 5
Result (3d array): 15 x 3 x 5

A (3d array): 15 x 3 x 5
B (2d array): 3 x 5
Result (3d array): 15 x 3 x 5

A (3d array): 15 x 3 x 5
B (2d array): 3 x 1
Result (3d array): 15 x 3 x 5

3.1.9 Exercises

Given the array:

X = np.random.randn(4, 2) # random normals in 4x2 array

• For each column find the row index of the minimum value.
• Write a function standardize(X) that returns an array whose columns are centered and
scaled (by std-dev).



3.2 Pandas: data manipulation

It is often said that 80% of data analysis is spent on cleaning and preparing data. This section
covers a small, but important, aspect of data manipulation and cleaning with Pandas.
Sources:
• Kevin Markham: https://github.com/justmarkham
• Pandas doc: http://pandas.pydata.org/pandas-docs/stable/index.html
Data structures
• Series is a one-dimensional labeled array capable of holding any data type (inte-
gers, strings, floating point numbers, Python objects, etc.). The axis labels are col-
lectively referred to as the index. The basic method to create a Series is to call
pd.Series([1,3,5,np.nan,6,8])
• DataFrame is a 2-dimensional labeled data structure with columns of potentially different
types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It
stems from the R data.frame() object.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

3.2.1 Create DataFrame

columns = ['name', 'age', 'gender', 'job']

user1 = pd.DataFrame([['alice', 19, "F", "student"],
                      ['john', 26, "M", "student"]],
                     columns=columns)

user2 = pd.DataFrame([['eric', 22, "M", "student"],
                      ['paul', 58, "F", "manager"]],
                     columns=columns)

user3 = pd.DataFrame(dict(name=['peter', 'julie'],
                          age=[33, 44], gender=['M', 'F'],
                          job=['engineer', 'scientist']))

print(user3)

Out:

name age gender job


0 peter 33 M engineer
1 julie 44 F scientist

3.2.2 Combining DataFrames


Concatenate DataFrame

user1.append(user2)
users = pd.concat([user1, user2, user3])
print(users)

Out:

name age gender job


0 alice 19 F student
1 john 26 M student
0 eric 22 M student
1 paul 58 F manager
0 peter 33 M engineer
1 julie 44 F scientist

Join DataFrame

user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'],
                          height=[165, 180, 175, 171]))
print(user4)

Out:

name height
0 alice 165
1 john 180
2 eric 175
3 julie 171

Use intersection of keys from both frames

merge_inter = pd.merge(users, user4, on="name")

print(merge_inter)

Out:

name age gender job height


0 alice 19 F student 165
1 john 26 M student 180
2 eric 22 M student 175
3 julie 44 F scientist 171

Use union of keys from both frames

users = pd.merge(users, user4, on="name", how='outer')


print(users)

Out:

name age gender job height


0 alice 19 F student 165.0
1 john 26 M student 180.0
2 eric 22 M student 175.0
3 paul 58 F manager NaN
4 peter 33 M engineer NaN
5 julie 44 F scientist 171.0

Reshaping by pivoting

“Unpivots” a DataFrame from wide format to long (stacked) format:

stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)

Out:

name variable value


0 alice age 19
1 john age 26
2 eric age 22
3 paul age 58
4 peter age 33
5 julie age 44
6 alice gender F
7 john gender M
8 eric gender M
9 paul gender F
10 peter gender M
11 julie gender F
12 alice job student
13 john job student
14 eric job student
15 paul job manager
16 peter job engineer
17 julie job scientist
18 alice height 165
19 john height 180
20 eric height 175
21 paul height NaN
22 peter height NaN
23 julie height 171

“Pivots” a DataFrame from long (stacked) format to wide format:

print(stacked.pivot(index='name', columns='variable', values='value'))

Out:

variable age gender height job


name
alice 19 F 165 student
eric 22 M 175 student
john 26 M 180 student
julie 44 F 171 scientist
paul 58 F NaN manager
peter 33 M NaN engineer


3.2.3 Summarizing

# examine the users data

users # print the first 30 and last 30 rows


type(users) # DataFrame
users.head() # print the first 5 rows
users.tail() # print the last 5 rows

users.index # "the index" (aka "the labels")


users.columns # column names (which is "an index")
users.dtypes # data types of each column
users.shape # number of rows and columns
users.values # underlying numpy array
users.info() # concise summary (includes memory usage as of pandas 0.15.0)

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 5 columns):
name 6 non-null object
age 6 non-null int64
gender 6 non-null object
job 6 non-null object
height 4 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 288.0+ bytes

3.2.4 Columns selection

users['gender'] # select one column


type(users['gender']) # Series
users.gender # select one column using the DataFrame

# select multiple columns


users[['age', 'gender']] # select two columns
my_cols = ['age', 'gender'] # or, create a list...
users[my_cols] # ...and use that list to select columns
type(users[my_cols]) # DataFrame

3.2.5 Rows selection (basic)

iloc is strictly integer position based

df = users.copy()
df.iloc[0] # first row
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55

for i in range(users.shape[0]):
    row = df.iloc[i]
    row.age *= 100  # setting a copy, and not the original frame data.

print(df) # df is not modified

Out:

/home/edouard/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:5096: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

self[name] = value
name age gender job height
0 55 19 F student 165.0
1 john 26 M student 180.0
2 eric 22 M student 175.0
3 paul 58 F manager NaN
4 peter 33 M engineer NaN
5 julie 44 F scientist 171.0

loc supports label-based access:

df = users.copy()
df.loc[0] # first row
df.loc[0, "age"] # first item of first row
df.loc[0, "age"] = 55

for i in range(df.shape[0]):
    df.loc[i, "age"] *= 10

print(df) # df is modified

Out:

name age gender job height


0 alice 550 F student 165.0
1 john 260 M student 180.0
2 eric 220 M student 175.0
3 paul 580 F manager NaN
4 peter 330 M engineer NaN
5 julie 440 F scientist 171.0

3.2.6 Rows selection (filtering)

Simple logical filtering

users[users.age < 20] # only show users with age < 20


young_bool = users.age < 20 # or, create a Series of booleans...
young = users[young_bool] # ...and use that Series to filter rows
users[users.age < 20].job # select one column from the filtered results
print(young)

Out:

name age gender job height


0 alice 19 F student 165.0

Advanced logical filtering

users[users.age < 20][['age', 'job']] # select multiple columns


users[(users.age > 20) & (users.gender == 'M')] # use multiple conditions
users[users.job.isin(['student', 'engineer'])] # filter specific values

3.2.7 Sorting

df = users.copy()

df.age.sort_values() # only works for a Series


df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df

print(df)

Out:

name age gender job height


4 peter 33 M engineer NaN
3 paul 58 F manager NaN
5 julie 44 F scientist 171.0
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0

3.2.8 Descriptive statistics

Summarize all numeric columns

print(df.describe())

Out:

age height
count 6.000000 4.000000
mean 33.666667 172.750000
std 14.895189 6.344289
min 19.000000 165.000000
25% 23.000000 169.500000
50% 29.500000 173.000000
75% 41.250000 176.250000
max 58.000000 180.000000

Summarize all columns

print(df.describe(include='all'))
print(df.describe(include=['object'])) # limit to one (or more) types


Out:

name age gender job height


count 6 6.000000 6 6 4.000000
unique 6 NaN 2 4 NaN
top eric NaN M student NaN
freq 1 NaN 3 3 NaN
mean NaN 33.666667 NaN NaN 172.750000
std NaN 14.895189 NaN NaN 6.344289
min NaN 19.000000 NaN NaN 165.000000
25% NaN 23.000000 NaN NaN 169.500000
50% NaN 29.500000 NaN NaN 173.000000
75% NaN 41.250000 NaN NaN 176.250000
max NaN 58.000000 NaN NaN 180.000000
name gender job
count 6 6 6
unique 6 2 4
top eric M student
freq 1 3 3

Statistics per group (groupby)

print(df.groupby("job").mean())

print(df.groupby("job")["age"].mean())

print(df.groupby("job").describe(include='all'))

Out:

age height
job
engineer 33.000000 NaN
manager 58.000000 NaN
scientist 44.000000 171.000000
student 22.333333 173.333333
job
engineer 33.000000
manager 58.000000
scientist 44.000000
student 22.333333
Name: age, dtype: float64
name ... height
count unique top freq mean ... min 25% 50% 75% max
job ...
engineer 1 1 peter 1 NaN ... NaN NaN NaN NaN NaN
manager 1 1 paul 1 NaN ... NaN NaN NaN NaN NaN
scientist 1 1 julie 1 NaN ... 171.0 171.0 171.0 171.0 171.0
student 3 3 eric 1 NaN ... 165.0 170.0 175.0 177.5 180.0

[4 rows x 44 columns]

Groupby in a loop

for grp, data in df.groupby("job"):
    print(grp, data)

Out:


engineer name age gender job height


4 peter 33 M engineer NaN
manager name age gender job height
3 paul 58 F manager NaN
scientist name age gender job height
5 julie 44 F scientist 171.0
student name age gender job height
0 alice 19 F student 165.0
2 eric 22 M student 175.0
1 john 26 M student 180.0

3.2.9 Quality check

Remove duplicate data

df = users.append(df.iloc[0], ignore_index=True)

print(df.duplicated()) # Series of booleans


# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows

Out:

0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool

Missing data

# Missing values are often just excluded


df = users.copy()

df.describe(include='all') # excludes missing values

# find missing values in a Series


df.height.isnull() # True if NaN, False otherwise
df.height.notnull() # False if NaN, True otherwise
df[df.height.notnull()] # only show rows where height is not NaN
df.height.isnull().sum() # count the missing values

# find missing values in a DataFrame


df.isnull() # DataFrame of booleans
df.isnull().sum() # calculate the sum of each column


Strategy 1: drop missing values

df.dropna() # drop a row if ANY values are missing


df.dropna(how='all') # drop a row only if ALL values are missing

Strategy 2: fill in missing values

df.height.mean()
df = users.copy()
df.loc[df.height.isnull(), "height"] = df["height"].mean()

print(df)

Out:

name age gender job height


0 alice 19 F student 165.00
1 john 26 M student 180.00
2 eric 22 M student 175.00
3 paul 58 F manager 172.75
4 peter 33 M engineer 172.75
5 julie 44 F scientist 171.00

3.2.10 Rename values

df = users.copy()
print(df.columns)
df.columns = ['nom', 'age', 'genre', 'travail', 'taille']

df.travail = df.travail.map({'student': 'etudiant', 'manager': 'manager',
                             'engineer': 'ingenieur', 'scientist': 'scientific'})
# assert df.travail.isnull().sum() == 0

df['travail'].str.contains("etu|inge")

Out:

Index(['name', 'age', 'gender', 'job', 'height'], dtype='object')

3.2.11 Dealing with outliers

size = pd.Series(np.random.normal(loc=175, size=20, scale=10))


# Corrupt the first 3 measures
size[:3] += 500

Based on parametric statistics: use the mean

Assume the random variable follows the normal distribution. Exclude data outside 3 standard deviations:
• Probability that a sample lies within 1 sd: 68.27%
• Probability that a sample lies within 3 sd: 99.73% (68.27 + 2 * 15.73)


size_outlr_mean = size.copy()
size_outlr_mean[((size - size.mean()).abs() > 3 * size.std())] = size.mean()
print(size_outlr_mean.mean())

Out:

248.48963819938044

Based on non-parametric statistics: use the median

Median absolute deviation (MAD), based on the median, is a robust non-parametric statistic.
https://en.wikipedia.org/wiki/Median_absolute_deviation

mad = 1.4826 * np.median(np.abs(size - size.median()))


size_outlr_mad = size.copy()

size_outlr_mad[((size - size.median()).abs() > 3 * mad)] = size.median()


print(size_outlr_mad.mean(), size_outlr_mad.median())

Out:

173.80000467192673 178.7023568870694

3.2.12 File I/O

csv

import tempfile, os.path


tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename, index=False)
other = pd.read_csv(csv_filename)

Read csv from url

url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

Excel

xls_filename = os.path.join(tmpdir, "users.xlsx")


users.to_excel(xls_filename, sheet_name='users', index=False)

pd.read_excel(xls_filename, sheet_name='users')

# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
    users.to_excel(writer, sheet_name='users', index=False)
    df.to_excel(writer, sheet_name='salary', index=False)

pd.read_excel(xls_filename, sheet_name='users')
pd.read_excel(xls_filename, sheet_name='salary')

SQL (SQLite)

import pandas as pd
import sqlite3

db_filename = os.path.join(tmpdir, "users.db")

Connect

conn = sqlite3.connect(db_filename)

Creating tables with pandas

url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

salary.to_sql("salary", conn, if_exists="replace")

Push modifications

cur = conn.cursor()
values = (100, 14000, 5, 'Bachelor', 'N')
cur.execute("insert into salary values (?, ?, ?, ?, ?)", values)
conn.commit()

Reading results into a pandas DataFrame

salary_sql = pd.read_sql_query("select * from salary;", conn)


print(salary_sql.head())

pd.read_sql_query("select * from salary;", conn).tail()


pd.read_sql_query('select * from salary where salary>25000;', conn)
pd.read_sql_query('select * from salary where experience=16;', conn)
pd.read_sql_query('select * from salary where education="Master";', conn)

Out:

index salary experience education management


0 0 13876 1 Bachelor Y
1 1 11608 1 Ph.D N
2 2 18701 1 Ph.D Y
3 3 11283 1 Master N
4 4 11767 1 Ph.D N

3.2.13 Exercises


Data Frame

1. Read the iris dataset at ‘https://github.com/neurospin/pystatsml/tree/master/datasets/iris.csv’
2. Print column names
3. Get numerical columns
4. For each species compute the mean of numerical columns and store it in a stats table
like:

species sepal_length sepal_width petal_length petal_width


0 setosa 5.006 3.428 1.462 0.246
1 versicolor 5.936 2.770 4.260 1.326
2 virginica 6.588 2.974 5.552 2.026

Missing data

Add some missing data to the previous table users:

df = users.copy()
df.loc[[0, 2], "age"] = None
df.loc[[1, 3], "gender"] = None

1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical
columns with the mean of the current column.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets:
original, imputed.


3.3 Matplotlib: data visualization

Sources:
• Nicolas P. Rougier: http://www.labri.fr/perso/nrougier/teaching/matplotlib
• https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations

3.3.1 Basic plots

import numpy as np
import matplotlib.pyplot as plt

# inline plot (for jupyter)


%matplotlib inline

x = np.linspace(0, 10, 50)


sinus = np.sin(x)

plt.plot(x, sinus)
plt.show()

plt.plot(x, sinus, "o")


plt.show()
# see help(plt.plot) for color / marker abbreviations


# Rapid multiplot

cosinus = np.cos(x)
plt.plot(x, sinus, "-b", x, sinus, "ob", x, cosinus, "-r", x, cosinus, "or")
plt.xlabel('this is x!')
plt.ylabel('this is y!')
plt.title('My First Plot')
plt.show()

# Step by step
plt.plot(x, sinus, label='sinus', color='blue', linestyle='--', linewidth=2)
plt.plot(x, cosinus, label='cosinus', color='red', linestyle='-', linewidth=2)
plt.legend()
plt.show()

3.3.2 Scatter (2D) plots

Load dataset

import pandas as pd

try:
    salary = pd.read_csv("../datasets/salary_table.csv")
except:
    url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
    salary = pd.read_csv(url)

df = salary

Simple scatter with colors

colors = colors_edu = {'Bachelor': 'r', 'Master': 'g', 'Ph.D': 'blue'}

plt.scatter(df['experience'], df['salary'],
            c=df['education'].apply(lambda x: colors[x]), s=100)

<matplotlib.collections.PathCollection at 0x7f39efac6358>


Scatter plot with colors and symbols

## Figure size
plt.figure(figsize=(6, 5))

## Define colors / symbols manually
symbols_manag = dict(Y='*', N='.')
colors_edu = {'Bachelor': 'r', 'Master': 'g', 'Ph.D': 'b'}

## Group by education x management => 6 groups
for values, d in salary.groupby(['education', 'management']):
    edu, manager = values
    plt.scatter(d['experience'], d['salary'], marker=symbols_manag[manager],
                color=colors_edu[edu], s=150, label=manager + "/" + edu)

## Set labels
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.legend(loc=4)  # lower right
plt.show()


3.3.3 Saving Figures

### bitmap format


plt.plot(x, sinus)
plt.savefig("sinus.png")
plt.close()

# Prefer a vector format (SVG: Scalable Vector Graphics), which can be edited with
# Inkscape, Adobe Illustrator, Blender, etc.
plt.plot(x, sinus)
plt.savefig("sinus.svg")
plt.close()

# Or pdf
plt.plot(x, sinus)
plt.savefig("sinus.pdf")
plt.close()

3.3.4 Seaborn

Sources:
• http://stanford.edu/~mwaskom/software/seaborn
• https://elitedatascience.com/python-seaborn-tutorial

If needed, install using: pip install -U --user seaborn


Boxplot

Box plots are non-parametric: they display variation in samples of a statistical population
without making any assumptions about the underlying statistical distribution.


import seaborn as sns

sns.boxplot(x="education", y="salary", hue="management", data=salary)

<matplotlib.axes._subplots.AxesSubplot at 0x7f39ed42ff28>


sns.boxplot(x="management", y="salary", hue="education", data=salary)


sns.stripplot(x="management", y="salary", hue="education", data=salary, jitter=True,␣
˓→dodge=True, linewidth=1)# Jitter and split options separate datapoints according to␣
˓→group"

<matplotlib.axes._subplots.AxesSubplot at 0x7f39eb61d780>

Density plot with one figure containing multiple axes

One figure can contain several axes, which contain the graphic elements.


# Set up the matplotlib figure: 3 x 1 axes
f, axes = plt.subplots(3, 1, figsize=(9, 9), sharex=True)

i = 0
for edu, d in salary.groupby(['education']):
    sns.distplot(d.salary[d.management == "Y"], color="b", bins=10,
                 label="Manager", ax=axes[i])
    sns.distplot(d.salary[d.management == "N"], color="r", bins=10,
                 label="Employee", ax=axes[i])
    axes[i].set_title(edu)
    axes[i].set_ylabel('Density')
    i += 1
ax = plt.legend()


Violin plot (distribution)

ax = sns.violinplot(x="salary", data=salary)

Tune bandwidth

ax = sns.violinplot(x="salary", data=salary, bw=.15)

ax = sns.violinplot(x="management", y="salary", hue="education", data=salary)


Tips dataset

One waiter recorded information about each tip he received over a period of a few months
working in one restaurant. He collected several variables:

import seaborn as sns


#sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
print(tips.head())

ax = sns.violinplot(x=tips["total_bill"])

total_bill tip sex smoker day time size


0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4


Group by day

ax = sns.violinplot(x="day", y="total_bill", data=tips, palette="muted")

Group by day and color by time (lunch vs dinner)

ax = sns.violinplot(x="day", y="total_bill", hue="time", data=tips, palette="muted",␣


˓→split=True)


Pairwise scatter plots

g = sns.PairGrid(salary, hue="management")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
ax = g.add_legend()


3.3.5 Time series

import seaborn as sns


sns.set(style="darkgrid")

# Load an example dataset with long-form data


fmri = sns.load_dataset("fmri")

# Plot the responses for different events and regions

ax = sns.pointplot(x="timepoint", y="signal",
hue="region", style="event",
data=fmri)
# version 0.9
# sns.lineplot(x="timepoint", y="signal",
# hue="region", style="event",
# data=fmri)



CHAPTER

FOUR

STATISTICS

4.1 Univariate statistics

Basic univariate statistics are required to explore a dataset:

• Discover associations between a variable of interest and potential predictors. It is strongly
recommended to start with simple univariate methods before moving to complex multivariate
predictors.
• Assess the prediction performances of machine learning predictors.
• Most of the univariate statistics are based on the linear model, which is one of the main
models in machine learning.

4.1.1 Estimators of the main statistical measures

Mean

Properties of the expected value operator $E(\cdot)$ of a random variable $X$:

$$E(X + c) = E(X) + c \qquad (4.1)$$
$$E(X + Y) = E(X) + E(Y) \qquad (4.2)$$
$$E(aX) = aE(X) \qquad (4.3)$$

The estimator $\bar{x}$ on a sample of size $n$: $x = x_1, ..., x_n$ is given by

$$\bar{x} = \frac{1}{n} \sum_i x_i$$

$\bar{x}$ is itself a random variable with properties:

• $E(\bar{x}) = \bar{x}$,
• $Var(\bar{x}) = \frac{Var(X)}{n}$.

Variance

$$Var(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2$$


The estimator is

$$\sigma_x^2 = \frac{1}{n - 1} \sum_i (x_i - \bar{x})^2$$

Note here the subtracted 1 degree of freedom (df) in the divisor. In standard statistical practice,
𝑑𝑓 = 1 provides an unbiased estimator of the variance of a hypothetical infinite population.
With 𝑑𝑓 = 0 it instead provides a maximum likelihood estimate of the variance for normally
distributed variables.
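
A minimal numpy illustration of the two divisors (the sample x is arbitrary):

import numpy as np

x = np.array([1., 2., 3., 4.])
print(np.var(x))          # ddof=0 (default): maximum likelihood (biased) estimate, 1.25
print(np.var(x, ddof=1))  # ddof=1: unbiased estimate, 5 / 3 ~= 1.667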

Standard deviation
$$Std(X) = \sqrt{Var(X)}$$

The estimator is simply $\sigma_x = \sqrt{\sigma_x^2}$.

Covariance

$$Cov(X, Y) = E((X - E(X))(Y - E(Y))) = E(XY) - E(X)E(Y).$$

Properties:

• $Cov(X, X) = Var(X)$
• $Cov(X, Y) = Cov(Y, X)$
• $Cov(cX, Y) = c\,Cov(X, Y)$
• $Cov(X + c, Y) = Cov(X, Y)$

The estimator with $df = 1$ is

$$\sigma_{xy} = \frac{1}{n - 1} \sum_i (x_i - \bar{x})(y_i - \bar{y}).$$

Correlation

$$Cor(X, Y) = \frac{Cov(X, Y)}{Std(X)\,Std(Y)}$$

The estimator is

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.$$

Standard Error (SE)

The standard error (SE) is the standard deviation (of the sampling distribution) of a statistic:

$$SE(X) = \frac{Std(X)}{\sqrt{n}}.$$

It is most commonly considered for the mean with the estimator

$$SE(\bar{x}) = Std(\bar{x}) = \sigma_{\bar{x}} \qquad (4.4)$$
$$= \frac{\sigma_x}{\sqrt{n}}. \qquad (4.5)$$
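
The SE of the mean can be checked by simulation: the standard deviation of many sample means should be close to $\sigma_x/\sqrt{n}$. A minimal sketch (the chosen $\mu$, $\sigma$ and $n$ are arbitrary):

import numpy as np

np.random.seed(42)
mu, sigma, n = 0, 2, 50

# standard deviation of 10000 sample means, each computed on n observations
means = np.random.normal(mu, sigma, size=(10000, n)).mean(axis=1)
print(means.std(ddof=1))   # empirical SE of the mean
print(sigma / np.sqrt(n))  # theoretical SE: sigma / sqrt(n) ~= 0.283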


Exercises

• Generate 2 random samples: $x \sim N(1.78, 0.1)$ and $y \sim N(1.66, 0.1)$, both of size 10.
• Compute $\bar{x}$, $\sigma_x$, $\sigma_{xy}$ (xbar, xvar, xycov) using only the np.sum() operation. Explore
the np. module to find out which numpy functions perform the same computations and
compare them (using assert) with your previous results.

4.1.2 Main distributions

Normal distribution

The normal distribution, noted $\mathcal{N}(\mu, \sigma)$, has parameters $\mu$, the mean (location), and $\sigma > 0$, the standard deviation. Estimators: $\bar{x}$ and $\sigma_x$.

The normal distribution, noted $\mathcal{N}$, is useful because of the central limit theorem (CLT), which
states that: given certain conditions, the arithmetic mean of a sufficiently large number of iterates
of independent random variables, each with a well-defined expected value and well-defined
variance, will be approximately normally distributed, regardless of the underlying distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline

mu = 0  # mean
variance = 2  # variance
sigma = np.sqrt(variance)  # standard deviation
x = np.linspace(mu - 3 * variance, mu + 3 * variance, 100)
plt.plot(x, norm.pdf(x, mu, sigma))

[<matplotlib.lines.Line2D at 0x7f5cd6d3afd0>]


The Chi-Square distribution

The chi-square or $\chi^2_n$ distribution with $n$ degrees of freedom (df) is the distribution of a sum of
the squares of $n$ independent standard normal random variables $\mathcal{N}(0, 1)$. Let $X \sim \mathcal{N}(\mu, \sigma^2)$,
then $Z = (X - \mu)/\sigma \sim \mathcal{N}(0, 1)$, and:
• The squared standard normal $Z^2 \sim \chi^2_1$ (one df).
• The distribution of the sum of squares of $n$ normal random variables: $\sum_i^n Z_i^2 \sim \chi^2_n$.

The sum of two $\chi^2$ RVs with $p$ and $q$ df is a $\chi^2$ RV with $p + q$ df. This is useful when
summing/subtracting sums of squares.

The $\chi^2$-distribution is used to model errors measured as sums of squares or the distribution of
the sample variance.
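
A minimal simulation sketch of this property (the sample size and df are arbitrary): the sum of $n$ squared standard normals has the mean of a $\chi^2_n$ variable, i.e. $n$.

import numpy as np
import scipy.stats

np.random.seed(42)
n = 5  # degrees of freedom

# 10000 realisations of a sum of n squared standard normals
samples = (np.random.randn(10000, n) ** 2).sum(axis=1)
print(samples.mean())               # close to n
print(scipy.stats.chi2.mean(df=n))  # 5.0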

The Fisher’s F-distribution

The $F$-distribution, $F_{n,p}$, with $n$ and $p$ degrees of freedom is the ratio of two independent $\chi^2$
variables. Let $X \sim \chi^2_n$ and $Y \sim \chi^2_p$, then:

$$F_{n,p} = \frac{X/n}{Y/p}$$

The $F$-distribution plays a central role in hypothesis testing, answering the question: are two
variances equal? Is the ratio of two errors significantly large?

import numpy as np
from scipy.stats import f
import matplotlib.pyplot as plt
%matplotlib inline

fvalues = np.linspace(.1, 5, 100)

# pdf(x, df1, df2): Probability density function at x of F.


plt.plot(fvalues, f.pdf(fvalues, 1, 30), 'b-', label="F(1, 30)")
plt.plot(fvalues, f.pdf(fvalues, 5, 30), 'r-', label="F(5, 30)")
plt.legend()

# cdf(x, df1, df2): Cumulative distribution function of F.


# ie.
proba_at_f_inf_3 = f.cdf(3, 1, 30) # P(F(1,30) < 3)

# ppf(q, df1, df2): Percent point function (inverse of cdf) at q of F.


f_at_proba_inf_95 = f.ppf(.95, 1, 30) # q such P(F(1,30) < .95)
assert f.cdf(f_at_proba_inf_95, 1, 30) == .95

# sf(x, df1, df2): Survival function (1 - cdf) at x of F.


proba_at_f_sup_3 = f.sf(3, 1, 30) # P(F(1,30) > 3)
assert proba_at_f_inf_3 + proba_at_f_sup_3 == 1

# p-value: P(F(1, 30)) < 0.05


low_proba_fvalues = fvalues[fvalues > f_at_proba_inf_95]
plt.fill_between(low_proba_fvalues, 0, f.pdf(low_proba_fvalues, 1, 30),
alpha=.8, label="P < 0.05")
plt.show()


The Student’s 𝑡-distribution

Let $M \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_n$. The $t$-distribution, $T_n$, with $n$ degrees of freedom is the ratio:

$$T_n = \frac{M}{\sqrt{V/n}}$$

The distribution of the difference between an estimated parameter and its true (or assumed)
value, divided by the standard deviation of the estimated parameter (standard error), follows a
$t$-distribution. Is this parameter different from a given value?

4.1.3 Hypothesis Testing

Examples
• Test a proportion: Biased coin? 200 heads have been found over 300 flips, is the coin
biased?
• Test the association between two variables.
– Example height and sex: In a sample of 25 individuals (15 females, 10 males), is
female height different from male height?
– Example age and arterial hypertension: In a sample of 25 individuals, is age
correlated with arterial hypertension?
Steps
1. Model the data.
2. Fit: estimate the model parameters (frequency, mean, correlation, regression coefficient).
3. Compute a test statistic from the model parameters.
4. Formulate the null hypothesis: what would be the (distribution of the) test statistic if the
observations are the result of pure chance?


5. Compute the probability (𝑝-value) to obtain a larger value for the test statistic by chance
(under the null hypothesis).

Flip coin: Simplified example

Biased coin? 2 heads have been found over 3 flips, is the coin biased?
1. Model the data: the number of heads follows a Binomial distribution.
2. Compute model parameters: N=3, P = the frequency of heads over the number of flips: 2/3.
3. Compute a test statistic, same as the frequency.
4. Under the null hypothesis the distribution of the number of heads is:

flip 1   flip 2   flip 3   #heads
  -        -        -        0
  H        -        -        1
  -        H        -        1
  -        -        H        1
  H        H        -        2
  H        -        H        2
  -        H        H        2
  H        H        H        3

8 possible configurations; $x$ measures the number of heads (successes), with probabilities:
• 𝑃 (𝑥 = 0) = 1/8
• 𝑃 (𝑥 = 1) = 3/8
• 𝑃 (𝑥 = 2) = 3/8
• 𝑃 (𝑥 = 3) = 1/8

plt.bar([0, 1, 2, 3], [1/8, 3/8, 3/8, 1/8], width=0.9)


_ = plt.xticks([0, 1, 2, 3], [0, 1, 2, 3])
plt.xlabel("Distribution of the number of head over 3 flip under the null hypothesis")

Text(0.5, 0, 'Distribution of the number of head over 3 flip under the null hypothesis')


5. Compute the probability ($p$-value) of observing a value larger than or equal to 2 under the null
hypothesis. This probability is the $p$-value:

$$P(x \geq 2|H_0) = P(x = 2) + P(x = 3) = 3/8 + 1/8 = 4/8 = 1/2$$

Flip coin: Real Example

Biased coin? 60 heads have been found over 100 flips, is the coin biased?
1. Model the data: the number of heads follows a Binomial distribution.
2. Compute model parameters: N=100, P=60/100.
3. Compute a test statistic, same as the frequency: 60/100.
4. Under the null hypothesis the distribution of the number of heads ($k$) follows the binomial
distribution with parameters N=100, P=0.5:

$$Pr(X = k|H_0) = Pr(X = k|n = 100, p = 0.5) = \binom{100}{k} 0.5^k (1 - 0.5)^{(100-k)}.$$

$$P(X \geq 60|H_0) = \sum_{k=60}^{100} \binom{100}{k} 0.5^k (1 - 0.5)^{(100-k)}
= 1 - \sum_{k=1}^{60} \binom{100}{k} 0.5^k (1 - 0.5)^{(100-k)},$$

the cumulative distribution function.

Use tabulated binomial distribution


import scipy.stats
import matplotlib.pyplot as plt

succes = np.linspace(30, 70, 41)
plt.plot(succes, scipy.stats.binom.pmf(succes, 100, 0.5),
         'b-', label="Binomial(100, 0.5)")

upper_succes_tvalues = succes[succes > 60]
plt.fill_between(upper_succes_tvalues, 0,
                 scipy.stats.binom.pmf(upper_succes_tvalues, 100, 0.5),
                 alpha=.8, label="p-value")
_ = plt.legend()

pval = 1 - scipy.stats.binom.cdf(60, 100, 0.5)
print(pval)

0.01760010010885238

Random sampling of the Binomial distribution under the null hypothesis

success_h0 = scipy.stats.binom.rvs(100, 0.5, size=10000, random_state=4)
print(success_h0)

import seaborn as sns
_ = sns.distplot(success_h0, hist=False)

pval_rnd = np.sum(success_h0 >= 60) / (len(success_h0) + 1)
print("P-value using monte-carlo sampling of the Binomial distribution under H0=",
      pval_rnd)

[60 52 51 ... 45 51 44]

P-value using monte-carlo sampling of the Binomial distribution under H0= 0.025897410258974102


One sample 𝑡-test

The one-sample $t$-test is used to determine whether a sample comes from a population with a
specific mean. For example you want to test if the average height of a population is 1.75 m.

1. Model the data

Assume that height is normally distributed: $X \sim \mathcal{N}(\mu, \sigma)$, i.e.:

$$\text{height}_i = \text{average height over the population} + \text{error}_i \qquad (4.6)$$
$$x_i = \bar{x} + \varepsilon_i \qquad (4.7)$$

The $\varepsilon_i$ are called the residuals.

2. Fit: estimate the model parameters

$\bar{x}$, $s_x$ are the estimators of $\mu$, $\sigma$.
3. Compute a test statistic

In testing the null hypothesis that the population mean is equal to a specified value $\mu_0 = 1.75$,
one uses the statistic:

$$t = \frac{\text{difference of means}}{\text{std-dev of noise}} \sqrt{n} \qquad (4.8)$$
$$t = \text{effect size} \cdot \sqrt{n} \qquad (4.9)$$
$$t = \frac{\bar{x} - \mu_0}{s_x} \sqrt{n} \qquad (4.10)$$

Remarks: Although the parent population does not need to be normally distributed, the
distribution of the population of sample means, $\bar{x}$, is assumed to be normal. By the central limit
theorem, if the sampling of the parent population is independent, then the sample means will
be approximately normal.
4. Compute the probability of the test statistic under the null hypothesis. This requires knowing the
distribution of the $t$ statistic under $H_0$.

Example

Given the following sample, we will test whether its true mean is 1.75.
Warning, when computing the std or the variance, set ddof=1. The default value, ddof=0, leads
to the biased estimator of the variance.

import numpy as np

x = [1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87]

xbar = np.mean(x) # sample mean


mu0 = 1.75 # hypothesized value
s = np.std(x, ddof=1) # sample standard deviation
n = len(x) # sample size

print(xbar)

tobs = (xbar - mu0) / (s / np.sqrt(n))


print(tobs)

1.816
2.3968766311585883

The $p$-value is the probability of observing a value $t$ more extreme than the observed one
$t_{obs}$ under the null hypothesis $H_0$: $P(t > t_{obs}|H_0)$

import scipy.stats as stats
import matplotlib.pyplot as plt

#tobs = 2.39687663116 # assume the t-value

tvalues = np.linspace(-10, 10, 100)
plt.plot(tvalues, stats.t.pdf(tvalues, n-1), 'b-', label="T(n-1)")
upper_tval_tvalues = tvalues[tvalues > tobs]
plt.fill_between(upper_tval_tvalues, 0, stats.t.pdf(upper_tval_tvalues, n-1),
                 alpha=.8, label="p-value")
_ = plt.legend()
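
The shaded area can also be computed directly with the survival function of the $t$-distribution, and the whole test is available as scipy.stats.ttest_1samp (which returns the two-sided $p$-value). A minimal sketch reusing x, tobs, n and mu0 defined above:

pval_one_sided = stats.t.sf(tobs, n - 1)              # P(t > tobs | H0)
pval_two_sided = 2 * stats.t.sf(np.abs(tobs), n - 1)  # two-sided p-value
print(pval_one_sided, pval_two_sided)

print(stats.ttest_1samp(x, mu0))  # statistic and two-sided p-value in one call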


4.1.4 Testing pairwise associations

Univariate statistical analysis: explore association between pairs of variables.


• In statistics, a categorical variable or factor is a variable that can take on one of a limited,
and usually fixed, number of possible values, thus assigning each individual to a particular
group or “category”. The levels are the possible values of the variable. Number of levels
= 2: binomial; Number of levels > 2: multinomial. There is no intrinsic ordering to the
categories. For example, gender is a categorical variable having two categories (male and
female) and there is no intrinsic ordering to the categories. For example, Sex (Female,
Male), Hair color (blonde, brown, etc.).
• An ordinal variable is a categorical variable with a clear ordering of the levels. For
example: drinks per day (none, small, medium and high).
• A continuous or quantitative variable 𝑥 ∈ R is one that can take any value in a range of
possible values, possibly infinite. E.g.: salary, experience in years, weight.
What statistical test should I use?
See: http://www.ats.ucla.edu/stat/mult_pkg/whatstat/
Pearson correlation test: test association between two quantitative variables
Test the correlation coefficient of two quantitative variables. The test calculates a Pearson
correlation coefficient and the $p$-value for testing non-correlation.

Let $x$ and $y$ be two quantitative variables, where $n$ samples were observed. The linear correlation
coefficient is defined as:

$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}.$$

Under $H_0$, the test statistic $t = \sqrt{n - 2}\, \frac{r}{\sqrt{1 - r^2}}$ follows a Student distribution with $n - 2$ degrees of
freedom.


Fig. 1: Statistical tests


import numpy as np
import scipy.stats as stats
n = 50
x = np.random.normal(size=n)
y = 2 * x + np.random.normal(size=n)

# Compute with scipy


cor, pval = stats.pearsonr(x, y)
print(cor, pval)

0.904453622242007 2.189729365511301e-19
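
As a check, the $p$-value returned by stats.pearsonr can be recovered from the $t$ statistic given above. A minimal sketch reusing cor and n from the previous cell:

t = np.sqrt(n - 2) * cor / np.sqrt(1 - cor ** 2)
pval_from_t = 2 * stats.t.sf(np.abs(t), n - 2)  # two-sided p-value
print(t, pval_from_t)  # matches the p-value of stats.pearsonr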

Two sample (Student) 𝑡-test: compare two means

Fig. 2: Two-sample model

The two-sample 𝑡-test (Snedecor and Cochran, 1989) is used to determine if two population
means are equal. There are several variations on this test. If data are paired (e.g. 2 measures,
before and after treatment for each individual) use the one-sample 𝑡-test of the difference. The
variances of the two samples may be assumed to be equal (a.k.a. homoscedasticity) or unequal
(a.k.a. heteroscedasticity).

1. Model the data

Assume that the two random variables are normally distributed: $y_1 \sim \mathcal{N}(\mu_1, \sigma_1)$, $y_2 \sim \mathcal{N}(\mu_2, \sigma_2)$.

2. Fit: estimate the model parameters

Estimate means and variances: $\bar{y}_1$, $s^2_{y_1}$, $\bar{y}_2$, $s^2_{y_2}$.

3. 𝑡-test

The general principle is


$$t = \frac{\text{difference of means}}{\text{standard dev of error}} \qquad (4.11)$$
$$= \frac{\text{difference of means}}{\text{its standard error}} \qquad (4.12)$$
$$= \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\sum \varepsilon^2}} \sqrt{n - 2} \qquad (4.13)$$
$$= \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}} \qquad (4.14)$$

Since $y_1$ and $y_2$ are independent:

$$s^2_{\bar{y}_1 - \bar{y}_2} = s^2_{\bar{y}_1} + s^2_{\bar{y}_2} = \frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2} \qquad (4.15)$$

thus

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}} \qquad (4.17)$$

Equal or unequal sample sizes, unequal variances (Welch’s 𝑡-test)

Welch’s $t$-test defines the $t$ statistic as

$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}}}.$$

To compute the $p$-value one needs the degrees of freedom associated with this variance estimate.
It is approximated using the Welch–Satterthwaite equation:

$$\nu \approx \frac{\left(\frac{s^2_{y_1}}{n_1} + \frac{s^2_{y_2}}{n_2}\right)^2}{\frac{s^4_{y_1}}{n_1^2 (n_1 - 1)} + \frac{s^4_{y_2}}{n_2^2 (n_2 - 1)}}.$$

Equal or unequal sample sizes, equal variances

If we assume equal variances (i.e., $s^2_{y_1} = s^2_{y_2} = s^2$), where $s^2$ is an estimator of the common
variance of the two samples:

$$s^2 = \frac{s^2_{y_1} (n_1 - 1) + s^2_{y_2} (n_2 - 1)}{n_1 + n_2 - 2} \qquad (4.18)$$
$$= \frac{\sum_i^{n_1} (y_{1i} - \bar{y}_1)^2 + \sum_j^{n_2} (y_{2j} - \bar{y}_2)^2}{(n_1 - 1) + (n_2 - 1)} \qquad (4.19)$$

then

$$s_{\bar{y}_1 - \bar{y}_2} = \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} = s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Therefore, the $t$ statistic used to test whether the means are different is:

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},$$

Equal sample sizes, equal variances

If we simplify the problem assuming equal samples of size $n_1 = n_2 = n$ we get

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s \sqrt{2}} \cdot \sqrt{n} \qquad (4.20)$$
$$\approx \text{effect size} \cdot \sqrt{n} \qquad (4.21)$$
$$\approx \frac{\text{difference of means}}{\text{standard deviation of the noise}} \cdot \sqrt{n} \qquad (4.22)$$

Example

Given the following two samples, test whether their means are equal using the standard t-test,
assuming equal variance.

import scipy.stats as stats

height = np.array([ 1.83, 1.83, 1.73, 1.82, 1.83, 1.73, 1.99, 1.85, 1.68, 1.87,
1.66, 1.71, 1.73, 1.64, 1.70, 1.60, 1.79, 1.73, 1.62, 1.77])

grp = np.array(["M"] * 10 + ["F"] * 10)

# Compute with scipy


print(stats.ttest_ind(height[grp == "M"], height[grp == "F"], equal_var=True))

Ttest_indResult(statistic=3.5511519888466885, pvalue=0.00228208937112721)
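
As a check, a minimal sketch computing the pooled-variance $t$ statistic from the formulas above, and running the Welch variant by passing equal_var=False:

import numpy as np
import scipy.stats as stats

y1, y2 = height[grp == "M"], height[grp == "F"]
n1, n2 = len(y1), len(y2)

# pooled variance and standard error of the difference of means
s2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(s2 / n1 + s2 / n2)

t = (y1.mean() - y2.mean()) / se
pval = 2 * stats.t.sf(np.abs(t), n1 + n2 - 2)
print(t, pval)  # matches stats.ttest_ind(..., equal_var=True)

# Welch's t-test (unequal variances)
print(stats.ttest_ind(y1, y2, equal_var=False))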

ANOVA 𝐹 -test (quantitative ~ categorical (>=2 levels))

Analysis of variance (ANOVA) provides a statistical test of whether or not the means of several
groups are equal, and therefore generalizes the 𝑡-test to more than two groups. ANOVAs are
useful for comparing (testing) three or more means (groups or variables) for statistical signifi-
cance. It is conceptually similar to multiple two-sample 𝑡-tests, but is less conservative.
Here we will consider one-way ANOVA with one independent variable, ie one-way anova.
Wikipedia:
• Test if any group is on average superior, or inferior, to the others versus the null hypothesis
that all four strategies yield the same mean response
• Detect any of several possible differences.
• The advantage of the ANOVA 𝐹 -test is that we do not need to pre-specify which strategies
are to be compared, and we do not need to adjust for making multiple comparisons.


• The disadvantage of the ANOVA 𝐹 -test is that if we reject the null hypothesis, we do not
know which strategies can be said to be significantly different from the others.

1. Model the data

A company has applied three marketing strategies to three samples of customers in order to
increase their business volume. The marketing department is asking whether the strategies led to
different increases of business volume. Let $y_1$, $y_2$ and $y_3$ be the three samples of business volume increase.
Here we assume that the three populations were sampled from three random variables that are
normally distributed. I.e., $Y_1 \sim N(\mu_1, \sigma_1)$, $Y_2 \sim N(\mu_2, \sigma_2)$ and $Y_3 \sim N(\mu_3, \sigma_3)$.

2. Fit: estimate the model parameters

Estimate means and variances: $\bar{y}_i$, $\sigma_i$, $\forall i \in \{1, 2, 3\}$.

3. 𝐹 -test

The formula for the one-way ANOVA $F$-test statistic is

$$F = \frac{\text{Explained variance}}{\text{Unexplained variance}} \qquad (4.23)$$
$$= \frac{\text{Between-group variability}}{\text{Within-group variability}} = \frac{s^2_B}{s^2_W}. \qquad (4.24)$$

The “explained variance”, or “between-group variability” is

$$s^2_B = \sum_i n_i (\bar{y}_{i\cdot} - \bar{y})^2 / (K - 1),$$

where $\bar{y}_{i\cdot}$ denotes the sample mean in the $i$th group, $n_i$ is the number of observations in the $i$th
group, $\bar{y}$ denotes the overall mean of the data, and $K$ denotes the number of groups.

The “unexplained variance”, or “within-group variability” is

$$s^2_W = \sum_{ij} (y_{ij} - \bar{y}_{i\cdot})^2 / (N - K),$$

where $y_{ij}$ is the $j$th observation in the $i$th out of $K$ groups and $N$ is the overall sample size.
This $F$-statistic follows the $F$-distribution with $K - 1$ and $N - K$ degrees of freedom under the
null hypothesis. The statistic will be large if the between-group variability is large relative to
the within-group variability, which is unlikely to happen if the population means of the groups
all have the same value.

Note that when there are only two groups for the one-way ANOVA $F$-test, $F = t^2$ where $t$ is the
Student’s $t$ statistic.
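
One-way ANOVA is available in scipy as stats.f_oneway. A minimal sketch on simulated data (the group means and sizes are arbitrary):

import numpy as np
import scipy.stats as stats

np.random.seed(42)
# three simulated samples of business volume increase
y1 = np.random.normal(loc=1.0, scale=.5, size=30)
y2 = np.random.normal(loc=1.2, scale=.5, size=30)
y3 = np.random.normal(loc=1.6, scale=.5, size=30)

fval, pval = stats.f_oneway(y1, y2, y3)
print("F = %f, p-value = %f" % (fval, pval))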


Chi-square, 𝜒2 (categorical ~ categorical)

Computes the chi-square, 𝜒2 , statistic and 𝑝-value for the hypothesis test of independence of
frequencies in the observed contingency table (cross-table). The observed frequencies are tested
against an expected contingency table obtained by computing expected frequencies based on
the marginal sums under the assumption of independence.
Example: 20 participants: 10 exposed to some chemical product and 10 non exposed (exposed
= 1 or 0). Among the 20 participants 10 had cancer 10 not (cancer = 1 or 0). 𝜒2 tests the
association between those two variables.

import numpy as np
import pandas as pd
import scipy.stats as stats

# Dataset:
# 20 samples:
# 10 first exposed
exposed = np.array([1] * 10 + [0] * 10)
# 8 first with cancer, 10 without, the last two with.
cancer = np.array([1] * 8 + [0] * 10 + [1] * 2)

crosstab = pd.crosstab(exposed, cancer, rownames=['exposed'],
                       colnames=['cancer'])
print("Observed table:")
print("---------------")
print(crosstab)

chi2, pval, dof, expected = stats.chi2_contingency(crosstab)


print("Statistics:")
print("-----------")
print("Chi2 = %f, pval = %f" % (chi2, pval))
print("Expected table:")
print("---------------")
print(expected)

Observed table:
---------------
cancer 0 1
exposed
0 8 2
1 2 8
Statistics:
-----------
Chi2 = 5.000000, pval = 0.025347
Expected table:
---------------
[[5. 5.]
[5. 5.]]

Computing expected cross-table

# Compute expected cross-table based on proportion


exposed_marg = crosstab.sum(axis=0)
exposed_freq = exposed_marg / exposed_marg.sum()


cancer_marg = crosstab.sum(axis=1)
cancer_freq = cancer_marg / cancer_marg.sum()

print('Exposed frequency? Yes: %.2f' % exposed_freq[0],
      'No: %.2f' % exposed_freq[1])
print('Cancer frequency? Yes: %.2f' % cancer_freq[0],
      'No: %.2f' % cancer_freq[1])

print('Expected frequencies:')
print(np.outer(exposed_freq, cancer_freq))

print('Expected cross-table (frequencies * N): ')


print(np.outer(exposed_freq, cancer_freq) * len(exposed))

Exposed frequency? Yes: 0.50 No: 0.50


Cancer frequency? Yes: 0.50 No: 0.50
Expected frequencies:
[[0.25 0.25]
[0.25 0.25]]
Expected cross-table (frequencies * N):
[[5. 5.]
[5. 5.]]

4.1.5 Non-parametric test of pairwise associations

Spearman rank-order correlation (quantitative ~ quantitative)

The Spearman correlation is a non-parametric measure of the monotonicity of the relationship
between two datasets.

When to use it? Observe the data distribution:
• presence of outliers
• the distribution of the residuals is not Gaussian.
Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no cor-
relation. Correlations of -1 or +1 imply an exact monotonic relationship. Positive correlations
imply that as 𝑥 increases, so does 𝑦. Negative correlations imply that as 𝑥 increases, 𝑦 decreases.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.array([44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 46, 47, 48, 60.1])
y = np.array([2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 4, 4.1, 4.5, 3.8])

plt.plot(x, y, "bo")

# Non-Parametric Spearman
cor, pval = stats.spearmanr(x, y)
print("Non-Parametric Spearman cor test, cor: %.4f, pval: %.4f" % (cor, pval))

# "Parametric Pearson cor test


cor, pval = stats.pearsonr(x, y)
print("Parametric Pearson cor test: cor: %.4f, pval: %.4f" % (cor, pval))


Non-Parametric Spearman cor test, cor: 0.7110, pval: 0.0095


Parametric Pearson cor test: cor: 0.5263, pval: 0.0788

Wilcoxon signed-rank test (quantitative ~ cte)

Source: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when com-
paring two related samples, matched samples, or repeated measurements on a single sample
to assess whether their population mean ranks differ (i.e. it is a paired difference test). It is
equivalent to one-sample test of the difference of paired samples.
It can be used as an alternative to the paired Student’s 𝑡-test, 𝑡-test for matched pairs, or the 𝑡-
test for dependent samples when the population cannot be assumed to be normally distributed.
When to use it? Observe the data distribution:
• presence of outliers
• the distribution of the residuals is not Gaussian.
It has a lower sensitivity compared to 𝑡-test. May be problematic to use when the sample size is
small.
Null hypothesis 𝐻0 : difference between the pairs follows a symmetric distribution around zero.

import scipy.stats as stats


n = 20
# Business Volume at time 0
bv0 = np.random.normal(loc=3, scale=.1, size=n)
# Business Volume at time 1
bv1 = bv0 + 0.1 + np.random.normal(loc=0, scale=.1, size=n)

# create an outlier
bv1[0] -= 10

# Paired t-test
print(stats.ttest_rel(bv0, bv1))

# Wilcoxon
print(stats.wilcoxon(bv0, bv1))

Ttest_relResult(statistic=0.8167367438079456, pvalue=0.4242016933514212)
WilcoxonResult(statistic=40.0, pvalue=0.015240061183200121)

Mann–Whitney 𝑈 test (quantitative ~ categorical (2 levels))

In statistics, the Mann–Whitney 𝑈 test (also called the Mann–Whitney–Wilcoxon, Wilcoxon


rank-sum test or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null hypothesis
that two samples come from the same population against an alternative hypothesis, especially
that a particular population tends to have larger values than the other.
It can be applied on unknown distributions contrary to e.g. a 𝑡-test that has to be applied only
on normal distributions, and it is nearly as efficient as the 𝑡-test on normal distributions.

import scipy.stats as stats


n = 20
# Business Volume group 0
bv0 = np.random.normal(loc=1, scale=.1, size=n)

# Business Volume group 1
bv1 = np.random.normal(loc=1.2, scale=.1, size=n)

# create an outlier
bv1[0] -= 10

# Two-samples t-test
print(stats.ttest_ind(bv0, bv1))

# Mann-Whitney U
print(stats.mannwhitneyu(bv0, bv1))

Ttest_indResult(statistic=0.6227075213159515, pvalue=0.5371960369300763)
MannwhitneyuResult(statistic=43.0, pvalue=1.1512354940556314e-05)

4.1.6 Linear model

Given $n$ random samples $(y_i, x_{1i}, \ldots, x_{pi})$, $i = 1, \ldots, n$, the linear regression that models the relation
between the observations $y_i$ and the independent variables $x_{pi}$ is formulated as

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i \qquad i = 1, \ldots, n$$

• The $\beta$'s are the model parameters, i.e., the regression coefficients.
• $\beta_0$ is the intercept or the bias.
• $\varepsilon_i$ are the residuals.
• An independent variable (IV). It is a variable that stands alone and isn’t changed by
the other variables you are trying to measure. For example, someone’s age might be an


Fig. 3: Linear model

independent variable. Other factors (such as what they eat, how much they go to school,
how much television they watch) aren’t going to change a person’s age. In fact, when
you are looking for some kind of relationship between variables you are trying to see if
the independent variable causes some kind of change in the other variables, or dependent
variables. In Machine Learning, these variables are also called the predictors.
• A dependent variable. It is something that depends on other factors. For example, a test
score could be a dependent variable because it could change depending on several factors
such as how much you studied, how much sleep you got the night before you took the
test, or even how hungry you were when you took it. Usually when you are looking for
a relationship between two things you are trying to find out what makes the dependent
variable change the way it does. In Machine Learning this variable is called a target
variable.

Simple regression: test association between two quantitative variables

Using the dataset “salary”, explore the association between the dependent variable (e.g. Salary)
and the independent variable (e.g.: Experience is quantitative).

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
salary = pd.read_csv(url)

1. Model the data

Model the data on some hypothesis, e.g.: salary is a linear function of the experience.

$$\text{salary}_i = \beta\, \text{experience}_i + \beta_0 + \epsilon_i,$$

more generally

$$y_i = \beta\, x_i + \beta_0 + \epsilon_i$$

• $\beta$: the slope or coefficient or parameter of the model,
• $\beta_0$: the intercept or bias is the second parameter of the model,
• $\epsilon_i$: is the $i$th error, or residual, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

The simple regression is equivalent to the Pearson correlation.

2. Fit: estimate the model parameters

The goal is to estimate $\beta$, $\beta_0$ and $\sigma^2$.

Minimize the mean squared error (MSE) or the sum of squared errors (SSE). The so-called
Ordinary Least Squares (OLS) finds $\beta, \beta_0$ that minimize $SSE = \sum_i \epsilon_i^2$:

$$SSE = \sum_i (y_i - \beta\, x_i - \beta_0)^2$$

Recall from calculus that an extreme point can be found by computing where the derivative is
zero, i.e. to find the intercept, we perform the steps:

$$\frac{\partial SSE}{\partial \beta_0} = \sum_i (y_i - \beta\, x_i - \beta_0) = 0$$
$$\sum_i y_i = \beta \sum_i x_i + n\, \beta_0$$
$$n\, \bar{y} = n\, \beta\, \bar{x} + n\, \beta_0$$
$$\beta_0 = \bar{y} - \beta\, \bar{x}$$

To find the regression coefficient, we perform the steps:

$$\frac{\partial SSE}{\partial \beta} = \sum_i x_i (y_i - \beta\, x_i - \beta_0) = 0$$

Plug in $\beta_0$:

$$\sum_i x_i (y_i - \beta\, x_i - \bar{y} + \beta\, \bar{x}) = 0$$
$$\sum_i x_i y_i - \bar{y} \sum_i x_i = \beta \sum_i x_i (x_i - \bar{x})$$

Divide both sides by $n$:

$$\frac{1}{n} \sum_i x_i y_i - \bar{y}\bar{x} = \frac{1}{n} \beta \sum_i x_i (x_i - \bar{x})$$
$$\beta = \frac{\frac{1}{n} \sum_i x_i y_i - \bar{y}\bar{x}}{\frac{1}{n} \sum_i x_i (x_i - \bar{x})} = \frac{Cov(x, y)}{Var(x)}.$$

from scipy import stats
import numpy as np

y, x = salary.salary, salary.experience
beta, beta0, r_value, p_value, std_err = stats.linregress(x, y)
print("y = %f x + %f, r: %f, r-squared: %f,\np-value: %f, std_err: %f"
      % (beta, beta0, r_value, r_value**2, p_value, std_err))

print("Regression line with the scatterplot")
yhat = beta * x + beta0  # regression line

plt.plot(x, yhat, 'r-', x, y,'o')
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.show()

print("Using seaborn")
import seaborn as sns
sns.regplot(x="experience", y="salary", data=salary);

y = 491.486913 x + 13584.043803, r: 0.538886, r-squared: 0.290398,


p-value: 0.000112, std_err: 115.823381
Regression line with the scatterplot

Using seaborn
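
The estimated slope can be checked against the closed form $\beta = Cov(x, y)/Var(x)$ derived above. A minimal sketch reusing x and y:

import numpy as np

cov_xy = np.cov(x, y, ddof=0)[0, 1]                 # biased covariance
beta_check = cov_xy / np.var(x, ddof=0)             # slope
beta0_check = np.mean(y) - beta_check * np.mean(x)  # intercept
print(beta_check, beta0_check)  # matches beta, beta0 from stats.linregress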


3. 𝐹 -Test

3.1 Goodness of fit

The goodness of fit of a statistical model describes how well it fits a set of observations. Mea-
sures of goodness of fit typically summarize the discrepancy between observed values and the
values expected under the model in question. We will consider the explained variance also
known as the coefficient of determination, denoted 𝑅2 pronounced R-squared.
The total sum of squares, 𝑆𝑆tot is the sum of the sum of squares explained by the regression,
𝑆𝑆reg , plus the sum of squares of residuals unexplained by the regression, 𝑆𝑆res , also called the
SSE, i.e. such that

𝑆𝑆tot = 𝑆𝑆reg + 𝑆𝑆res



The mean of $y$ is

$$\bar{y} = \frac{1}{n} \sum_i y_i.$$

The total sum of squares is the total squared sum of deviations from the mean of $y$, i.e.

$$SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$$

The regression sum of squares, also called the explained sum of squares:

$$SS_{\text{reg}} = \sum_i (\hat{y}_i - \bar{y})^2,$$

where $\hat{y}_i = \beta x_i + \beta_0$ is the estimated value of salary $\hat{y}_i$ given a value of experience $x_i$.

The sum of squares of the residuals, also called the residual sum of squares (RSS), is:

$$SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2.$$

$R^2$ is the explained sum of squares of errors. It is the variance explained by the regression divided
by the total variance, i.e.

$$R^2 = \frac{\text{explained SS}}{\text{total SS}} = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}.$$

3.2 Test

Let \hat{\sigma}^2 = SS_{res}/(n - 2) be an estimator of the variance of 𝜖. The 2 in the denominator
stems from the 2 estimated parameters: intercept and coefficient.

• Unexplained variance: \frac{SS_{res}}{\hat{\sigma}^2} \sim \chi^2_{n-2}

• Explained variance: \frac{SS_{reg}}{\hat{\sigma}^2} \sim \chi^2_1. The single degree of freedom comes from the
  difference between \frac{SS_{tot}}{\hat{\sigma}^2} (\sim \chi^2_{n-1}) and \frac{SS_{res}}{\hat{\sigma}^2} (\sim \chi^2_{n-2}),
  i.e. (n - 1) - (n - 2) = 1 degree of freedom.

The Fisher statistic is the ratio of the two variances:

F = \frac{\text{Explained variance}}{\text{Unexplained variance}} = \frac{SS_{reg}/1}{SS_{res}/(n - 2)} \sim F(1, n - 2)

Using the 𝐹 -distribution, compute the probability of observing a value greater than 𝐹 under
𝐻0 , i.e.: 𝑃 (𝑥 > 𝐹 |𝐻0 ), i.e. the survival function (1 − Cumulative Distribution Function) at 𝑥 of
the given 𝐹 -distribution.
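
These quantities can be computed directly with numpy. Below is a minimal sketch, not part of the
original example, assuming the x, y and yhat variables of the simple regression example above are
still in scope:

import numpy as np
from scipy import stats

n = len(y)
y_mu = np.mean(y)

ss_tot = np.sum((y - y_mu) ** 2)          # total sum of squares
ss_reg = np.sum((yhat - y_mu) ** 2)       # sum of squares explained by the regression
ss_res = np.sum((y - yhat) ** 2)          # residual sum of squares (SSE)

r_squared = ss_reg / ss_tot               # coefficient of determination
fval = (ss_reg / 1) / (ss_res / (n - 2))  # F statistic with (1, n - 2) degrees of freedom
pval = stats.f.sf(fval, 1, n - 2)         # survival function: P(F(1, n - 2) > fval)

print("R2=%.3f, F=%.2f, p-value=%E" % (r_squared, fval, pval))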

Multiple regression

Theory

Multiple linear regression is the most basic supervised learning algorithm.


Given: a set of training data {𝑥1 , ..., 𝑥𝑁 } with corresponding targets {𝑦1 , ..., 𝑦𝑁 }.


In linear regression, we assume that the model that generates the data involves only a linear
combination of the input variables, i.e.

y(x_i, \beta) = \beta_0 + \beta_1 x_i^1 + ... + \beta_P x_i^P,

or, simplified

y(x_i, \beta) = \beta_0 + \sum_{j=1}^{P} \beta_j x_i^j.

Extending each sample with an intercept, 𝑥𝑖 := [1, 𝑥𝑖 ] ∈ 𝑅𝑃 +1 allows us to use a more general
notation based on linear algebra and write it as a simple dot product:

𝑦(𝑥𝑖 , 𝛽) = 𝑥𝑇𝑖 𝛽,

where 𝛽 ∈ 𝑅𝑃 +1 is a vector of weights that define the 𝑃 + 1 parameters of the model. From
now we have 𝑃 regressors + the intercept.
Minimize the Mean Squared Error MSE loss:
MSE(\beta) = \frac{1}{N}\sum_{i=1}^{N} (y_i - y(x_i, \beta))^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - x_i^T \beta)^2

Let 𝑋 = [x_1^T, ..., x_N^T] be the 𝑁 × (𝑃 + 1) matrix of the 𝑁 samples of 𝑃 input features with
one column of ones, and let 𝑦 = [y_1, ..., y_N] be the vector of the 𝑁 targets. Then, using linear
algebra, the mean squared error (MSE) loss can be rewritten:

MSE(\beta) = \frac{1}{N} ||y - X\beta||_2^2.
The 𝛽 that minimises the MSE can be found by:

\nabla_\beta \left( \frac{1}{N} ||y - X\beta||_2^2 \right) = 0

\frac{1}{N} \nabla_\beta \left( (y - X\beta)^T (y - X\beta) \right) = 0

\frac{1}{N} \nabla_\beta \left( y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \right) = 0

-2 X^T y + 2 X^T X \beta = 0

X^T X \beta = X^T y

\beta = (X^T X)^{-1} X^T y,

where (𝑋 𝑇 𝑋)−1 𝑋 𝑇 is a pseudo inverse of 𝑋.

Fit with numpy

import numpy as np
from scipy import linalg
np.random.seed(seed=42) # make the example reproducible

# Dataset
N, P = 50, 4
X = np.random.normal(size= N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])

betastar = np.array([10, 1., .5, 0.1])


e = np.random.normal(size=N)
y = np.dot(X, betastar) + e

# Estimate the parameters


Xpinv = linalg.pinv2(X)
betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)

[[ 1. -0.1382643 0.64768854 1.52302986]


[ 1. -0.23413696 1.57921282 0.76743473]
[ 1. 0.54256004 -0.46341769 -0.46572975]
[ 1. -1.91328024 -1.72491783 -0.56228753]
[ 1. 0.31424733 -0.90802408 -1.4123037 ]]
Estimated beta:
[10.14742501 0.57938106 0.51654653 0.17862194]
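
As a sanity check (a sketch added here, not part of the original example), the same estimate can be
obtained by solving the normal equations X^T X β = X^T y directly, or with numpy's least-squares
solver:

# Solve the normal equations X^T X beta = X^T y
betahat_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver (returns solution, residuals, rank, singular values)
betahat_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(betahat, betahat_solve)
assert np.allclose(betahat, betahat_lstsq)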

4.1.7 Linear model with statsmodels

Sources: http://statsmodels.sourceforge.net/devel/examples/

Multiple regression

Interface with statsmodels

import statsmodels.api as sm

## Fit and summary:


model = sm.OLS(y, X).fit()
print(model.summary())

# prediction of new values


ypred = model.predict(X)

# residuals + prediction == true values


assert np.all(ypred + model.resid == y)

OLS Regression Results


==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.322
Method: Least Squares F-statistic: 8.748
Date: Wed, 06 Nov 2019 Prob (F-statistic): 0.000106

Time: 18:03:24 Log-Likelihood: -71.271
No. Observations: 50 AIC: 150.5
Df Residuals: 46 BIC: 158.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 10.1474 0.150 67.520 0.000 9.845 10.450
x1 0.5794 0.160 3.623 0.001 0.258 0.901
x2 0.5165 0.151 3.425 0.001 0.213 0.820
x3 0.1786 0.144 1.240 0.221 -0.111 0.469
==============================================================================
Omnibus: 2.493 Durbin-Watson: 2.369
Prob(Omnibus): 0.288 Jarque-Bera (JB): 1.544
Skew: 0.330 Prob(JB): 0.462
Kurtosis: 3.554 Cond. No. 1.27
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.

Interface with Pandas

Use R language syntax for data.frame. For an additive model: 𝑦𝑖 = 𝛽 0 + 𝑥1𝑖 𝛽 1 + 𝑥2𝑖 𝛽 2 + 𝜖𝑖 ≡ y ~
x1 + x2.

import pandas as pd
import statsmodels.formula.api as smfrmla

df = pd.DataFrame(np.column_stack([X, y]), columns=['inter', 'x1', 'x2', 'x3', 'y'])


print(df.columns, df.shape)
# Build a model excluding the intercept, it is implicit
model = smfrmla.ols("y~x1 + x2 + x3", df).fit()
print(model.summary())

Index(['inter', 'x1', 'x2', 'x3', 'y'], dtype='object') (50, 5)


OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.322
Method: Least Squares F-statistic: 8.748
Date: Wed, 06 Nov 2019 Prob (F-statistic): 0.000106
Time: 18:03:24 Log-Likelihood: -71.271
No. Observations: 50 AIC: 150.5
Df Residuals: 46 BIC: 158.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 10.1474 0.150 67.520 0.000 9.845 10.450
x1 0.5794 0.160 3.623 0.001 0.258 0.901

x2 0.5165 0.151 3.425 0.001 0.213 0.820
x3 0.1786 0.144 1.240 0.221 -0.111 0.469
==============================================================================
Omnibus: 2.493 Durbin-Watson: 2.369
Prob(Omnibus): 0.288 Jarque-Bera (JB): 1.544
Skew: 0.330 Prob(JB): 0.462
Kurtosis: 3.554 Cond. No. 1.27
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.

Multiple regression with categorical independent variables or factors: Analysis of covariance
(ANCOVA)

Analysis of covariance (ANCOVA) is a linear model that blends ANOVA and linear regression.
ANCOVA evaluates whether population means of a dependent variable (DV) are equal across
levels of a categorical independent variable (IV) often called a treatment, while statistically
controlling for the effects of other quantitative or continuous variables that are not of primary
interest, known as covariates (CV).
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

try:
df = pd.read_csv("../datasets/salary_table.csv")
except:
url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/salary_table.csv'
df = pd.read_csv(url)

import seaborn as sns


fig, axes = plt.subplots(1, 3)

sns.distplot(df.salary[df.management == "Y"], color="r", bins=10, label="Manager:Y", ax=axes[0])
sns.distplot(df.salary[df.management == "N"], color="b", bins=10, label="Manager:N", ax=axes[0])

sns.regplot("experience", "salary", data=df, ax=axes[1])

# regplot has no `hue` argument: draw one regression line per management group
for grp, color in zip(["Y", "N"], ["r", "b"]):
    sns.regplot("experience", "salary", data=df[df.management == grp],
                color=color, ax=axes[2])

#sns.stripplot("experience", "salary", hue="management", data=df, ax=axes[2])


One-way AN(C)OVA

• ANOVA: one categorical independent variable, i.e. one factor.


• ANCOVA: ANOVA with some covariates.

import statsmodels.formula.api as smfrmla

oneway = smfrmla.ols('salary ~ management + experience', df).fit()


print(oneway.summary())
aov = sm.stats.anova_lm(oneway, typ=2) # Type 2 ANOVA DataFrame
print(aov)

OLS Regression Results


==============================================================================
Dep. Variable: salary R-squared: 0.865
Model: OLS Adj. R-squared: 0.859
Method: Least Squares F-statistic: 138.2
Date: Thu, 07 Nov 2019 Prob (F-statistic): 1.90e-19
Time: 12:16:50 Log-Likelihood: -407.76
No. Observations: 46 AIC: 821.5
Df Residuals: 43 BIC: 827.0
Df Model: 2

Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.021e+04 525.999 19.411 0.000 9149.578 1.13e+04
management[T.Y] 7145.0151 527.320 13.550 0.000 6081.572 8208.458
experience 527.1081 51.106 10.314 0.000 424.042 630.174
==============================================================================
Omnibus: 11.437 Durbin-Watson: 2.193
Prob(Omnibus): 0.003 Jarque-Bera (JB): 11.260
Skew: -1.131 Prob(JB): 0.00359
Kurtosis: 3.872 Cond. No. 22.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
sum_sq df F PR(>F)
management 5.755739e+08 1.0 183.593466 4.054116e-17
experience 3.334992e+08 1.0 106.377768 3.349662e-13
Residual 1.348070e+08 43.0 NaN NaN

Two-way AN(C)OVA

ANCOVA with two categorical independent variables, i.e. two factors.

import statsmodels.formula.api as smfrmla

twoway = smfrmla.ols('salary ~ education + management + experience', df).fit()


print(twoway.summary())
aov = sm.stats.anova_lm(twoway, typ=2) # Type 2 ANOVA DataFrame
print(aov)

OLS Regression Results


==============================================================================
Dep. Variable: salary R-squared: 0.957
Model: OLS Adj. R-squared: 0.953
Method: Least Squares F-statistic: 226.8
Date: Thu, 07 Nov 2019 Prob (F-statistic): 2.23e-27
Time: 12:16:52 Log-Likelihood: -381.63
No. Observations: 46 AIC: 773.3
Df Residuals: 41 BIC: 782.4
Df Model: 4
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 8035.5976 386.689 20.781 0.000 7254.663 8816.532
education[T.Master] 3144.0352 361.968 8.686 0.000 2413.025 3875.045
education[T.Ph.D] 2996.2103 411.753 7.277 0.000 2164.659 3827.762
management[T.Y] 6883.5310 313.919 21.928 0.000 6249.559 7517.503
experience 546.1840 30.519 17.896 0.000 484.549 607.819
==============================================================================

Omnibus: 2.293 Durbin-Watson: 2.237
Prob(Omnibus): 0.318 Jarque-Bera (JB): 1.362
Skew: -0.077 Prob(JB): 0.506
Kurtosis: 2.171 Cond. No. 33.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
sum_sq df F PR(>F)
education 9.152624e+07 2.0 43.351589 7.672450e-11
management 5.075724e+08 1.0 480.825394 2.901444e-24
experience 3.380979e+08 1.0 320.281524 5.546313e-21
Residual 4.328072e+07 41.0 NaN NaN

Comparing two nested models

oneway is nested within twoway. Comparing two nested models tells us if the additional predic-
tors (i.e. education) of the full model significantly decrease the residuals. Such comparison can
be done using an 𝐹 -test on residuals:

print(twoway.compare_f_test(oneway)) # return F, pval, df

(43.35158945918107, 7.672449570495418e-11, 2.0)

Factor coding

See http://statsmodels.sourceforge.net/devel/contrasts.html
By default, the formula interface (patsy) uses “dummy coding” for factors. Explore:

print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, :])

['Intercept', 'education[T.Master]', 'education[T.Ph.D]', 'management[T.Y]', 'experience']


[[1. 0. 0. 1. 1.]
[1. 0. 1. 0. 1.]
[1. 0. 1. 1. 1.]
[1. 1. 0. 0. 1.]
[1. 0. 1. 0. 1.]
[1. 1. 0. 1. 2.]
[1. 1. 0. 0. 2.]
[1. 0. 0. 0. 2.]
[1. 0. 1. 0. 2.]
[1. 1. 0. 0. 3.]]

Contrasts and post-hoc tests


# t-test of the specific contribution of experience:


ttest_exp = twoway.t_test([0, 0, 0, 0, 1])
ttest_exp.pvalue, ttest_exp.tvalue
print(ttest_exp)

# Alternatively, you can specify the hypothesis tests using a string


twoway.t_test('experience')

# Post-hoc is salary of Master different salary of Ph.D?


# ie. t-test salary of Master = salary of Ph.D.
print(twoway.t_test('education[T.Master] = education[T.Ph.D]'))

Test for Constraints


==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 546.1840 30.519 17.896 0.000 484.549 607.819
==============================================================================
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 147.8249 387.659 0.381 0.705 -635.069 930.719
==============================================================================

4.1.8 Multiple comparisons

import numpy as np
np.random.seed(seed=42) # make example reproducible

# Dataset
n_samples, n_features = 100, 1000
n_info = int(n_features/10) # number of features with information
n1, n2 = int(n_samples/2), n_samples - int(n_samples/2)
snr = .5
Y = np.random.randn(n_samples, n_features)
grp = np.array(["g1"] * n1 + ["g2"] * n2)

# Add some group effect for Pinfo features


Y[grp=="g1", :n_info] += snr

#
import scipy.stats as stats
import matplotlib.pyplot as plt
tvals, pvals = np.full(n_features, np.NAN), np.full(n_features, np.NAN)
for j in range(n_features):
tvals[j], pvals[j] = stats.ttest_ind(Y[grp=="g1", j], Y[grp=="g2", j],
equal_var=True)

fig, axis = plt.subplots(3, 1)#, sharex='col')

axis[0].plot(range(n_features), tvals, 'o')


axis[0].set_ylabel("t-value")



axis[1].plot(range(n_features), pvals, 'o')
axis[1].axhline(y=0.05, color='red', linewidth=3, label="p-value=0.05")
#axis[1].axhline(y=0.05, label="toto", color='red')
axis[1].set_ylabel("p-value")
axis[1].legend()

axis[2].hist([pvals[n_info:], pvals[:n_info]],
stacked=True, bins=100, label=["Negatives", "Positives"])
axis[2].set_xlabel("p-value histogram")
axis[2].set_ylabel("density")
axis[2].legend()

plt.tight_layout()

Note that under the null hypothesis the distribution of the p-values is uniform.
Statistical measures:
• True Positive (TP) equivalent to a hit. The test correctly concludes the presence of an
effect.
• True Negative (TN). The test correctly concludes the absence of an effect.
• False Positive (FP) equivalent to a false alarm, Type I error. The test improperly con-
cludes the presence of an effect. Thresholding at 𝑝-value < 0.05 leads to 47 FP.
• False Negative (FN) equivalent to a miss, Type II error. The test improperly concludes the
absence of an effect.

P, N = n_info, n_features - n_info # Positives, Negatives


TP = np.sum(pvals[:n_info ] < 0.05) # True Positives
FP = np.sum(pvals[n_info: ] < 0.05) # False Positives
print("No correction, FP: %i (expected: %.2f), TP: %i" % (FP, N * 0.05, TP))


Bonferroni correction for multiple comparisons

The Bonferroni correction is based on the idea that if an experimenter is testing 𝑃 hypothe-
ses, then one way of maintaining the familywise error rate (FWER) is to test each individual
hypothesis at a statistical significance level of 1/𝑃 times the desired maximum overall level.
So, if the desired significance level for the whole family of tests is 𝛼 (usually 0.05), then the
Bonferroni correction would test each individual hypothesis at a significance level of 𝛼/𝑃 . For
example, if a trial is testing 𝑃 = 8 hypotheses with a desired 𝛼 = 0.05, then the Bonferroni
correction would test each individual hypothesis at 𝛼 = 0.05/8 = 0.00625.

import statsmodels.sandbox.stats.multicomp as multicomp


_, pvals_fwer, _, _ = multicomp.multipletests(pvals, alpha=0.05,
method='bonferroni')
TP = np.sum(pvals_fwer[:n_info ] < 0.05) # True Positives
FP = np.sum(pvals_fwer[n_info: ] < 0.05) # False Positives
print("FWER correction, FP: %i, TP: %i" % (FP, TP))

The False discovery rate (FDR) correction for multiple comparisons

FDR-controlling procedures are designed to control the expected proportion of rejected null
hypotheses that were incorrect rejections (“false discoveries”). FDR-controlling procedures pro-
vide less stringent control of Type I errors compared to the familywise error rate (FWER) con-
trolling procedures (such as the Bonferroni correction), which control the probability of at least
one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased
rates of Type I errors.

import statsmodels.sandbox.stats.multicomp as multicomp


_, pvals_fdr, _, _ = multicomp.multipletests(pvals, alpha=0.05,
method='fdr_bh')
TP = np.sum(pvals_fdr[:n_info ] < 0.05) # True Positives
FP = np.sum(pvals_fdr[n_info: ] < 0.05) # False Positives

print("FDR correction, FP: %i, TP: %i" % (FP, TP))

4.1.9 Exercises

Simple linear regression and correlation (application)

Load the dataset: birthwt Risk Factors Associated with Low Infant Birth Weight at https://raw.
github.com/neurospin/pystatsml/master/datasets/birthwt.csv
1. Test the association of mother's age and birth weight (bwt) using the correlation test and
linear regression.
2. Test the association of mother's weight (lwt) and birth weight using the correlation test
and linear regression.
3. Produce two scatter plots of: (i) age by birth weight; (ii) mother's weight by birth weight.
Conclusion?


Simple linear regression (maths)

Considering the salary and experience variables of the salary table: https://raw.github.com/
neurospin/pystatsml/master/datasets/salary_table.csv
Compute:
• Estimate the model parameters 𝛽, 𝛽0 using scipy stats.linregress(x,y)
• Compute the predicted values 𝑦ˆ
Compute:
• 𝑦¯: y_mu
• 𝑆𝑆tot : ss_tot
• 𝑆𝑆reg : ss_reg
• 𝑆𝑆res : ss_res
• Check partition of variance formula based on sum of squares by using assert np.
allclose(val1, val2, atol=1e-05)
• Compute 𝑅2 and compare it with the r_value above
• Compute the 𝐹 score
• Compute the 𝑝-value:
• Plot the 𝐹 (1, 𝑛) distribution for 100 𝑓 values within [10, 25]. Draw 𝑃 (𝐹 (1, 𝑛) > 𝐹 ),
i.e. color the surface defined by the 𝑥 values larger than 𝐹 below the 𝐹 (1, 𝑛).
• 𝑃 (𝐹 (1, 𝑛) > 𝐹 ) is the 𝑝-value, compute it.

Multiple regression

Considering the simulated data used below:


1. What are the dimensions of pinv(𝑋)?
2. Compute the MSE between the predicted values and the true values.

import numpy as np
from scipy import linalg
np.random.seed(seed=42) # make the example reproducible

# Dataset
N, P = 50, 4
X = np.random.normal(size= N * P).reshape((N, P))
## Our model needs an intercept so we add a column of 1s:
X[:, 0] = 1
print(X[:5, :])

betastar = np.array([10, 1., .5, 0.1])


e = np.random.normal(size=N)
y = np.dot(X, betastar) + e

# Estimate the parameters


Xpinv = linalg.pinv2(X)

betahat = np.dot(Xpinv, y)
print("Estimated beta:\n", betahat)

Two sample t-test (maths)

Given the following two samples, test whether their means are equal.

height = np.array([ 1.83, 1.83, 1.73, 1.82, 1.83,


1.73,1.99, 1.85, 1.68, 1.87,
1.66, 1.71, 1.73, 1.64, 1.70,
1.60, 1.79, 1.73, 1.62, 1.77])
grp = np.array(["M"] * 10 + ["F"] * 10)

• Compute the means/std-dev per groups.


• Compute the 𝑡-value (standard two sample t-test with equal variances).
• Compute the 𝑝-value.
• The 𝑝-value is one-sided: a two-sided test would test P(T > tval) and P(T < -tval).
What would the two sided 𝑝-value be?
• Compare the two-sided 𝑝-value with the one obtained by stats.ttest_ind using assert
np.allclose(arr1, arr2).

Two sample t-test (application)

Risk Factors Associated with Low Infant Birth Weight: https://raw.github.com/neurospin/


pystatsml/master/datasets/birthwt.csv
1. Explore the data
2. Recode smoke factor
3. Compute the means/std-dev per groups.
4. Plot birth weight by smoking (box plot, violin plot or histogram)
5. Test the effect of smoking on birth weight

Two sample t-test and random permutations

Generate 100 samples following the model:

𝑦 =𝑔+𝜀

Where the noise 𝜀 ∼ 𝑁 (1, 1) and 𝑔 ∈ {0, 1} is a group indicator variable with 50 ones and 50
zeros.
• Write a function tstat(y, g) that computes the two-sample t-test of y split into two
groups defined by g.
• Sample the t-statistic distribution under the null hypothesis using random permutations.
• Assess the p-value.


Univariate associations (development)

Write a function univar_stat(df, target, variables) that computes the parametric statistics
and 𝑝-values between the target variable (provided as a string) and all variables (provided
as a list of strings) of the pandas DataFrame df. The target is a quantitative variable but
variables may be quantitative or qualitative. The function returns a DataFrame with four columns:
variable, test, value, p_value.
Apply it to the salary dataset available at https://raw.github.com/neurospin/pystatsml/master/
datasets/salary_table.csv, with target being S: salaries for IT staff in a corporation.

Multiple comparisons

This exercise has two goals: apply your knowledge of statistics using vectorized numpy operations.
Given the dataset provided for multiple comparisons, compute the two-sample 𝑡-test (assuming
equal variance) for each (column) feature of the Y array given the two groups defined by grp
variable. You should return two vectors of size n_features: one for the 𝑡-values and one for the
𝑝-values.

ANOVA

Perform an ANOVA on the dataset described below:


• Compute between and within variances
• Compute 𝐹 -value: fval
• Compare the 𝑝-value with the one obtained by stats.f_oneway using assert np.
allclose(arr1, arr2)

# dataset
mu_k = np.array([1, 2, 3]) # means of 3 samples
sd_k = np.array([1, 1, 1]) # sd of 3 samples
n_k = np.array([10, 20, 30]) # sizes of 3 samples
grp = [0, 1, 2] # group labels
n = np.sum(n_k)
label = np.hstack([[k] * n_k[k] for k in [0, 1, 2]])

y = np.zeros(n)
for k in grp:
y[label == k] = np.random.normal(mu_k[k], sd_k[k], n_k[k])

# Compute with scipy


fval, pval = stats.f_oneway(y[label == 0], y[label == 1], y[label == 2])


4.2 Lab 1: Brain volumes study

The study provides the brain volumes of grey matter (gm), white matter (wm) and cerebrospinal
fluid (csf) of 808 anatomical MRI scans.

4.2.1 Manipulate data

Set the working directory within a directory called “brainvol”


Create 2 subdirectories: data that will contain downloaded data and reports for results of the
analysis.

import os
import os.path
import pandas as pd
import tempfile
import urllib.request

WD = os.path.join(tempfile.gettempdir(), "brainvol")
os.makedirs(WD, exist_ok=True)
#os.chdir(WD)

# use cookiecutter file organization


# https://drivendata.github.io/cookiecutter-data-science/
os.makedirs(os.path.join(WD, "data"), exist_ok=True)
#os.makedirs("reports", exist_ok=True)

Fetch data
• Demographic data demo.csv (columns: participant_id, site, group, age, sex) and tissue
volume data: group is Control or Patient. site is the recruiting site.
• Gray matter volume gm.csv (columns: participant_id, session, gm_vol)
• White matter volume wm.csv (columns: participant_id, session, wm_vol)
• Cerebrospinal Fluid csf.csv (columns: participant_id, session, csf_vol)

base_url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/brain_volumes/%s'
data = dict()
for file in ["demo.csv", "gm.csv", "wm.csv", "csf.csv"]:
urllib.request.urlretrieve(base_url % file, os.path.join(WD, "data", file))

demo = pd.read_csv(os.path.join(WD, "data", "demo.csv"))


gm = pd.read_csv(os.path.join(WD, "data", "gm.csv"))
wm = pd.read_csv(os.path.join(WD, "data", "wm.csv"))
csf = pd.read_csv(os.path.join(WD, "data", "csf.csv"))

print("tables can be merge using shared columns")


print(gm.head())

Out:

tables can be merged using shared columns


participant_id session gm_vol
0 sub-S1-0002 ses-01 0.672506

1 sub-S1-0002 ses-02 0.678772
2 sub-S1-0002 ses-03 0.665592
3 sub-S1-0004 ses-01 0.890714
4 sub-S1-0004 ses-02 0.881127

Merge tables according to participant_id

brain_vol = pd.merge(pd.merge(pd.merge(demo, gm), wm), csf)


assert brain_vol.shape == (808, 9)

Drop rows with missing values

brain_vol = brain_vol.dropna()
assert brain_vol.shape == (766, 9)

Compute Total Intra-cranial volume tiv_vol = gm_vol + csf_vol + wm_vol.

brain_vol["tiv_vol"] = brain_vol["gm_vol"] + brain_vol["wm_vol"] + brain_vol["csf_vol"]

Compute tissue fractions gm_f = gm_vol / tiv_vol, wm_f = wm_vol / tiv_vol.

brain_vol["gm_f"] = brain_vol["gm_vol"] / brain_vol["tiv_vol"]


brain_vol["wm_f"] = brain_vol["wm_vol"] / brain_vol["tiv_vol"]

Save in an Excel file brain_vol.xlsx

brain_vol.to_excel(os.path.join(WD, "data", "brain_vol.xlsx"),


sheet_name='data', index=False)

4.2.2 Descriptive Statistics

Load excel file brain_vol.xlsx

import os
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smfrmla
import statsmodels.api as sm

brain_vol = pd.read_excel(os.path.join(WD, "data", "brain_vol.xlsx"),


sheet_name='data')
# Round float at 2 decimals when printing
pd.options.display.float_format = '{:,.2f}'.format

Descriptive statistics. Most participants have several MRI sessions (column session). Select
only the rows from the first session, “ses-01”.

brain_vol1 = brain_vol[brain_vol.session == "ses-01"]


# Check that there are no duplicates
assert len(brain_vol1.participant_id.unique()) == len(brain_vol1.participant_id)

Global descriptive statistics of numerical variables


desc_glob_num = brain_vol1.describe()
print(desc_glob_num)

Out:

age gm_vol wm_vol csf_vol tiv_vol gm_f wm_f


count 244.00 244.00 244.00 244.00 244.00 244.00 244.00
mean 34.54 0.71 0.44 0.31 1.46 0.49 0.30
std 12.09 0.08 0.07 0.08 0.17 0.04 0.03
min 18.00 0.48 0.05 0.12 0.83 0.37 0.06
25% 25.00 0.66 0.40 0.25 1.34 0.46 0.28
50% 31.00 0.70 0.43 0.30 1.45 0.49 0.30
75% 44.00 0.77 0.48 0.37 1.57 0.52 0.31
max 61.00 1.03 0.62 0.63 2.06 0.60 0.36

Global descriptive statistics of categorical variables

desc_glob_cat = brain_vol1[["site", "group", "sex"]].describe(include='all')


print(desc_glob_cat)

print("Get count by level")


desc_glob_cat = pd.DataFrame({col: brain_vol1[col].value_counts().to_dict()
for col in ["site", "group", "sex"]})
print(desc_glob_cat)

Out:

site group sex


count 244 244 244
unique 7 2 2
top S7 Patient M
freq 65 157 155
Get count by level
site group sex
Control nan 87.00 nan
F nan nan 89.00
M nan nan 155.00
Patient nan 157.00 nan
S1 13.00 nan nan
S3 29.00 nan nan
S4 15.00 nan nan
S5 62.00 nan nan
S6 1.00 nan nan
S7 65.00 nan nan
S8 59.00 nan nan

Remove the single participant from site 6

brain_vol = brain_vol[brain_vol.site != "S6"]


brain_vol1 = brain_vol[brain_vol.session == "ses-01"]
desc_glob_cat = pd.DataFrame({col: brain_vol1[col].value_counts().to_dict()
for col in ["site", "group", "sex"]})
print(desc_glob_cat)

Out:


site group sex


Control nan 86.00 nan
F nan nan 88.00
M nan nan 155.00
Patient nan 157.00 nan
S1 13.00 nan nan
S3 29.00 nan nan
S4 15.00 nan nan
S5 62.00 nan nan
S7 65.00 nan nan
S8 59.00 nan nan

Descriptive statistics of numerical variables per clinical status

desc_group_num = brain_vol1[["group", 'gm_vol']].groupby("group").describe()


print(desc_group_num)

Out:

                   gm_vol
            count   mean   std   min   25%   50%   75%   max
group
Control     86.00   0.72  0.09  0.48  0.66  0.71  0.78  1.03
Patient    157.00   0.70  0.08  0.53  0.65  0.70  0.76  0.90

4.2.3 Statistics

Objectives:
1. Site effect of gray matter atrophy
2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
3. Test for differences of atrophy between the patients and the controls
4. Test for interaction between age and clinical status, ie: is the brain atrophy process in
patient population faster than in the control population.
5. The effect of the medication in the patient population.

import statsmodels.api as sm
import statsmodels.formula.api as smfrmla
import scipy.stats
import seaborn as sns

1 Site effect on Grey Matter atrophy


The model is a one-way ANOVA: gm_f ~ site. The ANOVA test has important assumptions that must
be satisfied in order for the associated p-value to be valid.
• The samples are independent.
• Each sample is from a normally distributed population.
• The population standard deviations of the groups are all equal. This property is known as
homoscedasticity.


Plot

sns.violinplot("site", "gm_f", data=brain_vol1)

Stats with scipy

fstat, pval = scipy.stats.f_oneway(*[brain_vol1.gm_f[brain_vol1.site == s]


for s in brain_vol1.site.unique()])
print("Oneway Anova gm_f ~ site F=%.2f, p-value=%E" % (fstat, pval))

Out:

Oneway Anova gm_f ~ site F=14.82, p-value=1.188136E-12

Stats with statsmodels

anova = smfrmla.ols("gm_f ~ site", data=brain_vol1).fit()


# print(anova.summary())
print("Site explains %.2f%% of the grey matter fraction variance" %
(anova.rsquared * 100))

print(sm.stats.anova_lm(anova, typ=2))

Out:

Site explains 23.82% of the grey matter fraction variance


sum_sq df F PR(>F)

site 0.11 5.00 14.82 0.00
Residual 0.35 237.00 nan nan

2. Test the association between the age and gray matter atrophy in the control and patient
population independently.
Plot

sns.lmplot("age", "gm_f", hue="group", data=brain_vol1)

brain_vol1_ctl = brain_vol1[brain_vol1.group == "Control"]


brain_vol1_pat = brain_vol1[brain_vol1.group == "Patient"]

Stats with scipy

print("--- In control population ---")


beta, beta0, r_value, p_value, std_err = \
scipy.stats.linregress(x=brain_vol1_ctl.age, y=brain_vol1_ctl.gm_f)

print("gm_f = %f * age + %f" % (beta, beta0))


print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"\
% (r_value, r_value**2, p_value, std_err))

print("--- In patient population ---")


beta, beta0, r_value, p_value, std_err = \

scipy.stats.linregress(x=brain_vol1_pat.age, y=brain_vol1_pat.gm_f)

print("gm_f = %f * age + %f" % (beta, beta0))


print("Corr: %f, r-squared: %f, p-value: %f, std_err: %f"\
% (r_value, r_value**2, p_value, std_err))

print("Decrease seems faster in patient than in control population")

Out:

--- In control population ---


gm_f = -0.001181 * age + 0.529829
Corr: -0.325122, r-squared: 0.105704, p-value: 0.002255, std_err: 0.000375
--- In patient population ---
gm_f = -0.001899 * age + 0.556886
Corr: -0.528765, r-squared: 0.279592, p-value: 0.000000, std_err: 0.000245
Decrease seems faster in patient than in control population

Stats with statsmodels

print("--- In control population ---")


lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_ctl).fit()
print(lr.summary())
print("Age explains %.2f%% of the grey matter fraction variance" %
(lr.rsquared * 100))

print("--- In patient population ---")


lr = smfrmla.ols("gm_f ~ age", data=brain_vol1_pat).fit()
print(lr.summary())
print("Age explains %.2f%% of the grey matter fraction variance" %
(lr.rsquared * 100))

Out:

--- In control population ---


OLS Regression Results
==============================================================================
Dep. Variable: gm_f R-squared: 0.106
Model: OLS Adj. R-squared: 0.095
Method: Least Squares F-statistic: 9.929
Date: jeu., 31 oct. 2019 Prob (F-statistic): 0.00226
Time: 16:09:40 Log-Likelihood: 159.34
No. Observations: 86 AIC: -314.7
Df Residuals: 84 BIC: -309.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5298 0.013 40.350 0.000 0.504 0.556
age -0.0012 0.000 -3.151 0.002 -0.002 -0.000
==============================================================================
Omnibus: 0.946 Durbin-Watson: 1.628
Prob(Omnibus): 0.623 Jarque-Bera (JB): 0.782
Skew: 0.233 Prob(JB): 0.676

Kurtosis: 2.962 Cond. No. 111.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Age explains 10.57% of the grey matter fraction variance
--- In patient population ---
OLS Regression Results
==============================================================================
Dep. Variable: gm_f R-squared: 0.280
Model: OLS Adj. R-squared: 0.275
Method: Least Squares F-statistic: 60.16
Date: jeu., 31 oct. 2019 Prob (F-statistic): 1.09e-12
Time: 16:09:40 Log-Likelihood: 289.38
No. Observations: 157 AIC: -574.8
Df Residuals: 155 BIC: -568.7
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5569 0.009 60.817 0.000 0.539 0.575
age -0.0019 0.000 -7.756 0.000 -0.002 -0.001
==============================================================================
Omnibus: 2.310 Durbin-Watson: 1.325
Prob(Omnibus): 0.315 Jarque-Bera (JB): 1.854
Skew: 0.230 Prob(JB): 0.396
Kurtosis: 3.268 Cond. No. 111.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
Age explains 27.96% of the grey matter fraction variance

Before testing for differences of atrophy between the patients and the controls, run preliminary
tests for an age x group effect (patients could be older or younger than controls).
Plot

sns.violinplot("group", "age", data=brain_vol1)


Stats with scipy


print(scipy.stats.ttest_ind(brain_vol1_ctl.age, brain_vol1_pat.age))

Out:
Ttest_indResult(statistic=-1.2155557697674162, pvalue=0.225343592508479)

Stats with statsmodels


print(smfrmla.ols("age ~ group", data=brain_vol1).fit().summary())
print("No significant difference in age between patients and controls")

Out:
OLS Regression Results
==============================================================================
Dep. Variable: age R-squared: 0.006
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 1.478
Date: jeu., 31 oct. 2019 Prob (F-statistic): 0.225
Time: 16:09:40 Log-Likelihood: -949.69
No. Observations: 243 AIC: 1903.
Df Residuals: 241 BIC: 1910.
Df Model: 1
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 33.2558 1.305 25.484 0.000 30.685 35.826
group[T.Patient] 1.9735 1.624 1.216 0.225 -1.225 5.172
==============================================================================
Omnibus: 35.711 Durbin-Watson: 2.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 20.726
Skew: 0.569 Prob(JB): 3.16e-05
Kurtosis: 2.133 Cond. No. 3.12
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly␣
˓→specified.
No significant difference in age between patients and controls

Preliminary tests for sex x group (more/less males in patients than in Controls)

crosstab = pd.crosstab(brain_vol1.sex, brain_vol1.group)


print("Obeserved contingency table")
print(crosstab)

chi2, pval, dof, expected = scipy.stats.chi2_contingency(crosstab)

print("Chi2 = %f, pval = %f" % (chi2, pval))


print("Expected contingency table under the null hypothesis")
print(expected)
print("No significant difference in sex between patients and controls")

Out:

Observed contingency table


group Control Patient
sex
F 33 55
M 53 102
Chi2 = 0.143253, pval = 0.705068
Expected contingency table under the null hypothesis
[[ 31.14403292 56.85596708]
[ 54.85596708 100.14403292]]
No significant difference in sex between patients and controls

3. Test for differences of atrophy between the patients and the controls

print(sm.stats.anova_lm(smfrmla.ols("gm_f ~ group", data=brain_vol1).fit(), typ=2))


print("No significant difference in age between patients and controls")

Out:

sum_sq df F PR(>F)
group 0.00 1.00 0.01 0.92
Residual 0.46 241.00 nan nan
No significant difference in grey matter fraction between patients and controls

This model is simplistic; we should adjust for age and site


print(sm.stats.anova_lm(smfrmla.ols(
"gm_f ~ group + age + site", data=brain_vol1).fit(), typ=2))
print("No significant difference in age between patients and controls")

Out:

sum_sq df F PR(>F)
group 0.00 1.00 1.82 0.18
site 0.11 5.00 19.79 0.00
age 0.09 1.00 86.86 0.00
Residual 0.25 235.00 nan nan
No significant difference in grey matter fraction between patients and controls

4. Test for interaction between age and clinical status, ie: is the brain atrophy process in
patient population faster than in the control population.

ancova = smfrmla.ols("gm_f ~ group:age + age + site", data=brain_vol1).fit()


print(sm.stats.anova_lm(ancova, typ=2))

print("= Parameters =")


print(ancova.params)

print("%.3f%% of grey matter loss per year (almost %.1f%% per decade)" %\
(ancova.params.age * 100, ancova.params.age * 100 * 10))

print("grey matter loss in patients is accelerated by %.3f%% per decade" %


(ancova.params['group[T.Patient]:age'] * 100 * 10))

Out:

sum_sq df F PR(>F)
site 0.11 5.00 20.28 0.00
age 0.10 1.00 89.37 0.00
group:age 0.00 1.00 3.28 0.07
Residual 0.25 235.00 nan nan
= Parameters =
Intercept 0.52
site[T.S3] 0.01
site[T.S4] 0.03
site[T.S5] 0.01
site[T.S7] 0.06
site[T.S8] 0.02
age -0.00
group[T.Patient]:age -0.00
dtype: float64
-0.148% of grey matter loss per year (almost -1.5% per decade)
grey matter loss in patients is accelerated by -0.232% per decade

Total running time of the script: ( 0 minutes 4.267 seconds)

4.3 Multivariate statistics

Multivariate statistics includes all statistical techniques for analyzing samples made of two or
more variables. The data set (a 𝑁 × 𝑃 matrix X) is a collection of 𝑁 independent samples


column vectors [x_1, ..., x_i, ..., x_N] of length 𝑃:

X = \begin{bmatrix} -x_1^T- \\ \vdots \\ -x_i^T- \\ \vdots \\ -x_N^T- \end{bmatrix}
  = \begin{bmatrix}
      x_{11} & \cdots & x_{1j} & \cdots & x_{1P} \\
      \vdots &        & \vdots &        & \vdots \\
      x_{i1} & \cdots & x_{ij} & \cdots & x_{iP} \\
      \vdots &        & \vdots &        & \vdots \\
      x_{N1} & \cdots & x_{Nj} & \cdots & x_{NP}
    \end{bmatrix}_{N \times P}.

4.3.1 Linear Algebra

Euclidean norm and distance

The Euclidean norm of a vector a ∈ R^𝑃 is denoted

\|a\|_2 = \sqrt{\sum_i^P a_i^2}

The Euclidean distance between two vectors a, b ∈ R^𝑃 is

\|a - b\|_2 = \sqrt{\sum_i^P (a_i - b_i)^2}
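
A minimal numpy illustration of these two definitions (the vectors are arbitrary examples):

import numpy as np

np.random.seed(42)
a = np.random.randn(5)
b = np.random.randn(5)

norm_a = np.sqrt(np.sum(a ** 2))         # Euclidean norm of a
dist_ab = np.sqrt(np.sum((a - b) ** 2))  # Euclidean distance between a and b

# Same results with numpy's built-in norm
assert np.allclose(norm_a, np.linalg.norm(a))
assert np.allclose(dist_ab, np.linalg.norm(a - b))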

Dot product and projection

Source: Wikipedia
Algebraic definition
The dot product, denoted "·", of two 𝑃-dimensional vectors a = [a_1, a_2, ..., a_P] and
b = [b_1, b_2, ..., b_P] is defined as

a \cdot b = a^T b = \sum_i a_i b_i =
\begin{bmatrix} a_1 & \dots & a_P \end{bmatrix}
\begin{bmatrix} b_1 \\ \vdots \\ b_P \end{bmatrix}.

The Euclidean norm of a vector can be computed using the dot product, as

\|a\|_2 = \sqrt{a \cdot a}.

Geometric definition: projection


In Euclidean space, a Euclidean vector is a geometrical object that possesses both a magnitude
and a direction. A vector can be pictured as an arrow. Its magnitude is its length, and its
direction is the direction that the arrow points. The magnitude of a vector a is denoted by ‖a‖2 .
The dot product of two Euclidean vectors a and b is defined by

a · b = ‖a‖2 ‖b‖2 cos 𝜃,


where 𝜃 is the angle between a and b.


In particular, if a and b are orthogonal, then the angle between them is 90° and

a · b = 0.

At the other extreme, if they are codirectional, then the angle between them is 0° and

a · b = ‖a‖2 ‖b‖2

This implies that the dot product of a vector a by itself is

a · a = ‖a‖22 .

The scalar projection (or scalar component) of a Euclidean vector a in the direction of a Eu-
clidean vector b is given by

𝑎𝑏 = ‖a‖2 cos 𝜃,

where 𝜃 is the angle between a and b.


In terms of the geometric definition of the dot product, this can be rewritten as

a_b = \frac{a \cdot b}{\|b\|_2}.

Fig. 5: Projection.

import numpy as np
np.random.seed(42)

a = np.random.randn(10)
b = np.random.randn(10)

np.dot(a, b)

-4.085788532659924
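
The scalar projection defined above can be computed in the same way (a small sketch reusing the
vectors a and b from the previous cell):

# Scalar projection of a in the direction of b: (a . b) / ||b||
a_b = np.dot(a, b) / np.linalg.norm(b)

# Equivalent geometric form: ||a|| cos(theta)
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.allclose(a_b, np.linalg.norm(a) * cos_theta)
print(a_b)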


4.3.2 Mean vector

The mean (𝑃 × 1) column-vector 𝜇 whose estimator is

\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
        = \frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} x_{i1} \\ \vdots \\ x_{ij} \\ \vdots \\ x_{iP} \end{bmatrix}
        = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_P \end{bmatrix}.
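
A small numpy sketch of this estimator on arbitrary simulated data:

import numpy as np

np.random.seed(42)
N, P = 100, 3
X = np.random.randn(N, P)

xbar = X.mean(axis=0)  # mean vector: one mean per column (variable)
assert np.allclose(xbar, X.sum(axis=0) / N)
print(xbar)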

4.3.3 Covariance matrix

• The covariance matrix ΣXX is a symmetric positive semi-definite matrix whose element
in the 𝑗, 𝑘 position is the covariance between the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ elements of a random vector
i.e. the 𝑗 𝑡ℎ and 𝑘 𝑡ℎ columns of X.
• The covariance matrix generalizes the notion of covariance to multiple dimensions.
• The covariance matrix describe the shape of the sample distribution around the mean
assuming an elliptical distribution:

ΣXX = 𝐸(X − 𝐸(X))𝑇 𝐸(X − 𝐸(X)),


whose estimator S_XX is a 𝑃 × 𝑃 matrix given by

S_{XX} = \frac{1}{N-1} (X - 1\bar{x}^T)^T (X - 1\bar{x}^T).

If we assume that X is centered, i.e. X is replaced by X − 1\bar{x}^T, then the estimator is

S_{XX} = \frac{1}{N-1} X^T X
       = \frac{1}{N-1}
         \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ x_{1j} & \cdots & x_{Nj} \\ x_{1P} & \cdots & x_{NP} \end{bmatrix}
         \begin{bmatrix} x_{11} & \cdots & x_{1k} & x_{1P} \\ \vdots & & \vdots & \vdots \\ x_{N1} & \cdots & x_{Nk} & x_{NP} \end{bmatrix}
       = \begin{bmatrix} s_1 & \ldots & s_{1k} & s_{1P} \\ & \ddots & s_{jk} & \vdots \\ & & s_k & s_{kP} \\ & & & s_P \end{bmatrix},

where

s_{jk} = s_{kj} = \frac{1}{N-1} x_j^T x_k = \frac{1}{N-1} \sum_{i=1}^{N} x_{ij} x_{ik}

is an estimator of the covariance between the 𝑗th and 𝑘th variables.
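
Before the plotting example below, here is a minimal sketch (on arbitrary simulated data) checking
that the formula above matches numpy's np.cov:

import numpy as np

np.random.seed(42)
N, P = 100, 3
X = np.random.randn(N, P)

Xc = X - X.mean(axis=0)   # center each column
S = Xc.T @ Xc / (N - 1)   # covariance estimator from the formula above

# np.cov treats columns as variables when rowvar=False
assert np.allclose(S, np.cov(X, rowvar=False))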

## Avoid warnings and force inline plot


%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
##
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
import seaborn as sns # nice color



np.random.seed(42)
colors = sns.color_palette()

n_samples, n_features = 100, 2

mean, Cov, X = [None] * 4, [None] * 4, [None] * 4


mean[0] = np.array([-2.5, 2.5])
Cov[0] = np.array([[1, 0],
[0, 1]])

mean[1] = np.array([2.5, 2.5])


Cov[1] = np.array([[1, .5],
[.5, 1]])

mean[2] = np.array([-2.5, -2.5])


Cov[2] = np.array([[1, .9],
[.9, 1]])

mean[3] = np.array([2.5, -2.5])


Cov[3] = np.array([[1, -.9],
[-.9, 1]])

# Generate dataset
for i in range(len(mean)):
X[i] = np.random.multivariate_normal(mean[i], Cov[i], n_samples)

# Plot
for i in range(len(mean)):
# Points
plt.scatter(X[i][:, 0], X[i][:, 1], color=colors[i], label="class %i" % i)
# Means
plt.scatter(mean[i][0], mean[i][1], marker="o", s=200, facecolors='w',
edgecolors=colors[i], linewidth=2)
# Ellipses representing the covariance matrices
pystatsml.plot_utils.plot_cov_ellipse(Cov[i], pos=mean[i], facecolor='none',
linewidth=2, edgecolor=colors[i])

plt.axis('equal')
_ = plt.legend(loc='upper left')


4.3.4 Correlation matrix

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)

# Compute the correlation matrix


corr = df.corr()

# Generate a mask for the upper triangle


mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(5.5, 4.5))
cmap = sns.color_palette("RdBu_r", 11)
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(corr, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})


Re-order correlation matrix using AgglomerativeClustering

# convert correlation to distances


d = 2 * (1 - np.abs(corr))

from sklearn.cluster import AgglomerativeClustering


clustering = AgglomerativeClustering(n_clusters=3, linkage='single', affinity="precomputed
˓→").fit(d)

lab=0

clusters = [list(corr.columns[clustering.labels_==lab]) for lab in set(clustering.labels_


˓→)]
print(clusters)

reordered = np.concatenate(clusters)

R = corr.loc[reordered, reordered]

f, ax = plt.subplots(figsize=(5.5, 4.5))
# Draw the heatmap with the mask and correct aspect ratio
_ = sns.heatmap(R, mask=None, cmap=cmap, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})

[['mpg', 'cyl', 'disp', 'hp', 'wt', 'qsec', 'vs', 'carb'], ['am', 'gear'], ['drat']]


4.3.5 Precision matrix

In statistics, precision is the reciprocal of the variance, and the precision matrix is the matrix
inverse of the covariance matrix.
It is related to partial correlations that measures the degree of association between two vari-
ables, while controlling the effect of other variables.

import numpy as np

Cov = np.array([[1.0, 0.9, 0.9, 0.0, 0.0, 0.0],


[0.9, 1.0, 0.9, 0.0, 0.0, 0.0],
[0.9, 0.9, 1.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 1.0, 0.9, 0.0],
[0.0, 0.0, 0.0, 0.9, 1.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])

print("# Precision matrix:")


Prec = np.linalg.inv(Cov)
print(Prec.round(2))

print("# Partial correlations:")


Pcor = np.zeros(Prec.shape)
Pcor[::] = np.NaN

for i, j in zip(*np.triu_indices_from(Prec, 1)):


Pcor[i, j] = - Prec[i, j] / np.sqrt(Prec[i, i] * Prec[j, j])

print(Pcor.round(2))


# Precision matrix:
[[ 6.79 -3.21 -3.21 0. 0. 0. ]
[-3.21 6.79 -3.21 0. 0. 0. ]
[-3.21 -3.21 6.79 0. 0. 0. ]
[ 0. -0. -0. 5.26 -4.74 -0. ]
[ 0. 0. 0. -4.74 5.26 0. ]
[ 0. 0. 0. 0. 0. 1. ]]
# Partial correlations:
[[ nan 0.47 0.47 -0. -0. -0. ]
[ nan nan 0.47 -0. -0. -0. ]
[ nan nan nan -0. -0. -0. ]
[ nan nan nan nan 0.9 0. ]
[ nan nan nan nan nan -0. ]
[ nan nan nan nan nan nan]]

4.3.6 Mahalanobis distance

• The Mahalanobis distance is a measure of the distance between two points x and 𝜇 where
the dispersion (i.e. the covariance structure) of the samples is taken into account.
• The dispersion is considered through covariance matrix.
This is formally expressed as
D_M(x, \mu) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}.

Intuitions
• Distances along the principal directions of dispersion are contracted since they correspond
to likely dispersion of points.
• Distances othogonal to the principal directions of dispersion are dilated since they corre-
spond to unlikely dispersion of points.
For example

D_M(1) = \sqrt{1^T \Sigma^{-1} 1}.

ones = np.ones(Cov.shape[0])
d_euc = np.sqrt(np.dot(ones, ones))
d_mah = np.sqrt(np.dot(np.dot(ones, Prec), ones))

print("Euclidean norm of ones=%.2f. Mahalanobis norm of ones=%.2f" % (d_euc, d_mah))

Euclidean norm of ones=2.45. Mahalanobis norm of ones=1.77

The first dot product shows that distances along the principal directions of dispersion are contracted:

print(np.dot(ones, Prec))

[0.35714286 0.35714286 0.35714286 0.52631579 0.52631579 1. ]


import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
%matplotlib inline
np.random.seed(40)
colors = sns.color_palette()

mean = np.array([0, 0])


Cov = np.array([[1, .8],
[.8, 1]])
samples = np.random.multivariate_normal(mean, Cov, 100)
x1 = np.array([0, 2])
x2 = np.array([2, 2])

plt.scatter(samples[:, 0], samples[:, 1], color=colors[0])


plt.scatter(mean[0], mean[1], color=colors[0], s=200, label="mean")
plt.scatter(x1[0], x1[1], color=colors[1], s=200, label="x1")
plt.scatter(x2[0], x2[1], color=colors[2], s=200, label="x2")

# plot covariance ellipsis


pystatsml.plot_utils.plot_cov_ellipse(Cov, pos=mean, facecolor='none',
linewidth=2, edgecolor=colors[0])
# Compute distances
d2_m_x1 = scipy.spatial.distance.euclidean(mean, x1)
d2_m_x2 = scipy.spatial.distance.euclidean(mean, x2)

Covi = scipy.linalg.inv(Cov)
dm_m_x1 = scipy.spatial.distance.mahalanobis(mean, x1, Covi)
dm_m_x2 = scipy.spatial.distance.mahalanobis(mean, x2, Covi)

# Plot distances
vm_x1 = (x1 - mean) / d2_m_x1
vm_x2 = (x2 - mean) / d2_m_x2
jitter = .1
plt.plot([mean[0] - jitter, d2_m_x1 * vm_x1[0] - jitter],
[mean[1], d2_m_x1 * vm_x1[1]], color='k')
plt.plot([mean[0] - jitter, d2_m_x2 * vm_x2[0] - jitter],
[mean[1], d2_m_x2 * vm_x2[1]], color='k')

plt.plot([mean[0] + jitter, dm_m_x1 * vm_x1[0] + jitter],


[mean[1], dm_m_x1 * vm_x1[1]], color='r')
plt.plot([mean[0] + jitter, dm_m_x2 * vm_x2[0] + jitter],
[mean[1], dm_m_x2 * vm_x2[1]], color='r')

plt.legend(loc='lower right')
plt.text(-6.1, 3,
'Euclidean: d(m, x1) = %.1f<d(m, x2) = %.1f' % (d2_m_x1, d2_m_x2), color='k')
plt.text(-6.1, 3.5,
'Mahalanobis: d(m, x1) = %.1f>d(m, x2) = %.1f' % (dm_m_x1, dm_m_x2), color='r')

plt.axis('equal')
print('Euclidean d(m, x1) = %.2f < d(m, x2) = %.2f' % (d2_m_x1, d2_m_x2))
print('Mahalanobis d(m, x1) = %.2f > d(m, x2) = %.2f' % (dm_m_x1, dm_m_x2))


Euclidean d(m, x1) = 2.00 < d(m, x2) = 2.83


Mahalanobis d(m, x1) = 3.33 > d(m, x2) = 2.11

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Eu-
clidean distance. If the covariance matrix is diagonal, then the resulting distance measure is
called a normalized Euclidean distance.
More generally, the Mahalanobis distance is a measure of the distance between a point x and a
distribution 𝒩 (x|𝜇, Σ). It is a multi-dimensional generalization of the idea of measuring how
many standard deviations away x is from the mean. This distance is zero if x is at the mean,
and grows as x moves away from the mean: along each principal component axis, it measures
the number of standard deviations from x to the mean of the distribution.
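
A small check of the first statement, using scipy on arbitrary values (a sketch): with the identity
matrix as covariance, the Mahalanobis distance equals the Euclidean distance.

import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis

x = np.array([2., 1., 0.])
mu = np.zeros(3)

# mahalanobis() takes the *inverse* covariance matrix; the identity is its own inverse
d_maha = mahalanobis(x, mu, np.eye(3))
d_eucl = euclidean(x, mu)
assert np.allclose(d_maha, d_eucl)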

4.3.7 Multivariate normal distribution

The distribution, or probability density function (PDF) (sometimes just density), of a continuous
random variable is a function that describes the relative likelihood for this random variable to
take on a given value.
The multivariate normal distribution, or multivariate Gaussian distribution, of a 𝑃 -dimensional
random vector x = [𝑥1 , 𝑥2 , . . . , 𝑥𝑃 ]𝑇 is
\mathcal{N}(x|\mu, \Sigma) = \frac{1}{(2\pi)^{P/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D


def multivariate_normal_pdf(X, mean, sigma):
    """Multivariate normal probability density function over X (n_samples x n_features)"""
    P = X.shape[1]
    det = np.linalg.det(sigma)
    norm_const = 1.0 / (((2*np.pi) ** (P/2)) * np.sqrt(det))
    X_mu = X - mean
    inv = np.linalg.inv(sigma)
    d2 = np.sum(np.dot(X_mu, inv) * X_mu, axis=1)
    return norm_const * np.exp(-0.5 * d2)

# mean and covariance


mu = np.array([0, 0])
sigma = np.array([[1, -.5],
[-.5, 1]])

# x, y grid
x, y = np.mgrid[-3:3:.1, -3:3:.1]
X = np.stack((x.ravel(), y.ravel())).T
norm = multivariate_normal_pdf(X, mu, sigma).reshape(x.shape)

# Do it with scipy
norm_scpy = multivariate_normal(mu, sigma).pdf(np.stack((x, y), axis=2))
assert np.allclose(norm, norm_scpy)

# Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(x, y, norm, rstride=3,
cstride=3, cmap=plt.cm.coolwarm,
linewidth=1, antialiased=False
)

ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')

plt.title('Bivariate Normal/Gaussian distribution')


fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()


4.3.8 Exercises

Dot product and Euclidean norm

Given a = [2, 1]𝑇 and b = [1, 1]𝑇


1. Write a function euclidean(x) that computes the Euclidean norm of vector, x.
2. Compute the Euclidean norm of a.
3. Compute the Euclidean distance of ‖a − b‖2 .
4. Compute the projection of b in the direction of vector a: 𝑏𝑎 .
5. Simulate a dataset X of 𝑁 = 100 samples of 2-dimensional vectors.
6. Project all samples in the direction of the vector a.

Covariance matrix and Mahalanobis norm

1. Sample a dataset X of 𝑁 = 100 samples of 2-dimensional vectors from the bivariate
   normal distribution 𝒩(𝜇, Σ) where 𝜇 = [1, 1]^T and Σ = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}.
2. Compute the mean vector x̄ and center X. Compare the estimated mean x̄ to the true
mean, 𝜇.
3. Compute the empirical covariance matrix S. Compare the estimated covariance matrix S
to the true covariance matrix, Σ.


4. Compute S−1 (Sinv) the inverse of the covariance matrix by using scipy.linalg.inv(S).
5. Write a function mahalanobis(x, xbar, Sinv) that computes the Mahalanobis distance
of a vector x to the mean, x̄.
6. Compute the Mahalanobis and Euclidean distances of each sample x𝑖 to the mean x̄. Store
the results in a 100 × 2 dataframe.

4.4 Time Series in python

Two libraries:
• Pandas: https://pandas.pydata.org/pandas-docs/stable/timeseries.html
• statsmodels: http://www.statsmodels.org/devel/tsa.html

4.4.1 Stationarity

A TS is said to be stationary if its statistical properties, such as its mean and variance, remain constant over time:
• constant mean
• constant variance
• an autocovariance that does not depend on time.
What makes a TS non-stationary? There are two major reasons behind the non-stationarity of a TS:
1. Trend – varying mean over time. For example, in this case we saw that, on average, the number of passengers was growing over time.
2. Seasonality – variations at specific time-frames, e.g. people might have a tendency to buy cars in a particular month because of pay increments or festivals.
A quick check of both effects is sketched below.
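A minimal sketch (an illustration of ours, not from the original tutorial; it assumes a pandas Series ts indexed by dates, and the helper name check_stationarity is hypothetical): trend and seasonality can be eyeballed from rolling statistics, and stationarity can be tested with the Augmented Dickey-Fuller test from statsmodels.

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

def check_stationarity(ts, window=12):
    """Plot rolling statistics and run the Augmented Dickey-Fuller test."""
    ts.plot(label='original')
    ts.rolling(window).mean().plot(label='rolling mean')
    ts.rolling(window).std().plot(label='rolling std')
    plt.legend()
    # Small p-value (e.g. < 0.05): reject the null hypothesis of non-stationarity
    stat, pvalue = adfuller(ts.dropna())[:2]
    print("ADF statistic: %.3f, p-value: %.3f" % (stat, pvalue))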

4.4.2 Pandas Time Series Data Structure

A Series is similar to a list or an array in Python. It represents a series of values (numeric or otherwise) such as a column of data. It provides additional functionality, methods, and operators, which make it a more powerful version of a list.

import pandas as pd
import numpy as np

# Create a Series from a list


ser = pd.Series([1, 3])
print(ser)

# String as index
prices = {'apple': 4.99,
'banana': 1.99,
'orange': 3.99}
ser = pd.Series(prices)


print(ser)

x = pd.Series(np.arange(1,3), index=[x for x in 'ab'])


print(x)
print(x['b'])

0 1
1 3
dtype: int64
apple 4.99
banana 1.99
orange 3.99
dtype: float64
a 1
b 2
dtype: int64
2

4.4.3 Time Series Analysis of Google Trends

source: https://www.datacamp.com/community/tutorials/time-series-analysis-tutorial
Get Google Trends data of keywords such as ‘diet’ and ‘gym’ and see how they vary over time
while learning about trends and seasonality in time series data.
In the Facebook Live code-along session on the 4th of January, we checked out Google Trends data of the keywords 'diet', 'gym' and 'finance' to see how they vary over time. We asked ourselves whether there could be more searches for these terms in January, when we're all trying to turn over a new leaf.
In this tutorial, you’ll go through the code that we put together during the session step by step.
You’re not going to do much mathematics but you are going to do the following:
• Read data
• Recode data
• Exploratory Data Analysis

4.4.4 Read data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Interactive Matplotlib in a Jupyter Notebook
%matplotlib inline

# To make plots appear in their own window (e.g. in Spyder):
# Tools / Preferences / IPython Console / Graphics / Graphics Backend / Backend: "automatic"



try:
    url = "https://raw.githubusercontent.com/datacamp/datacamp_facebook_live_ny_resolution/master/datasets/multiTimeline.csv"
    df = pd.read_csv(url, skiprows=2)
except:
    df = pd.read_csv("../datasets/multiTimeline.csv", skiprows=2)

print(df.head())

# Rename columns
df.columns = ['month', 'diet', 'gym', 'finance']

# Describe
print(df.describe())

Month diet: (Worldwide) gym: (Worldwide) finance: (Worldwide)


0 2004-01 100 31 48
1 2004-02 75 26 49
2 2004-03 67 24 47
3 2004-04 70 22 48
4 2004-05 72 22 43
diet gym finance
count 168.000000 168.000000 168.000000
mean 49.642857 34.690476 47.148810
std 8.033080 8.134316 4.972547
min 34.000000 22.000000 38.000000
25% 44.000000 28.000000 44.000000
50% 48.500000 32.500000 46.000000
75% 53.000000 41.000000 50.000000
max 100.000000 58.000000 73.000000

4.4.5 Recode data

Next, you’ll turn the ‘month’ column into a DateTime data type and make it the index of the
DataFrame.
Note that you do this because you saw in the result of the .info() method that the 'Month' column was actually of data type object. That generic data type encapsulates everything from strings to integers, etc. That's not what you want when working with time series data, so you'll use .to_datetime() to convert the 'month' column in your DataFrame to a DateTime.
Be careful! Make sure to include the inplace argument when you’re setting the index of the
DataFrame df so that you actually alter the original index and set it to the ‘month’ column.

df.month = pd.to_datetime(df.month)
df.set_index('month', inplace=True)

print(df.head())

diet gym finance


month
2004-01-01 100 31 48


2004-02-01 75 26 49
2004-03-01 67 24 47
2004-04-01 70 22 48
2004-05-01 72 22 43

4.4.6 Exploratory Data Analysis

You can use a built-in pandas visualization method .plot() to plot your data as 3 line plots on a
single figure (one for each column, namely, ‘diet’, ‘gym’, and ‘finance’).

df.plot()
plt.xlabel('Year');

# change figure parameters


# df.plot(figsize=(20,10), linewidth=5, fontsize=20)

# Plot single column


# df[['diet']].plot(figsize=(20,10), linewidth=5, fontsize=20)
# plt.xlabel('Year', fontsize=20);

Note that this data is relative. As you can read on Google trends:
Numbers represent search interest relative to the highest point on the chart for the given region
and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term
is half as popular. Likewise a score of 0 means the term was less than 1% as popular as the
peak.

4.4.7 Resampling, Smoothing, Windowing, Rolling average: Trends

Rolling average: for each time point, take the average of the points on either side of it; the number of points is specified by a window size.


Remove Seasonality with pandas Series.


See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html ('A' is the offset alias for year-end frequency).

diet = df['diet']

diet_resamp_yr = diet.resample('A').mean()
diet_roll_yr = diet.rolling(12).mean()

ax = diet.plot(alpha=0.5, style='-') # store axis (ax) for latter plots


diet_resamp_yr.plot(style=':', label='Resample at year frequency', ax=ax)
diet_roll_yr.plot(style='--', label='Rolling average (smooth), window size=12', ax=ax)
ax.legend()

<matplotlib.legend.Legend at 0x7f0db4e0a2b0>

Rolling average (smoothing) with Numpy

x = np.asarray(df[['diet']])
win = 12
win_half = int(win / 2)
# print([((idx-win_half), (idx+win_half)) for idx in np.arange(win_half, len(x))])

diet_smooth = np.array([x[(idx - win_half):(idx + win_half)].mean()
                        for idx in np.arange(win_half, len(x))])
plt.plot(diet_smooth)

[<matplotlib.lines.Line2D at 0x7f0db4cfea90>]


Trends Plot Diet and Gym


Build a new DataFrame which is the concatenation of the diet and gym smoothed data.

gym = df['gym']

df_avg = pd.concat([diet.rolling(12).mean(), gym.rolling(12).mean()], axis=1)


df_avg.plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')

Detrending


df_dtrend = df[["diet", "gym"]] - df_avg


df_dtrend.plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')

4.4.8 First-order differencing: Seasonal Patterns

# diff = original - shifted data
# (exclude first term for some implementation details)
assert np.all((diet.diff() == diet - diet.shift())[1:])

df.diff().plot()
plt.xlabel('Year')

Text(0.5, 0, 'Year')


4.4.9 Periodicity and Correlation

df.plot()
plt.xlabel('Year');
print(df.corr())

diet gym finance


diet 1.000000 -0.100764 -0.034639
gym -0.100764 1.000000 -0.284279
finance -0.034639 -0.284279 1.000000


Plot correlation matrix

sns.heatmap(df.corr(), cmap="coolwarm")

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db29f3ba8>

‘diet’ and ‘gym’ are negatively correlated! Remember that you have a seasonal and a trend component. From the correlation coefficient, ‘diet’ and ‘gym’ are negatively correlated:
• the trend components are negatively correlated;
• the seasonal components would be positively correlated.
The overall correlation coefficient captures both of these effects.
Seasonal correlation: correlation of the first-order differences of these time series

df.diff().plot()
plt.xlabel('Year');

print(df.diff().corr())

diet gym finance


diet 1.000000 0.758707 0.373828
gym 0.758707 1.000000 0.301111
finance 0.373828 0.301111 1.000000


Plot correlation matrix


sns.heatmap(df.diff().corr(), cmap="coolwarm")

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db28aeb70>

Decomposing a time series into trend, seasonality and residuals


from statsmodels.tsa.seasonal import seasonal_decompose

x = gym

x = x.astype(float) # force float




decomposition = seasonal_decompose(x)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(x, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

4.4.10 Autocorrelation

A time series is periodic if it repeats itself at equally spaced intervals, say, every 12 months.
Autocorrelation Function (ACF): a measure of the correlation between the TS and a lagged version of itself. For instance, at lag 5, the ACF compares the series at time instants t1 ... t2 with the series at instants t1-5 ... t2-5.
Plot

from pandas.plotting import autocorrelation_plot

x = df["diet"].astype(float)
autocorrelation_plot(x)

<matplotlib.axes._subplots.AxesSubplot at 0x7f0db25b2dd8>

Compute Autocorrelation Function (ACF)

from statsmodels.tsa.stattools import acf

x_diff = x.diff().dropna() # first item is NA


lag_acf = acf(x_diff, nlags=36)
plt.plot(lag_acf)
plt.title('Autocorrelation Function')

/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/stattools.py:541: FutureWarning: fft=True will become the default in a future version of statsmodels. To suppress this warning, explicitly set fft=False.
  warnings.warn(msg, FutureWarning)

Text(0.5, 1.0, 'Autocorrelation Function')


ACF peaks every 12 months: Time series is correlated with itself shifted by 12 months.

4.4.11 Time Series Forecasting with Python using Autoregressive Moving Average
(ARMA) models

Sources:
• https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781783553358/7/ch07lvl1sec77/arma-models
• http://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model
• ARIMA: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
ARMA models are often used to forecast a time series. These models combine autoregressive
and moving average models. In moving average models, we assume that a variable is the sum
of the mean of the time series and a linear combination of noise components.
The autoregressive and moving average models can have different orders. In general, we can
define an ARMA model with p autoregressive terms and q moving average terms as follows:
x_t = \sum_{i=1}^{p} a_i x_{t-i} + \sum_{i=1}^{q} b_i \varepsilon_{t-i} + \varepsilon_t
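As an illustration of this definition (a sketch of ours, not part of the original tutorial; the coefficients a1 and b1 are arbitrary), an ARMA(1, 1) process can be simulated with statsmodels:

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(42)
a1, b1 = 0.9, 0.4  # arbitrary AR and MA coefficients

# statsmodels expects lag-polynomial coefficients: [1, -a1] for AR, [1, b1] for MA
arma = ArmaProcess(ar=np.array([1, -a1]), ma=np.array([1, b1]))
x_sim = arma.generate_sample(nsample=200)
print(x_sim[:5])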

Choosing p and q

Plot the partial autocorrelation function for an estimate of p, and likewise the autocorrelation function for an estimate of q.
Partial Autocorrelation Function (PACF): this measures the correlation between the TS and a lagged version of itself, but after eliminating the variations already explained by the intervening


comparisons. E.g., at lag 5, it will check the correlation but remove the effects already explained by lags 1 to 4.

from statsmodels.tsa.stattools import acf, pacf

x = df["gym"].astype(float)

x_diff = x.diff().dropna() # first item is NA


# ACF and PACF plots:

lag_acf = acf(x_diff, nlags=20)


lag_pacf = pacf(x_diff, nlags=20, method='ols')

#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.title('Autocorrelation Function (q=1)')

#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(x_diff)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function (p=1)')
plt.tight_layout()

In this plot, the two dotted lines on either side of 0 are the confidence intervals. These can be used to determine the p and q values as:
• p: The lag value where the PACF chart crosses the upper confidence interval for the first


time, in this case p=1.


• q: The lag value where the ACF chart crosses the upper confidence interval for the first
time, in this case q=1.

Fit ARMA model with statsmodels

1. Define the model by calling ARMA() and passing in the p and q parameters.
2. The model is prepared on the training data by calling the fit() function.
3. Predictions can be made by calling the predict() function and specifying the index of the
time or times to be predicted.

from statsmodels.tsa.arima_model import ARMA

model = ARMA(x, order=(1, 1)).fit() # fit model

print(model.summary())
plt.plot(x)
plt.plot(model.predict(), color='red')
plt.title('RSS: %.4f'% sum((model.fittedvalues-x)**2))

/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
  % freq, ValueWarning)
/home/edouard/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/kalmanf/kalmanfilter.py:221: RuntimeWarning: divide by zero encountered in true_divide
  Z_mat, R_mat, T_mat)

ARMA Model Results


==============================================================================
Dep. Variable: gym No. Observations: 168
Model: ARMA(1, 1) Log Likelihood -436.852
Method: css-mle S.D. of innovations 3.229
Date: Tue, 29 Oct 2019 AIC 881.704
Time: 11:47:14 BIC 894.200
Sample: 01-01-2004 HQIC 886.776
- 12-01-2017
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 36.4316 8.827 4.127 0.000 19.131 53.732
ar.L1.gym 0.9967 0.005 220.566 0.000 0.988 1.006
ma.L1.gym -0.7494 0.054 -13.931 0.000 -0.855 -0.644
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 1.0033 +0.0000j 1.0033 0.0000
MA.1 1.3344 +0.0000j 1.3344 0.0000
-----------------------------------------------------------------------------

Text(0.5, 1.0, 'RSS: 1794.4661')



CHAPTER

FIVE

MACHINE LEARNING

5.1 Dimension reduction and feature extraction

5.1.1 Introduction

In machine learning and statistics, dimensionality reduction or dimension reduction is the pro-
cess of reducing the number of features under consideration, and can be divided into feature
selection (not addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and builds derived values (fea-
tures) intended to be informative and non-redundant, facilitating the subsequent learning and
generalization steps, and in some cases leading to better human interpretations. Feature extrac-
tion is related to dimensionality reduction.
The input matrix X, of dimension 𝑁 × 𝑃 , is
X = \begin{bmatrix} x_{11} & \dots & x_{1P} \\ \vdots & \ddots & \vdots \\ x_{N1} & \dots & x_{NP} \end{bmatrix}

where the rows represent the samples and columns represent the variables.
The goal is to learn a transformation that extracts a few relevant features. This is generally
done by exploiting the covariance ΣXX between the input features.

5.1.2 Singular value decomposition and matrix factorization

Matrix factorization principles

Decompose the data matrix X𝑁 ×𝑃 into a product of a mixing matrix U𝑁 ×𝐾 and a dictionary
matrix V𝑃 ×𝐾 .

X = UV𝑇 ,

If we consider only a subset of components 𝐾 < 𝑟𝑎𝑛𝑘(X) < min(𝑃, 𝑁 − 1), X is approximated by a matrix X̂:

X ≈ X̂ = UV^T,


Each row x_i is a linear combination (with mixing coefficients u_i) of the dictionary items V.
N P-dimensional data points lie in a space whose dimension is less than N − 1 (2 dots lie on a line, 3 on a plane, etc.).

Fig. 1: Matrix factorization
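A minimal numpy sketch of this idea (the random matrix X is an assumption of ours, and a truncated SVD is used here as one particular way to obtain such a factorization):

import numpy as np

np.random.seed(0)
N, P, K = 20, 10, 3
X = np.random.randn(N, P)

# Truncated SVD as one possible factorization X ~ U V^T
U_full, d, Vt = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :K] * d[:K]  # mixing matrix (N x K), singular values absorbed into U
V = Vt[:K, :].T            # dictionary matrix (P x K)

X_hat = np.dot(U, V.T)     # rank-K approximation of X
print("Approximation error:", np.linalg.norm(X - X_hat))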

Singular value decomposition (SVD) principles

Singular-value decomposition (SVD) factorises the data matrix X𝑁 ×𝑃 into a product:

X = UDV𝑇 ,

where

\begin{bmatrix} x_{11} & \dots & x_{1P} \\ & X & \\ x_{N1} & \dots & x_{NP} \end{bmatrix}
=
\begin{bmatrix} u_{11} & \dots & u_{1K} \\ & U & \\ u_{N1} & \dots & u_{NK} \end{bmatrix}
\begin{bmatrix} d_{1} & & 0 \\ & D & \\ 0 & & d_{K} \end{bmatrix}
\begin{bmatrix} v_{11} & \dots & v_{1P} \\ & V^T & \\ v_{K1} & \dots & v_{KP} \end{bmatrix}.

V: right-singular vectors
• V = [v1 , · · · , v𝐾 ] is a 𝑃 × 𝐾 orthogonal matrix.
• It is a dictionary of patterns to be combined (according to the mixing coefficients) to reconstruct the original samples.
• V performs the initial rotations (projection) along the 𝐾 = min(𝑁, 𝑃 ) principal component directions, also called loadings.
• Each v𝑗 performs the linear combination of the variables that has maximum sample variance, subject to being uncorrelated with the previous v𝑗−1 .
D: singular values
• D is a 𝐾 × 𝐾 diagonal matrix made of the singular values of X with 𝑑1 ≥ 𝑑2 ≥ · · · ≥
𝑑𝐾 ≥ 0.
• D scales the projection along the coordinate axes by 𝑑1 , 𝑑2 , · · · , 𝑑𝐾 .


• Singular values are the square roots of the eigenvalues of X𝑇 X.


U: left-singular vectors
• U = [u1 , · · · , u𝐾 ] is an 𝑁 × 𝐾 orthogonal matrix.
• Each row u_i provides the mixing coefficients of dictionary items to reconstruct the sample x_i.
• It may be understood as the coordinates on the new orthogonal basis (obtained after the initial rotation) called principal components in the PCA.

SVD for variables transformation

V transforms correlated variables (X) into a set of uncorrelated ones (UD) that better expose
the various relationships among the original data items.

X = UDV^T,        (5.1)
XV = UDV^T V,     (5.2)
XV = UDI,         (5.3)
XV = UD           (5.4)

At the same time, SVD is a method for identifying and ordering the dimensions along which
data points exhibit the most variation.

import numpy as np
import scipy
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

np.random.seed(42)

# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])

# PCA using SVD


X -= X.mean(axis=0) # Centering is required
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
# U : Unitary matrix having left singular vectors as columns.
# Of shape (n_samples,n_samples) or (n_samples,n_comps), depending on
# full_matrices.
#
# s : The singular values, sorted in non-increasing order. Of shape (n_comps,),
# with n_comps = min(n_samples, n_features).
#
# Vh: Unitary matrix having right singular vectors as rows.
# Of shape (n_features, n_features) or (n_comps, n_features) depending
# on full_matrices.

plt.figure(figsize=(9, 3))

plt.subplot(131)
plt.scatter(U[:, 0], U[:, 1], s=50)
plt.axis('equal')
plt.title("U: Rotated and scaled data")

plt.subplot(132)

# Project data
PC = np.dot(X, Vh.T)
plt.scatter(PC[:, 0], PC[:, 1], s=50)
plt.axis('equal')
plt.title("XV: Rotated data")
plt.xlabel("PC1")
plt.ylabel("PC2")

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], s=50)
for i in range(Vh.shape[0]):
    plt.arrow(x=0, y=0, dx=Vh[i, 0], dy=Vh[i, 1], head_width=0.2,
              head_length=0.2, linewidth=2, fc='r', ec='r')
    plt.text(Vh[i, 0], Vh[i, 1], 'v%i' % (i+1), color="r", fontsize=15,
             horizontalalignment='right', verticalalignment='top')
plt.axis('equal')
plt.ylim(-4, 4)

plt.title("X: original data (v1, v2:PC dir.)")


plt.xlabel("experience")
plt.ylabel("salary")

plt.tight_layout()

5.1.3 Principal components analysis (PCA)

Sources:
• C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
• Everything you did and didn’t know about PCA
• Principal Component Analysis in 3 Simple Steps


Principles

• Principal components analysis is the main method used for linear dimension reduction.
• The idea of principal component analysis is to find the 𝐾 principal components di-
rections (called the loadings) V𝐾×𝑃 that capture the variation in the data as much as
possible.
• It converts a set of 𝑁 𝑃 -dimensional observations X𝑁 ×𝑃 of possibly correlated variables into a set of 𝑁 𝐾-dimensional samples C𝑁 ×𝐾 , where 𝐾 < 𝑃 . The new variables are linearly uncorrelated. The columns of C𝑁 ×𝐾 are called the principal components.
• The dimension reduction is obtained by using only 𝐾 < 𝑃 components that exploit corre-
lation (covariance) among the original variables.
• PCA is mathematically defined as an orthogonal linear transformation V𝐾×𝑃 that trans-
forms the data to a new coordinate system such that the greatest variance by some projec-
tion of the data comes to lie on the first coordinate (called the first principal component),
the second greatest variance on the second coordinate, and so on.

C𝑁 ×𝐾 = X𝑁 ×𝑃 V𝑃 ×𝐾

• PCA can be thought of as fitting a 𝑃 -dimensional ellipsoid to the data, where each axis of
the ellipsoid represents a principal component. If some axis of the ellipse is small, then the
variance along that axis is also small, and by omitting that axis and its corresponding prin-
cipal component from our representation of the dataset, we lose only a commensurately
small amount of information.
• Finding the 𝐾 largest axes of the ellipse will permit to project the data onto a space having
dimensionality 𝐾 < 𝑃 while maximizing the variance of the projected data.

Dataset preprocessing

Centering

Consider a data matrix, X , with column-wise zero empirical mean (the sample mean of each
column has been shifted to zero), ie. X is replaced by X − 1x̄𝑇 .

Standardizing

Optionally, standardize the columns, i.e., scale them by their standard deviation. Without standardization, a variable with a high variance will capture most of the effect of the PCA: the principal direction will be aligned with this variable. Standardization will, however, raise noise variables to the same level as informative variables.
The covariance matrix of centered, standardized data is the correlation matrix.
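A minimal sketch of both preprocessing steps (the data matrix X below is an arbitrary assumption, not from the text):

import numpy as np

np.random.seed(0)
X = np.random.randn(50, 3) * np.array([1., 5., 10.])  # variables with different scales

Xc = X - X.mean(axis=0)               # centering
Xcs = Xc / X.std(axis=0, ddof=1)      # standardizing

# Covariance of the centered, standardized data equals the correlation matrix of X
assert np.allclose(np.cov(Xcs.T), np.corrcoef(X.T))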

Eigendecomposition of the data covariance matrix

To begin with, consider the projection onto a one-dimensional space (𝐾 = 1). We can define
the direction of this space using a 𝑃 -dimensional vector v, which for convenience (and without
loss of generality) we shall choose to be a unit vector so that ‖v‖2 = 1 (note that we are only


interested in the direction defined by v, not in the magnitude of v itself). PCA consists of two main steps:
Projection in the directions that capture the greatest variance
Each 𝑃 -dimensional data point x𝑖 is then projected onto v, where the coordinate (in the co-
ordinate system of v) is a scalar value, namely x𝑇𝑖 v. I.e., we want to find the vector v that
maximizes these coordinates along v, which we will see corresponds to maximizing the vari-
ance of the projected data. This is equivalently expressed as
v = \arg\max_{\|v\|=1} \frac{1}{N} \sum_i \left(x_i^T v\right)^2.

We can write this in matrix form as

v = \arg\max_{\|v\|=1} \frac{1}{N} \|Xv\|^2 = \frac{1}{N} v^T X^T X v = v^T S_{XX} v,

where S_{XX} is a biased estimate of the covariance matrix of the data, i.e.

S_{XX} = \frac{1}{N} X^T X.
We now maximize the projected variance v𝑇 SXX v with respect to v. Clearly, this has to be a
constrained maximization to prevent ‖v‖2 → ∞. The appropriate constraint comes from the
normalization condition ‖v‖2 ≡ ‖v‖22 = v𝑇 v = 1. To enforce this constraint, we introduce a
Lagrange multiplier that we shall denote by 𝜆, and then make an unconstrained maximization
of

v𝑇 SXX v − 𝜆(v𝑇 v − 1).

By setting the gradient with respect to v equal to zero, we see that this quantity has a stationary
point when

SXX v = 𝜆v.

We note that v is an eigenvector of SXX .


If we left-multiply the above equation by v𝑇 and make use of v𝑇 v = 1, we see that the variance
is given by

v𝑇 SXX v = 𝜆,

and so the variance will be at a maximum when v is equal to the eigenvector corresponding to
the largest eigenvalue, 𝜆. This eigenvector is known as the first principal component.
We can define additional principal components in an incremental fashion by choosing each new
direction to be that which maximizes the projected variance amongst all possible directions that
are orthogonal to those already considered. If we consider the general case of a 𝐾-dimensional
projection space, the optimal linear projection for which the variance of the projected data is
maximized is now defined by the 𝐾 eigenvectors, v1 , . . . , vK , of the data covariance matrix
SXX that corresponds to the 𝐾 largest eigenvalues, 𝜆1 ≥ 𝜆2 ≥ · · · ≥ 𝜆𝐾 .


Back to SVD

The sample covariance matrix of centered data X is given by

S_{XX} = \frac{1}{N-1} X^T X.

We rewrite X^T X using the SVD decomposition of X as

X^T X = (UDV^T)^T (UDV^T)
      = V D^T U^T U D V^T
      = V D^2 V^T

V^T X^T X V = D^2
\frac{1}{N-1} V^T X^T X V = \frac{1}{N-1} D^2
V^T S_{XX} V = \frac{1}{N-1} D^2.

Considering only the k-th right-singular vector v_k associated to the singular value d_k,

v_k^T S_{XX} v_k = \frac{1}{N-1} d_k^2.
It turns out that if you have done the singular value decomposition then you already have the eigenvalue decomposition of X^T X, where:
• the eigenvectors of S_{XX} are equivalent to the right singular vectors, V, of X;
• the eigenvalues, 𝜆𝑘 , of S_{XX} , i.e. the variances of the components, are equal to \frac{1}{N-1} times the squared singular values, d_k^2.
Moreover, computing PCA with the SVD does not require forming the matrix X^T X, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.
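A small numerical check of this equivalence (a sketch with simulated, centered data of our own, not from the text):

import numpy as np
import scipy.linalg

np.random.seed(42)
X = np.random.randn(100, 3)
X -= X.mean(axis=0)   # centering
N = X.shape[0]

U, d, Vt = scipy.linalg.svd(X, full_matrices=False)
S = np.dot(X.T, X) / (N - 1)                  # sample covariance matrix
eigval, eigvec = np.linalg.eigh(S)            # ascending eigenvalues

# Eigenvalues of S equal squared singular values / (N - 1)
assert np.allclose(eigval[::-1], d ** 2 / (N - 1))
# Eigenvectors of S equal the right singular vectors, up to sign
assert np.allclose(np.abs(eigvec[:, ::-1]), np.abs(Vt.T))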

PCA outputs

The SVD or the eigendecomposition of the data covariance matrix provides three main quanti-
ties:
1. Principal component directions or loadings are the eigenvectors of X𝑇 X. The V𝐾×𝑃
or the right-singular vectors of an SVD of X are called principal component directions of
X. They are generally computed using the SVD of X.
2. Principal components is the 𝑁 × 𝐾 matrix C which is obtained by projecting X onto the
principal components directions, i.e.
C𝑁 ×𝐾 = X𝑁 ×𝑃 V𝑃 ×𝐾 .
Since X = UDV𝑇 and V is orthogonal (V𝑇 V = I):


C_{N×K} = (UDV^T)_{N×P} V_{P×K}    (5.5)
C_{N×K} = (UD)_{N×K} I_{K×K}       (5.6)
C_{N×K} = (UD)_{N×K}               (5.7)

Thus c_j = Xv_j = u_j d_j, for j = 1, . . . K. Hence u_j is simply the projection of the row vectors of X, i.e., the input predictor vectors, on the direction v_j, scaled by d_j.

c_1 = \begin{bmatrix} x_{1,1} v_{1,1} + \dots + x_{1,P} v_{1,P} \\ x_{2,1} v_{1,1} + \dots + x_{2,P} v_{1,P} \\ \vdots \\ x_{N,1} v_{1,1} + \dots + x_{N,P} v_{1,P} \end{bmatrix}

3. The variance of each component is given by the eigenvalues 𝜆𝑘 , 𝑘 = 1, . . . 𝐾. It can be obtained from the singular values:

var(c_k) = \frac{1}{N-1} (Xv_k)^2 = \frac{1}{N-1} (u_k d_k)^2 = \frac{1}{N-1} d_k^2.     (5.9)–(5.11)

Determining the number of PCs

We must choose 𝐾 * ∈ [1, . . . , 𝐾], the number of required components. This can be done by
calculating the explained variance ratio of the 𝐾 * first components and by choosing 𝐾 * such
that the cumulative explained variance ratio is greater than some given threshold (e.g., ≈
90%). This is expressed as
\text{cumulative explained variance}(c_k) = \frac{\sum_j^{K^*} var(c_j)}{\sum_j^{K} var(c_j)}.
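A minimal sketch with scikit-learn (the data and the 90% threshold below are arbitrary assumptions of ours):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.randn(100, 5)

pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
K_star = int(np.argmax(cum_var >= 0.90)) + 1  # smallest K reaching the threshold
print(cum_var, K_star)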

Interpretation and visualization

PCs
Plot the samples projected on the first principal components, e.g. PC1 against PC2.
PC directions
Exploring the loadings associated with a component provides the contribution of each original variable to the component.
Remark: the loadings (PC directions) are the coefficients of the multiple regression of the PC on the original variables:


c = Xv                       (5.12)
X^T c = X^T X v              (5.13)
(X^T X)^{-1} X^T c = v       (5.14)

Another way to evaluate the contribution of the original variables in each PC can be obtained by computing the correlation between the PCs and the original variables, i.e. the columns of X, denoted x_j, for j = 1, . . . , P. For the k-th PC, compute and plot the correlations with all original variables

cor(c_k, x_j), j = 1, . . . , P.

These quantities are sometimes called the correlation loadings.
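A minimal sketch of computing these correlation loadings (the data matrix X is an arbitrary assumption; the components C are obtained with scikit-learn):

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.randn(100, 3)

pca = PCA(n_components=2).fit(X)
C = pca.transform(X)  # principal components (N x K)

# cor(c_k, x_j) for each component k and each original variable j
corr_loadings = np.array([[np.corrcoef(C[:, k], X[:, j])[0, 1]
                           for j in range(X.shape[1])]
                          for k in range(C.shape[1])])
print(corr_loadings.round(2))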

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

np.random.seed(42)

# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])

# PCA with scikit-learn


pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)

PC = pca.transform(X)

plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1"); plt.ylabel("x2")

plt.subplot(122)
plt.scatter(PC[:, 0], PC[:, 1])
plt.xlabel("PC1 (var=%.2f)" % pca.explained_variance_ratio_[0])
plt.ylabel("PC2 (var=%.2f)" % pca.explained_variance_ratio_[1])
plt.axis('equal')
plt.tight_layout()

[0.93646607 0.06353393]


5.1.4 Multi-dimensional Scaling (MDS)

Resources:
• http://www.stat.pitt.edu/sungkyu/course/2221Fall13/lec8_mds_combined.pdf
• https://en.wikipedia.org/wiki/Multidimensional_scaling
• Hastie, Tibshirani and Friedman (2009). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. New York: Springer, Second Edition.
The purpose of MDS is to find a low-dimensional projection of the data in which the pairwise distances between data points are preserved, as closely as possible (in a least-squares sense).
• Let D be the (𝑁 × 𝑁 ) pairwise distance matrix where 𝑑𝑖𝑗 is a distance between points 𝑖
and 𝑗.
• The MDS concept can be extended to a wide variety of data types specified in terms of a
similarity matrix.
Given the dissimilarity (distance) matrix D𝑁 ×𝑁 = [𝑑𝑖𝑗 ], MDS attempts to find 𝐾-dimensional
projections of the 𝑁 points x1 , . . . , x𝑁 ∈ R𝐾 , concatenated in an X𝑁 ×𝐾 matrix, so that 𝑑𝑖𝑗 ≈
‖x𝑖 − x𝑗 ‖ are as close as possible. This can be obtained by the minimization of a loss function
called the stress function
\text{stress}(X) = \sum_{i \neq j} \left(d_{ij} - \|x_i - x_j\|\right)^2.

This loss function is known as least-squares or Kruskal-Shepard scaling.
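A minimal sketch of evaluating this stress for a candidate low-dimensional configuration (all data below are arbitrary; this direct formula need not match the normalization used by sklearn's stress_ attribute):

import numpy as np
from scipy.spatial.distance import pdist, squareform

np.random.seed(0)
points = np.random.randn(10, 5)              # original high-dimensional points
D = squareform(pdist(points))                # observed pairwise distances d_ij

X = np.random.randn(10, 2)                   # a candidate low-dimensional configuration
DX = squareform(pdist(X))                    # distances in the projection

stress = np.sum((D - DX) ** 2)               # sum over i != j (diagonal terms are zero)
print(stress)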


A modification of least-squares scaling is the Sammon mapping

\text{stress}_{Sammon}(X) = \sum_{i \neq j} \frac{\left(d_{ij} - \|x_i - x_j\|\right)^2}{d_{ij}}.


The Sammon mapping performs better at preserving small distances compared to the least-
squares scaling.

Classical multidimensional scaling

Also known as principal coordinates analysis, PCoA.


• The distance matrix, D, is transformed to a similarity matrix, B, often using centered
inner products.
• The loss function becomes
\text{stress}_{classical}(X) = \sum_{i \neq j} \left(b_{ij} - \langle x_i, x_j \rangle\right)^2.

• The stress function in classical MDS is sometimes called strain.


• The solution for the classical MDS problems can be found from the eigenvectors of the
similarity matrix.
• If the distances in D are Euclidean and double centered inner products are used, the results are equivalent to PCA (see the sketch below).
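A minimal sketch of that last point (the data are simulated by us; squared Euclidean distances are double-centered and the resulting coordinates compared to PCA scores, up to sign):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.randn(30, 4)
D2 = squareform(pdist(X)) ** 2          # squared Euclidean distances

n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
B = -0.5 * J.dot(D2).dot(J)             # double-centered inner products

eigval, eigvec = np.linalg.eigh(B)
idx = np.argsort(eigval)[::-1][:2]      # two largest eigenvalues
coords = eigvec[:, idx] * np.sqrt(eigval[idx])

pc = PCA(n_components=2).fit_transform(X)
assert np.allclose(np.abs(coords), np.abs(pc))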

Example

The eurodist dataset provides the road distances (in kilometers) between 21 cities in Europe. Given this matrix of pairwise (non-Euclidean) distances D = [d_ij], MDS can be used to recover the coordinates of the cities in some Euclidean referential whose orientation is arbitrary.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pairwise distance between European cities


try:
url = '../datasets/eurodist.csv'
df = pd.read_csv(url)
except:
url = 'https://raw.github.com/neurospin/pystatsml/master/datasets/eurodist.csv'
df = pd.read_csv(url)

print(df.iloc[:5, :5])

city = df["city"]
D = np.array(df.iloc[:, 1:]) # Distance matrix

# Arbitrary choice of K=2 components


from sklearn.manifold import MDS
mds = MDS(dissimilarity='precomputed', n_components=2, random_state=40,
          max_iter=3000, eps=1e-9)
X = mds.fit_transform(D)

city Athens Barcelona Brussels Calais


0 Athens 0 3313 2963 3175
1 Barcelona 3313 0 1318 1326


2 Brussels 2963 1318 0 204
3 Calais 3175 1326 204 0
4 Cherbourg 3339 1294 583 460

Recover coordinates of the cities in Euclidean referential whose orientation is arbitrary:

from sklearn import metrics


Deuclidean = metrics.pairwise.pairwise_distances(X, metric='euclidean')
print(np.round(Deuclidean[:5, :5]))

[[ 0. 3116. 2994. 3181. 3428.]


[3116. 0. 1317. 1289. 1128.]
[2994. 1317. 0. 198. 538.]
[3181. 1289. 198. 0. 358.]
[3428. 1128. 538. 358. 0.]]

Plot the results:

# Plot: apply some rotation and flip


theta = 80 * np.pi / 180.
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
Xr = np.dot(X, rot)
# flip x
Xr[:, 0] *= -1
plt.scatter(Xr[:, 0], Xr[:, 1])

for i in range(len(city)):
    plt.text(Xr[i, 0], Xr[i, 1], city[i])
plt.axis('equal')

(-1894.1017744377398,
2914.3652937179477,
-1712.9885463201906,
2145.4522453884565)


Determining the number of components

We must choose 𝐾 * ∈ {1, . . . , 𝐾}, the number of required components. Plot the values of the stress function obtained using 𝑘 ≤ 𝑁 − 1 components. In general, start with 1 ≤ 𝑘 ≤ 4. Choose 𝐾 * where you can clearly distinguish an elbow in the stress curve.
Thus, in the plot below, we choose to retain information accounted for by the first two compo-
nents, since this is where the elbow is in the stress curve.

k_range = range(1, min(5, D.shape[0]-1))


stress = [MDS(dissimilarity='precomputed', n_components=k,
              random_state=42, max_iter=300, eps=1e-9).fit(D).stress_ for k in k_range]

print(stress)
plt.plot(k_range, stress)
plt.xlabel("k")
plt.ylabel("stress")

[48644495.28571428, 3356497.365752386, 2858455.495887962, 2756310.637628011]

Text(0, 0.5, 'stress')


5.1.5 Nonlinear dimensionality reduction

Sources:
• Scikit-learn documentation
• Wikipedia
Nonlinear dimensionality reduction or manifold learning cover unsupervised methods that
attempt to identify low-dimensional manifolds within the original 𝑃 -dimensional space that
represent high data density. Then those methods provide a mapping from the high-dimensional
space to the low-dimensional embedding.

Isomap

Isomap is a nonlinear dimensionality reduction method that combines a procedure to compute the distance matrix with MDS. The distance calculation is based on geodesic distances evaluated on a neighborhood graph:
1. Determine the neighbors of each point: all points within some fixed radius, or the K nearest neighbors.
2. Construct a neighborhood graph: each point is connected to another if it is a K nearest neighbor, with edge length equal to the Euclidean distance.
3. Compute the shortest path between pairs of points 𝑑𝑖𝑗 to build the distance matrix D.
4. Apply MDS on D.

import matplotlib.pyplot as plt


from mpl_toolkits.mplot3d import Axes3D
from sklearn import manifold, datasets

X, color = datasets.samples_generator.make_s_curve(1000, random_state=42)

fig = plt.figure(figsize=(10, 5))


plt.suptitle("Isomap Manifold Learning", fontsize=14)

ax = fig.add_subplot(121, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)
plt.title('2D "S shape" manifold in 3D')

Y = manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(X)
ax = fig.add_subplot(122)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("Isomap")
plt.xlabel("First component")
plt.ylabel("Second component")
plt.axis('tight')

(-5.276311544714793, 5.4164373180970316, -1.23497771017066, 1.2910940054965336)

5.1.6 Exercises

PCA

Write a basic PCA class

Write a class BasicPCA with two methods:


• fit(X) that estimates the data mean, principal components directions V and the explained
variance of each component.
• transform(X) that projects the data onto the principal components.
Check that your BasicPCA gave similar results, compared to the results from sklearn.

Apply your Basic PCA on the iris dataset

The data set is available at: https://raw.github.com/neurospin/pystatsml/master/datasets/iris.csv
• Describe the data set. Should the dataset be standardized?
• Describe the structure of correlations among variables.
• Compute a PCA with the maximum number of components.
• Compute the cumulative explained variance ratio. Determine the number of components
𝐾 by your computed values.
• Print the 𝐾 principal components directions and correlations of the 𝐾 principal compo-
nents with the original variables. Interpret the contribution of the original variables into
the PC.
• Plot the samples projected into the 𝐾 first PCs.
• Color samples by their species.

MDS

Apply MDS from sklearn on the iris dataset available at: https://raw.github.com/neurospin/pystatsml/master/datasets/iris.csv
• Center and scale the dataset.
• Compute Euclidean pairwise distances matrix.
• Select the number of components.
• Show that classical MDS on Euclidean pairwise distances matrix is equivalent to PCA.

5.2 Clustering

Wikipedia: Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense or another)
to each other than to those in other groups (clusters). Clustering is one of the main tasks of
exploratory data mining, and a common technique for statistical data analysis, used in many
fields, including machine learning, pattern recognition, image analysis, information retrieval,
and bioinformatics.
Sources: http://scikit-learn.org/stable/modules/clustering.html
