PANDAS
The name Pandas is derived from "panel data".
It is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas allows us to analyze large data sets and draw conclusions based on
statistical theory.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Data science is a branch of computer science in which we study how to store,
use, and analyze data in order to derive information from it.
Pandas can answer questions about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
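The questions above map onto a few DataFrame methods. The following is a minimal sketch; the column names and values are invented for illustration and are not taken from the tutorial's data file.

```python
import pandas as pd

# Hypothetical workout data, invented for illustration only.
df = pd.DataFrame({
    "duration": [30, 45, 60, 90],
    "calories": [200, 300, 400, 600],
})

# Pairwise correlation between all numeric columns.
corr = df.corr()
print(corr)

# Average, maximum, and minimum of a single column.
mean_cal = df["calories"].mean()
max_cal = df["calories"].max()
min_cal = df["calories"].min()
print(mean_cal, max_cal, min_cal)
```

Because the invented calories grow in exact proportion to duration, the correlation between the two columns comes out as 1.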
Pandas can also delete rows that are not relevant or that contain wrong values,
such as empty or NULL values. This is called cleaning the data.
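As a rough illustration of cleaning, the sketch below removes rows with empty values using dropna(). The data is made up for the example, and dropna() is only one of several ways to handle missing values.

```python
import pandas as pd

# Hypothetical data with one missing value (None becomes NaN).
df = pd.DataFrame({
    "duration": [60, 45, 30],
    "calories": [409.0, None, 195.0],
})

# dropna() returns a new DataFrame without the rows that contain empty values.
# The original DataFrame is left unchanged.
cleaned = df.dropna()
print(cleaned)
```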
Once Pandas is installed, import it in your applications by adding
the import keyword:
import pandas
Example
import pandas
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example
Create a simple Pandas Series from a list:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Create Labels
With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
Note: The keys of the dictionary become the labels.
To select only some of the items in the dictionary, use the index argument and
specify only the items you want to include in the Series.
Example
Create a Series using only data from "day1" and "day2":
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
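One detail worth knowing when using the index argument with a dictionary: a label that is not a key in the dictionary does not raise an error, it produces an empty (NaN) value. The sketch below shows this; the label "day4" is invented for the illustration.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is not a key in the dictionary, so its value is NaN (missing).
partial = pd.Series(calories, index=["day1", "day4"])
print(partial)
```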
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
A Series is like a column; a DataFrame is the whole table.
Example
Create a DataFrame from two lists of values, each of which becomes a column:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
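A DataFrame can also be assembled from actual Series objects rather than plain lists. This is a minimal sketch with illustrative variable names; each Series becomes one column, and rows are matched up by index label.

```python
import pandas as pd

# Two Series sharing the default index 0, 1, 2.
calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# The dictionary keys become the column names.
df_from_series = pd.DataFrame({"calories": calories, "duration": duration})
print(df_from_series)
```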
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional
array, or a table with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see from the result above, the DataFrame is like a table with rows and
columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Note: This example returns a Pandas Series.
Example
Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Note: When a list of indexes is passed inside the brackets, the result is a Pandas
DataFrame.
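The difference between the two return types can be checked directly. The sketch below reuses the calories/duration data from the examples above; the variable names are illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)

# A single label returns one row as a Series.
one_row = df.loc[0]

# A list of labels returns a DataFrame, even if the list holds one label.
two_rows = df.loc[[0, 1]]
single_as_frame = df.loc[[0]]

print(type(one_row).__name__)
print(type(two_rows).__name__)
print(type(single_as_frame).__name__)
```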
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
#refer to the named index:
print(df.loc["day2"])
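With named indexes, loc also accepts a list of names, and rows can still be reached by position through the iloc attribute. iloc is not covered in the text above; it is included here only as a related illustration.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# Select several rows by name; the result is a DataFrame.
subset = df.loc[["day1", "day3"]]
print(subset)

# Select the second row by position; the result is a Series named "day2".
second = df.iloc[1]
print(second)
```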
Load Files Into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('[Link]')
print(df)
Read CSV Files
A simple way to store big data sets is to use CSV files (comma separated files).
CSV files contain plain text and are a well-known format that can be read by
virtually any program, including Pandas.
In our examples we will be using a CSV file called '[Link]'.
Load the CSV into a DataFrame:
import pandas as pd
df = pd.read_csv('[Link]')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
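To try read_csv() without a file on disk, the CSV text can be supplied from memory. This is a sketch only: it wraps a small string in io.StringIO, and the three rows are taken from the dictionary example later in these notes rather than from the actual CSV file.

```python
import io
import pandas as pd

# A small CSV document held in memory instead of in a file.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409
60,117,145,479
60,103,135,340
"""

# read_csv() accepts file-like objects as well as file paths.
df = pd.read_csv(io.StringIO(csv_text))
print(df.to_string())
```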
If you have a large DataFrame with many rows, Pandas will only print the first 5
rows and the last 5 rows:
Example
Print the DataFrame without the to_string() method:
import pandas as pd
df = pd.read_csv('[Link]')
print(df)
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with the
pd.options.display.max_rows statement.
Example
Check the number of maximum returned rows:
import pandas as pd
print(pd.options.display.max_rows)
On my system the number is 60, which means that if the DataFrame contains more
than 60 rows, the print(df) statement will print only the headers and the first and
last 5 rows.
You can change the maximum rows number with the same statement.
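Changing the setting is a plain assignment to the same option. In the sketch below the value 9999 is an arbitrary choice, and the original setting is restored at the end so the change does not leak into later code.

```python
import pandas as pd

# Remember the current setting so it can be restored afterwards.
original = pd.options.display.max_rows

# Raise the limit so that larger DataFrames are printed in full.
pd.options.display.max_rows = 9999
raised = pd.options.display.max_rows
print(raised)

# Restore the previous setting.
pd.options.display.max_rows = original
```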
Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world
of programming, including Pandas.
In our examples we will be using a JSON file called '[Link]'.
Example
Load the JSON file into a DataFrame:
import pandas as pd
#replace '[Link]' with the path to your JSON file:
df = pd.read_json('[Link]')
print(df.to_string())
Tip: use to_string() to print the entire DataFrame
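As with CSV, read_json() can be tried without a file by passing a file-like object. The sketch below assumes the column-oriented layout used by the dictionary example in these notes, where each column name maps to an object of row labels and values.

```python
import io
import pandas as pd

# JSON text held in memory; each column maps row labels to values.
json_text = '{"Duration": {"0": 60, "1": 60}, "Pulse": {"0": 110, "1": 117}}'

# Wrapping the text in StringIO lets read_json() treat it like a file.
df = pd.read_json(io.StringIO(json_text))
print(df.to_string())
```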
JSON = Python Dictionary
JSON objects have the same format as Python dictionaries.
If your JSON code is not in a file, but in a Python Dictionary, you can load it into a
DataFrame directly.
Example
Load a Python Dictionary into a DataFrame:
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
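If the JSON arrives as a string rather than as a dictionary, the standard json module can convert it first. In this sketch (with shortened, illustrative data) note that the row labels stay as the strings "0" and "1", because pd.DataFrame() uses the inner dictionary keys as they are.

```python
import json
import pandas as pd

json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# json.loads() turns the JSON text into a nested Python dictionary.
data = json.loads(json_text)

# Outer keys become columns; inner keys become the row labels.
df = pd.DataFrame(data)
print(df)
print(df.loc["1", "Duration"])
```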
Pandas - Analyzing DataFrames
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the
head() method.
The head() method returns the headers and a specified number of rows, starting
from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('[Link]')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5
rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('[Link]')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from
the bottom.
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
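The behavior of head() and tail() can be checked on a small generated DataFrame, without needing the CSV file. The single column "n" below is invented for the illustration.

```python
import pandas as pd

# A DataFrame with 20 rows holding the numbers 0 to 19.
df = pd.DataFrame({"n": range(20)})

first_five = df.head()      # default: first 5 rows
first_three = df.head(3)    # first 3 rows
last_five = df.tail()       # default: last 5 rows

print(first_three)
print(last_five)
```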
Info About the Data
The DataFrame object has a method called info() that gives you more information
about the data set.
Example
Print information about the data:
#info() prints its report itself and returns None, so print() is not needed:
df.info()
Null Values
The info() method also tells us how many Non-Null values are present in
each column. In our data set there are 164 Non-Null values out of 169
in the "Calories" column.
This means that there are 5 rows with no value at all in the "Calories" column,
for whatever reason.
Empty values, or Null values, can be bad when analyzing data, and you should
consider removing rows with empty values. This is a step towards what is called
cleaning data, and you will learn more about that in the next chapters.
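The number of empty values per column can also be counted directly, as an alternative to reading it off the info() report. The sketch below uses a small invented DataFrame, not the 169-row data set described above.

```python
import pandas as pd

# Hypothetical data with two missing "Calories" values.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, None, 282.4, None],
})

# isna() marks each empty cell as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# count() gives the number of Non-Null values in a column.
non_null = df["Calories"].count()
print(non_null)
```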