P:F:-LTL-UG / 03 / R1
Assignment No. 9
Aim:
Summary statistics, data visualization, boxplot for the features on the ‘titanic’
dataset or any other dataset.
Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot of the
distribution of age with respect to each gender, along with the information about
whether the passengers survived or not. (Column names: 'sex' and 'age')
o Write observations inferred from the above statistics.
Prerequisites
Learning Objectives
Learning Outcome:
o Students will be able to compute statistics on the features of the dataset and use
histograms and box plots on the features of the dataset.
Theory:
Data analysis is a process of inspecting, cleansing, transforming, and
modelling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making. Data analysis has multiple facets and
approaches, encompassing diverse techniques under a variety of names, while being
used in different business, science, and social science domains.
A data set (or dataset) is a collection of data. Most commonly a data set corresponds
to the contents of a single database table, or a single statistical data matrix, where
every column of the table represents a particular variable, and each row corresponds
to a given member of the data set in question.
A box plot displays the five-number summary of a variable: the minimum, the first
quartile (Q1), the median, the third quartile (Q3), and the maximum. Reading it
requires the following terms.
Percentile – The p-th percentile is the value below which p percent of the
observations fall. For example, in a data set of 200 values, 50 values lie below the
25th percentile.
Minimum and Maximum – In a box plot these are usually not the smallest and largest
values in the data. Under the common 1.5 × IQR convention they are the ends of the
whiskers: the most extreme observations that still lie between Q1 − 1.5 × IQR and
Q3 + 1.5 × IQR, where the interquartile range is IQR = Q3 − Q1. Points outside
these bounds are drawn individually as outliers.
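The terms above can be made concrete with a short calculation. The following is a minimal sketch, not part of the original assignment: the function name five_number_summary and the sample ages are made up for illustration, and the whisker bounds assume the common 1.5 × IQR convention.

```python
import numpy as np

def five_number_summary(values):
    """Return the five-number summary plus the 1.5 * IQR whisker bounds."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values, e.g. unknown ages
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1  # interquartile range
    lower_bound = q1 - 1.5 * iqr  # assumed convention for whisker limits
    upper_bound = q3 + 1.5 * iqr
    return {
        "min": x.min(), "q1": q1, "median": median, "q3": q3, "max": x.max(),
        "iqr": iqr, "lower_bound": lower_bound, "upper_bound": upper_bound,
        # observations outside the bounds would be drawn as outlier points
        "outliers": x[(x < lower_bound) | (x > upper_bound)],
    }

# Made-up ages, used only to illustrate the calculation
ages = [2, 18, 22, 24, 28, 30, 35, 40, 80]
summary = five_number_summary(ages)
print(summary)
```

For this sample Q1 = 22, the median is 28, and Q3 = 35, so IQR = 13 and the bounds are 2.5 and 54.5. The ages 2 and 80 fall outside them, which shows why the whisker ends differ from the true minimum and maximum.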
Titanic dataset:
It is one of the most popular datasets used for understanding machine learning
basics. It contains information about passengers aboard the RMS Titanic, which
sank in 1912. This dataset can be used to predict whether a given
passenger survived or not. The CSV file can be downloaded from Kaggle.
Description:
The following code prepares a box plot of the distribution of age with respect to
each gender, along with the information about whether the passengers survived or
not. It imports the required libraries and loads the dataset.
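# Libraries for numerical work, data frames, and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset

# Titanic dataset, read from a local CSV file (e.g. the Kaggle download)
data = pd.read_csv("titanic_train.csv")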
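Once the data is loaded, the plotting step could look like the sketch below. It is one possible approach, not the assignment's reference solution. The function name plot_age_by_sex and the toy rows are hypothetical, and the lowercase column names 'sex', 'age', and 'survived' follow the aim and seaborn's built-in dataset. The Kaggle CSV typically capitalises them, so a rename may be needed first.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_age_by_sex(df):
    """Draw a box plot of age for each sex, split by survival."""
    fig, ax = plt.subplots(figsize=(8, 5))
    # One box per (sex, survived) pair; hue separates survivors from non-survivors
    sns.boxplot(data=df, x="sex", y="age", hue="survived", ax=ax)
    ax.set_xlabel("sex")
    ax.set_ylabel("age")
    ax.set_title("Age distribution by sex and survival")
    return ax

# Toy rows standing in for the real data (made up for illustration)
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "age": [22.0, 35.0, 2.0, 27.0, 38.0, 26.0, 14.0, 40.0],
    "survived": [0, 0, 1, 1, 1, 1, 0, 0],
})

# Summary statistic to accompany the plot: median age per group
median_age = toy.groupby(["sex", "survived"])["age"].median()
print(median_age)

ax = plot_age_by_sex(toy)
```

To use the real data, pass the loaded frame instead of toy, for example plot_age_by_sex(sns.load_dataset("titanic")), and then call plt.show() or ax.figure.savefig(...) to view the result. Comparing the medians and box heights across the four groups gives the observations the aim asks for.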