Ex1_Plotting and Visualization using Numpy and Pandas
Introduction
Data visualization is one of the most important steps in the life cycle of data science, data
analytics, and data engineering. A study or analysis is more engaging and easier to understand
when it is presented with colors and graphics. Visualization elements such as graphs, charts,
and maps make it easier for clients to understand the underlying structure, trends, patterns,
and relationships among the variables in a dataset. Explaining a data summary and analysis
with plain numbers alone is hard to follow for people from both technical and non-technical
backgrounds. Data visualization gives us a clear idea of what the data conveys and makes its
insights more natural to grasp.
Data visualization involves processing a large amount of data and converting it into
meaningful, informative visuals using various tools. Visualizing data requires software tools
that can handle structured and unstructured data from different sources such as files, web
APIs, and databases. We should choose the visualization tool that best fulfills our
requirements. The tool should support interactive plot generation, connectivity to data
sources, combining data sources, automatic data refresh, secure access to data sources, and
exporting widgets. These features help us make good visuals of our data and also save time.
Data Visualization with Pandas:
The Pandas library in Python is mainly used for data analysis. It is not a data visualization
library, but we can create basic plots with it. Pandas is highly useful and practical for
exploratory data analysis plots. For such tasks we do not need to import other data
visualization libraries in addition to Pandas, although Matplotlib must be installed.
As Pandas is Python’s popular data analysis library, it provides several different functions to
visualize our data through the .plot() function. One more advantage of using Pandas for
visualization is that we can chain data analysis functions and plotting functions into a
pipeline, which simplifies the task.
Pandas is an essential data analysis toolkit for Python. It is a Python package providing fast,
flexible, and expressive data structures designed to make working with relational or labeled
data easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python.
The Pandas plot() Method
Pandas comes with a couple of plotting functionalities applicable to DataFrame or Series
objects. They use the Matplotlib library under the hood, which means any plot created by the
Pandas library is a Matplotlib object.
Technically, the Pandas plot() method provides a set of plot styles through the kind keyword
argument to create decent-looking plots. The default value of the kind argument is the string
'line'. There are eleven different string values that can be assigned to the kind argument,
and the value determines what kind of plot is created.
.plot is also an attribute of Pandas DataFrame and Series objects, providing a small subset of
the plots available with Matplotlib. In fact, Pandas makes plotting as simple as writing a
single line of code by automating much of the data visualization procedure for us.
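As a minimal sketch of the two calling styles described above, the following example uses a
small made-up DataFrame (the name toy and its values are assumptions for illustration only).
It requests the same line plot through the kind keyword and through the .plot accessor, then
checks that the result is a Matplotlib Axes object:

```python
import matplotlib.axes
import pandas as pd

# Hypothetical toy data, used only to illustrate the API.
toy = pd.DataFrame({"a": [1, 3, 2, 4], "b": [4, 2, 3, 1]})

# Two equivalent ways to request the same plot style.
ax_kind = toy.plot(kind="line")  # style chosen with the kind keyword
ax_attr = toy.plot.line()        # style chosen with the accessor method

# Both calls return a Matplotlib Axes object that can be customized further.
print(isinstance(ax_kind, matplotlib.axes.Axes))
print(isinstance(ax_attr, matplotlib.axes.Axes))
```

Because the return value is an ordinary Matplotlib object, any Matplotlib method can be used
on it afterwards.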
AIM:
To analyze, plot and visualize the given dataset using NumPy and Pandas data structures.
DESCRIPTION:
Data visualization is a powerful way to capture trends and share the insights gained
from data. It is one of the important steps of data analysis. There are plenty of data
visualization tools available, with many outstanding features. In this exercise, we're
going to learn plotting and visualization with the Pandas, NumPy and Matplotlib
packages. NumPy and Pandas are among Python’s most important libraries for data
preprocessing and data cleaning. The plotting methods in Pandas can also be used to
draw plots, which makes it easier to visualize arrays, Series and DataFrames.
ALGORITHM:
3. Load the weekly closing prices of the Facebook, Microsoft, and Apple stocks over the
previous months from a CSV file using the read_csv function.
4. Plot and visualize the given dataset “iris_data” with the following plots: scatter, bar,
line, histogram, area, box, hexagonal bin, pie, density, and scatter matrix.
1. The plot() method makes it easy to draw plots. Start by importing the necessary
libraries.
2. Import the necessary libraries and load the dataset required for visualization, then
display the contents of the DataFrame. In a notebook, also add the %matplotlib inline
magic command so that the plotted figures appear in the notebook cells
correctly:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
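# URL of the CSV file with the weekly stock prices
dataset_url = 'https://raw.githubusercontent.com/m-mehdi/pandas_tutorials/main/weekly_stocks.csv'
# Parse the Date column as datetimes and use it as the index
df = pd.read_csv(dataset_url, parse_dates=['Date'], index_col='Date')
# Show all columns when printing the DataFrame
pd.set_option('display.max_columns', None)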
print(df.head())
Output:
                  MSFT          FB        AAPL
Date
2021-05-24  249.679993  328.730011  124.610001
2021-05-31  250.789993  330.350006  125.889999
2021-06-07  257.890015  331.260010  127.349998
2021-06-14  259.429993  329.660004  130.460007
2021-06-21  265.019989  341.369995  133.110001
Line Plot
3. The default plot is the line plot that plots the index on the x-axis and the other numeric
columns in the DataFrame on the y-axis. Plot a line plot and see how Microsoft performed over
the previous 12 months:
df.plot(y='MSFT', figsize=(9,6))
NOTE
The figsize argument takes a tuple of two values, the width and height in inches, and
allows us to change the size of the output figure. The default values of the width
and height are 6.4 and 4.8, respectively.
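The effect of figsize can be checked directly on the figure that Pandas creates. The sketch
below assumes a tiny DataFrame built from three rounded MSFT values (the name toy is
hypothetical) and reads the figure size back from Matplotlib:

```python
import pandas as pd

# Three MSFT closing prices, rounded from the output shown earlier.
toy = pd.DataFrame({"MSFT": [249.68, 250.79, 257.89]})

# Request a figure that is 9 inches wide and 6 inches high.
ax_custom = toy.plot(y="MSFT", figsize=(9, 6))

# Read the size back from the Matplotlib figure that owns the Axes.
custom_size = tuple(float(v) for v in ax_custom.figure.get_size_inches())
print(custom_size)
```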
4. We can plot multiple lines from the data by providing a list of column names and
assigning it to the y argument. For example, let's see how the three companies performed over
the previous year:
df.plot.line(y=['FB', 'AAPL', 'MSFT'], figsize=(10,6))
5. Use the other parameters provided by the plot() method, such as title and ylabel, to add
more details to a plot.
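The extra plot() parameters mentioned in step 5 could be used roughly as follows. This is a
sketch only: the DataFrame msft is rebuilt from the first five rows printed earlier, and the
title text is an assumption chosen for illustration.

```python
import pandas as pd

# First five weekly MSFT closing prices from the output shown earlier.
msft = pd.DataFrame(
    {"MSFT": [249.679993, 250.789993, 257.890015, 259.429993, 265.019989]},
    index=pd.to_datetime(
        ["2021-05-24", "2021-05-31", "2021-06-07", "2021-06-14", "2021-06-21"]
    ),
)

ax = msft.plot(
    y="MSFT",
    figsize=(9, 6),
    title="Microsoft weekly closing price",  # illustrative title, not from the source
    ylabel="Price",                          # y-axis label
    grid=True,                               # draw grid lines
)
```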
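Bar Plot
6. A bar chart compares aggregated values across categories. First, resample the data by
month end, take the mean, and keep the last three months:
# Newer pandas versions (2.2+) expect rule='ME'; 'M' is deprecated there
df_3Months = df.resample(rule='M').mean()[-3:]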
print(df_3Months)
Output:
                  MSFT          FB        AAPL
Date
2022-03-31  298.400002  212.692505  166.934998
2022-04-30  282.087494  204.272499  163.704994
2022-05-31  262.803335  198.643331  147.326665
7. Create a bar chart based on the aggregated data by assigning the bar string value to the
kind argument:
df_3Months.plot(kind='bar', figsize=(10,6), ylabel='Price')
8. Create horizontal bar charts by assigning the barh string value to the kind argument:
df_3Months.plot(kind='barh', figsize=(9,6))
9. The data can also be plotted on stacked vertical or horizontal bar charts, which
place the different groups on top of each other. The height of the resulting bar shows the
combined result of the groups. To create a stacked bar chart, assign True to the
stacked argument, like this:
df_3Months.plot(kind='bar', stacked=True, figsize=(9,6))
Histogram
10. A histogram is a type of bar chart that represents the distribution of numerical data
where the x-axis represents the bin ranges while the y-axis represents the data frequency
within a certain interval. The bins argument specifies the number of bin intervals, and the alpha
argument specifies the degree of transparency.
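The bins and alpha arguments described above might be used as in the sketch below. The data
here is synthetic: the DataFrame prices, its column values, and the random seed are
assumptions made so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Synthetic price-like data for illustration only (not the real stock data).
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {
        "MSFT": rng.normal(280, 20, 200),
        "FB": rng.normal(300, 40, 200),
    }
)

# 25 bin intervals; alpha=0.6 makes the overlapping bars partly transparent.
ax = prices.plot(kind="hist", bins=25, alpha=0.6, figsize=(9, 6))
```

Both columns share the same bin edges, so their distributions can be compared directly.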
Box Plot
12. A box plot consists of three quartiles and two whiskers that summarize the data in a set
of indicators: minimum, first quartile, median, third quartile, and maximum values. A box plot
conveys useful information, such as the interquartile range (IQR), the median, and the outliers
of each data group.
df.plot(kind='box', figsize=(9,6))
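The indicators that a box plot summarizes can also be computed by hand with NumPy. The
following sketch uses the five MSFT values printed earlier and assumes the common 1.5 × IQR
rule for whisker limits, which is Matplotlib's default:

```python
import numpy as np

# First five weekly MSFT closing prices from the output shown earlier.
msft = np.array([249.679993, 250.789993, 257.890015, 259.429993, 265.019989])

# Quartiles: 25th, 50th (median) and 75th percentiles.
q1, median, q3 = np.percentile(msft, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = msft[(msft < lower_limit) | (msft > upper_limit)]

print(q1, median, q3, iqr)
print(outliers)
```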
13. Create horizontal box plots, like horizontal bar charts, by assigning False to the vert
argument.
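A horizontal box plot as described in step 13 could look like the sketch below. The DataFrame
is rebuilt from the first five rows printed earlier so that the example is self-contained.

```python
import matplotlib.axes
import pandas as pd

# First five rows of the weekly stock data shown earlier.
df = pd.DataFrame(
    {
        "MSFT": [249.679993, 250.789993, 257.890015, 259.429993, 265.019989],
        "FB": [328.730011, 330.350006, 331.260010, 329.660004, 341.369995],
        "AAPL": [124.610001, 125.889999, 127.349998, 130.460007, 133.110001],
    }
)

# vert=False lays the boxes out horizontally, one row per column.
ax = df.plot(kind="box", vert=False, figsize=(9, 6))
```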
Area Plot
14. An area plot is an extension of a line chart that fills the region between the line chart
and the x-axis with a color. If more than one area chart displays in the same plot, different
colors distinguish different area charts.
df.plot(kind='area', figsize=(9,6))
15. The Pandas plot() method creates a stacked area plot by default. It's a common task to
unstack the area chart by assigning False to the stacked argument.
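An unstacked area plot as described in step 15 might be produced as follows. The sketch
rebuilds the DataFrame from the first five rows printed earlier.

```python
import pandas as pd

# First five rows of the weekly stock data shown earlier.
df = pd.DataFrame(
    {
        "MSFT": [249.679993, 250.789993, 257.890015, 259.429993, 265.019989],
        "FB": [328.730011, 330.350006, 331.260010, 329.660004, 341.369995],
        "AAPL": [124.610001, 125.889999, 127.349998, 130.460007, 133.110001],
    },
    index=pd.to_datetime(
        ["2021-05-24", "2021-05-31", "2021-06-07", "2021-06-14", "2021-06-21"]
    ),
)

# stacked=False draws each area from the x-axis instead of on top of the others.
ax = df.plot(kind="area", stacked=False, figsize=(9, 6))
```

With stacked=False, Pandas draws the areas semi-transparent by default so that overlapping
regions remain visible.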
Pie Plot
16. A pie plot is a great proportional representation of numerical data in a column. The
following example shows the average Apple stock price distribution over the previous three
months:
df_3Months.index=['March', 'April', 'May']
df_3Months.plot(kind='pie', y='AAPL', legend=False, autopct='%.f')
17. A legend is displayed on pie plots by default, so assign False to the legend keyword to
hide it. The new keyword argument in the code above is autopct, which shows the
percent value on the pie chart slices.
To represent the data of all the columns in multiple pie charts as subplots, assign True
to the subplots argument.
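The subplots option could be used as in the following sketch. The DataFrame is rebuilt from
the three monthly averages printed earlier, with the month names as the index; the figsize
value is an assumption.

```python
import pandas as pd

# Monthly averages from the output shown earlier, indexed by month name.
df_3Months = pd.DataFrame(
    {
        "MSFT": [298.400002, 282.087494, 262.803335],
        "FB": [212.692505, 204.272499, 198.643331],
        "AAPL": [166.934998, 163.704994, 147.326665],
    },
    index=["March", "April", "May"],
)

# subplots=True draws one pie chart per column and returns an array of Axes.
axes = df_3Months.plot(
    kind="pie", subplots=True, legend=False, autopct="%.f", figsize=(12, 4)
)
print(len(axes))
```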
Scatter Plot
18. Scatter plots plot data points on the x and y axes to show the correlation between two
variables. The scatter plot below shows the relationship between Microsoft and Apple stock
prices.
df.plot(kind='scatter', x='MSFT', y='AAPL', figsize=(9,6), color='Green')
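The correlation that the scatter plot shows visually can also be quantified. The sketch below
computes the Pearson correlation coefficient with NumPy and with Pandas, using only the first
five rows printed earlier, so the value is illustrative and not a result for the full dataset:

```python
import numpy as np
import pandas as pd

# First five rows of the MSFT and AAPL columns shown earlier.
df = pd.DataFrame(
    {
        "MSFT": [249.679993, 250.789993, 257.890015, 259.429993, 265.019989],
        "AAPL": [124.610001, 125.889999, 127.349998, 130.460007, 133.110001],
    }
)

# Pearson correlation from the 2x2 correlation matrix computed by NumPy.
r_numpy = np.corrcoef(df["MSFT"], df["AAPL"])[0, 1]

# The same coefficient using the Pandas Series.corr method.
r_pandas = df["MSFT"].corr(df["AAPL"])

print(round(r_numpy, 3), round(r_pandas, 3))
```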
Hexbin Plot
19. When the data is very dense, a hexagon bin plot, also known as a hexbin plot, can be an
alternative to a scatter plot. In other words, when the number of data points is enormous, and
each data point can't be plotted separately, it's better to use this kind of plot that represents
data in the form of a honeycomb. Also, the color of each hexbin defines the density of data
points in that range.
The gridsize argument specifies the number of hexagons in the x-direction. A
larger grid size means more and smaller bins. The default value of the gridsize
argument is 100.
df.plot(kind='hexbin', x='MSFT', y='AAPL', gridsize=10, figsize=(10,6))
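To see how gridsize changes the binning, the following sketch draws the same dense data twice
and counts the hexagonal cells. The data is synthetic (the DataFrame dense, its columns, and
the random seed are assumptions), since the weekly stock data has too few points to show the
effect clearly.

```python
import numpy as np
import pandas as pd

# Synthetic, densely populated data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 5000)
dense = pd.DataFrame({"x": x, "y": x + rng.normal(0, 0.5, 5000)})

# Same data with a coarse grid and a finer grid.
ax_coarse = dense.plot(kind="hexbin", x="x", y="y", gridsize=10, figsize=(10, 6))
ax_fine = dense.plot(kind="hexbin", x="x", y="y", gridsize=30, figsize=(10, 6))

# Each hexbin plot is a single collection; its array holds one count per cell.
n_coarse = len(ax_coarse.collections[0].get_array())
n_fine = len(ax_fine.collections[0].get_array())
print(n_coarse, n_fine)
```

The finer grid produces more, smaller cells, which matches the description of gridsize above.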
KDE Plot
20. A Kernel Density Estimate (KDE) plot visualizes the probability density of a continuous
variable using a non-parametric estimate. Internally, this plot uses Gaussian kernels to
estimate the probability density function (PDF).
df.plot(kind='kde')
21. We can also specify the bandwidth, which affects the smoothness of the KDE plot, like this:
df.plot(kind='kde', bw_method=0.1)
22. Selecting a small bandwidth, as in the call above, leads to under-smoothing,
which means the density plot appears as a combination of individual peaks. In contrast, a
large bandwidth, as in the call below, leads to over-smoothing, which means the density plot
appears as a unimodal distribution.
df.plot(kind='kde', bw_method=1)
Conclusion
Thus, plotting and visualization using NumPy and Pandas data structures was
executed and the output was verified.