Unit3_4) Matplotlib and seaborn.ipynb - Colab
Summary:
First quartile (Q1): 25th percentile, the lower edge of the box.
Third quartile (Q3): 75th percentile, the upper edge of the box.
Interquartile Range (IQR): The height of the box represents the IQR (Q3 - Q1).
Whiskers: Extend to the smallest and largest values within 1.5 × IQR of the quartiles.
Outliers: Points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are plotted individually.
Grouped data: Comparing continuous data across categories (e.g., test scores by gender or region).
Box plots are not suitable for discrete or categorical data without an inherent order or quantitative value.
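The quartile, IQR, and outlier rules above can be checked by hand before plotting. The following is a minimal sketch, assuming a small made-up list of test scores (`scores` is hypothetical, not data from this notebook) and NumPy's default percentile interpolation:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical test scores; 95 is deliberately far from the rest.
scores = [52, 55, 58, 60, 61, 63, 65, 66, 68, 70, 95]

# Q1 and Q3 are the 25th and 75th percentiles; the IQR is their difference.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr, "Outliers:", outliers)

# plt.boxplot uses whiskers of 1.5 × IQR by default (whis=1.5).
plt.boxplot(scores)
plt.title("Box Plot")
plt.ylabel("Score")
plt.show()
```

In the resulting plot, the box should span Q1 to Q3 and the single extreme score should appear as a separate point above the upper whisker.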
Histogram
Summary:
Distribution Representation: Visualizes the frequency distribution of a dataset.
Bins (Intervals): The x-axis is divided into intervals or "bins," representing ranges of data values.
Frequency: The y-axis represents the frequency or count of data points within each bin.
Bar Heights: The height of each bar indicates the number of data points in the corresponding bin.
Continuous Data: The bars touch each other, indicating the continuous nature of the data.
Customizable Bin Size: Bin width affects the level of detail; narrower bins show more detail, while wider bins summarize data.
Grouped or binned discrete data: Example: Exam scores (e.g., 0–10, 11–20, etc.).
Histograms are not suitable for categorical data (nominal or ordinal) as they are designed for numerical, continuous, or grouped discrete
data. For categorical data, a bar chart is more appropriate.
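As an illustration of bins and frequencies, here is a minimal sketch using `plt.hist`. The exam scores are made up for the example, and the 10-point bin width is just one possible choice:

```python
import matplotlib.pyplot as plt

# Hypothetical exam scores.
scores = [35, 42, 48, 51, 55, 58, 61, 63, 64, 67,
          70, 72, 75, 78, 81, 85, 88, 93]

# Bin edges every 10 points (30, 40, ..., 100).
# A narrower step such as range(30, 101, 5) would show more detail.
bins = list(range(30, 101, 10))

# plt.hist returns the count per bin, the bin edges, and the drawn bars.
counts, edges, _ = plt.hist(scores, bins=bins, color='skyblue', edgecolor='black')
plt.title("Histogram")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()
```

Each bar's height equals the entry of `counts` for that bin, so the counts add up to the number of scores.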
Column Chart
Summary:
Bars: Individual bars represent categories. Height (vertical bar chart) or length (horizontal bar chart) represents the value associated with the
category.
Axis Labels: One axis lists the categories; the other shows the value scale.
Space Between Bars: Bars are separated by gaps to emphasize discrete categories.
Customization: Can be clustered (side-by-side for multiple variables) or stacked (segments within a bar).
Ordinal data: Example: Rankings or ratings with a meaningful order (e.g., good, better, best).
Aggregated Numerical Data: Example: Sales totals, survey counts, or averages grouped by category.
Bar charts are not suitable for continuous data unless it has been grouped into discrete intervals. For continuous data, a histogram or line
chart is more appropriate.
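A clustered column chart can be sketched with two `plt.bar` calls that share the same category slots. The region names and sales figures below are hypothetical, chosen only to illustrate the layout:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sales totals per region for two years.
regions = ["North", "South", "East", "West"]
sales_2023 = [120, 95, 140, 80]
sales_2024 = [135, 110, 125, 100]

positions = np.arange(len(regions))  # one slot per category
width = 0.4                          # two bars fill 0.8 of a slot, leaving a gap

# Clustered bars: shift each series half a bar width off the slot center.
bars_2023 = plt.bar(positions - width / 2, sales_2023, width, label="2023")
bars_2024 = plt.bar(positions + width / 2, sales_2024, width, label="2024")
plt.xticks(positions, regions)  # label each slot with its category name
plt.title("Column Chart")
plt.xlabel("Region")
plt.ylabel("Sales")
plt.legend()
plt.show()
```

For a stacked version, both series can be drawn at `positions` with the second call given `bottom=sales_2023`, so its segments sit on top of the first series.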
Pie Chart
Summary:
Labels: Categories and their corresponding values or percentages are often labeled directly on the chart or in a legend.
Limitations: Becomes less effective when there are too many slices or when values are very similar.
Percentages or Proportions: Example: Survey results (e.g., "Yes," "No," "Maybe" responses).
Limited Categories: Works best with few categories (usually fewer than 6–8).
Pie charts are not suitable for datasets with a large number of categories, overlapping categories, or comparisons across multiple groups.
A bar chart or stacked bar chart might be a better alternative for such cases.
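A pie chart of survey responses might look like the sketch below. The response counts are invented for illustration; `autopct` asks Matplotlib to print each slice's percentage on the chart:

```python
import matplotlib.pyplot as plt

# Hypothetical survey results.
labels = ["Yes", "No", "Maybe"]
responses = [45, 30, 25]

# Each slice's share of the whole, in percent (what autopct will display).
total = sum(responses)
shares = [100 * r / total for r in responses]

plt.pie(responses, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title("Pie Chart")
plt.axis('equal')  # equal aspect ratio keeps the pie circular
plt.show()
```

With only three slices of clearly different sizes the chart stays readable; with many similar slices, the same data would likely be clearer as a bar chart.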
Line Plot
Summary:
Continuous Data Representation: Visualizes trends, changes, or patterns over time or across sequential data points.
X-Axis: Represents independent variables, often time or a sequence (e.g., days, months, years).
Y-Axis: Represents dependent variables, showing values corresponding to each point on the x-axis.
Customization: Can include markers, dashed lines, or smoothing for enhanced interpretation.
Time-series data: Example: Monthly sales, daily website traffic, or yearly population growth.
Sequential Data: Example: Progression of a measurement (e.g., growth rate, cumulative counts).
Line charts are not suitable for categorical data without a natural order or for visualizing proportions (use bar or pie charts for these).
They work best when showing data relationships over a continuum.
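import matplotlib.pyplot as plt
# Sample data: x positions and the value at each position
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Draw one line with a circular marker at each data point
plt.plot(x, y, marker='o', color='green', label="Line")
plt.title("Line Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()  # shows the label passed to plt.plot
plt.show()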
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
z = [3, 7, 4, 6, 12]
plt.plot(x, y, marker='o', color='green', label="Line1")
plt.plot(x, z, marker='v', color='red', label="Line2")
plt.title("Line Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()
Area Chart
Summary:
Continuous Data Representation: Similar to a line chart but with the area below the line filled with color or shading.
Trend Visualization: Highlights trends and changes over time for one or multiple datasets.
Cumulative Impact: Stacked area charts show cumulative totals across categories or groups.
Y-Axis: Represents dependent variables, showing values corresponding to each x-axis point.
Cumulative Data: Example: Total expenses vs. income, energy usage over a period.
Comparative Data: Example: Stacked area charts to show contributions of different sources to a total.
Area charts are not suitable for datasets with too many overlapping series or for precise value comparison between groups (use line or
bar charts instead). They work best when the emphasis is on trends and magnitude over time.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.fill_between(x, y, color='lightblue', alpha=0.7)
plt.plot(x, y, color='blue', label="Area")
plt.title("Area Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()
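x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
z = [3, 7, 4, 6, 12]
# Fill under each series; alpha keeps the overlapping regions visible
plt.fill_between(x, y, color='lightblue', alpha=0.5, label="Area1")
plt.fill_between(x, z, color='lightcoral', alpha=0.5, label="Area2")
plt.title("Area Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()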
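For the stacked area charts mentioned in the summary, Matplotlib provides `plt.stackplot`, which draws each series on top of the previous one. The sketch below reuses the same sample values; the series labels are placeholders:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
z = [3, 7, 4, 6, 12]

# The top edge of the stack is the cumulative total of both series.
totals = [a + b for a, b in zip(y, z)]

# stackplot layers z on top of y instead of overlapping them.
plt.stackplot(x, y, z, labels=["Series 1", "Series 2"], alpha=0.7)
plt.title("Stacked Area Chart")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend(loc="upper left")
plt.show()
```

Unlike two overlapping filled areas, the height of the upper band here shows only the second series' contribution, while the top edge shows the combined total.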
Scatter Plot
Summary:
Bivariate Data Representation: Plots two variables against each other to observe their relationship.
Points: Each point represents a single data observation, with coordinates corresponding to values of the two variables.
X-Axis and Y-Axis: The x-axis represents the independent variable, and the y-axis represents the dependent variable.
Outliers: Visually reveals data points that deviate significantly from the overall pattern.
Customization: Points can vary in size (bubble chart), color, or shape to represent additional dimensions of data.
Types of Data Suitable for a Scatter Plot:
Numerical data: Example: Exam scores vs. study hours, height vs. weight.
Bivariate data: Example: Age vs. income, temperature vs. electricity consumption.
Data for Correlation Analysis: Example: Relationship between advertising spend and sales.
Multi-Dimensional Data: With color or size encoding, it can show three or more variables (e.g., GDP, population, and region).
Scatter plots are not suitable for categorical data without numeric relationships or for datasets requiring summaries of proportions or
distributions. They excel in exploring relationships, trends, and variability in bivariate data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 3, 5]
plt.scatter(x, y, color='red', label='Points')
plt.title("Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.show()
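The customization point above (size and color as extra dimensions) can be sketched as a bubble chart. All three variables below are hypothetical, and scaling sleep hours by 30 to get marker sizes is an arbitrary choice for readability:

```python
import matplotlib.pyplot as plt

# Hypothetical observations: one entry per student.
study_hours = [1, 2, 3, 4, 5, 6]
exam_scores = [52, 58, 65, 70, 78, 85]
sleep_hours = [5, 6, 8, 7, 6, 9]

# Marker area (in points squared) encodes the third variable.
sizes = [h * 30 for h in sleep_hours]

# s sets the marker size, c sets the value mapped to color via cmap.
points = plt.scatter(study_hours, exam_scores, s=sizes, c=sleep_hours,
                     cmap='viridis', alpha=0.8)
plt.colorbar(points, label="Sleep (hours)")
plt.title("Bubble Chart")
plt.xlabel("Study hours")
plt.ylabel("Exam score")
plt.show()
```

Here position shows the relationship between study hours and exam score, while both bubble size and color carry the third variable.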
Heatmap
Summary:
Matrix Representation: Displays data in a matrix format where each cell's color represents a value or intensity.
Color Gradient: The color scale (often from low to high) represents the magnitude of values, providing a visual way to understand data
distribution.
X-Axis and Y-Axis: Used to represent categorical or ordinal variables (e.g., time periods, regions, or features) that define the rows and columns
of the matrix.
Correlation Visualization: Often used to show relationships, such as correlation matrices or similarity between variables.
Dense Information: Allows easy identification of patterns, clusters, or outliers in large datasets.
Customizable Colors: Color schemes can be adjusted to suit the context or to emphasize specific trends.
Numerical data: Example: Correlation between variables, performance metrics across time and categories.
Tabular Data: Example: Sales figures across months and regions, temperature variations across time and locations.
Matrix Data: Example: Gene expression data, user activity on a website (user-item interaction matrix).
Comparative Data: Example: Survey responses (e.g., rating scales) across multiple groups.
Categorical Data (with aggregated numerical values): Example: Counts or frequencies of categories over different time periods.
Heatmaps are not ideal for visualizing highly detailed individual values, as they focus on patterns and trends in large datasets. For precise
comparisons of individual data points, a scatter plot, bar chart, or table may be more appropriate.
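import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = np.random.rand(5, 5)  # 5 × 5 matrix of random values in [0, 1)
sns.heatmap(data, annot=True, cmap='coolwarm')  # annot=True prints each cell's value
plt.title("Heatmap")
plt.show()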
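Since heatmaps are often used for correlation matrices, the following sketch shows one way to build one with pandas and seaborn. The DataFrame is synthetic: the column names and the relationship between them are assumptions made up for this example, and the seed is fixed only for repeatability:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data: exam_score depends on study_hours, shoe_size is unrelated.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, 50)
df = pd.DataFrame({
    "study_hours": hours,
    "exam_score": 50 + 4 * hours + rng.normal(0, 5, 50),
    "shoe_size": rng.uniform(36, 46, 50),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Fix the color scale to [-1, 1] so colors are comparable across plots.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```

The diagonal is always 1 (each variable correlates perfectly with itself), and the strongly related pair should stand out in a warm color against the near-zero cells.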