Course: Introduction to Data Science (SD211105)
Session: 2
Welcome to this tutorial, designed to introduce you to the world of data science. Even if you are
a complete beginner with no prior knowledge of the subject, this guide will help you get
started. Throughout the tutorial, you'll learn key concepts, techniques, and tools used in data
science. By the end of each section, you will have gained new skills for working with and
analyzing data.
To begin, here’s a famous quote from Josh Wills that perfectly encapsulates the role of a data
scientist:
"A data scientist is a person who is better at statistics than any software engineer and
better at software engineering than any statistician."
This definition emphasizes the interdisciplinary nature of data science, requiring a balance of
statistical expertise and programming skills.
In this section, we will walk through the process of importing data and writing functions that help
us get an initial understanding of it. Data comes in various formats, such as CSV, Excel, or
databases, and it's crucial to understand how to load it into your analysis environment.
Steps:
a. Importing Data
To start working with data, we first need to import it into our environment. In Python, we often
use libraries like pandas to handle data efficiently. Let’s import a dataset and take a quick look
at it.
Python code:
# Importing necessary library
import pandas as pd
# Reading a CSV file into a DataFrame
data = pd.read_csv('your_dataset.csv')
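If you do not have a CSV file at hand, the same call can be tried on in-memory text. The sketch below is only a stand-in: the column names and values are made up for illustration, and io.StringIO simply lets pd.read_csv treat a string as if it were a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'your_dataset.csv'
csv_text = """Country,Happiness Score,Freedom
A,7.5,0.60
B,5.2,0.40
C,4.1,0.30
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows of the DataFrame
print(sample.shape)    # (number of rows, number of columns)
```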
After loading the data, it's important to understand its structure—columns, rows, and data types.
Let’s write a function to summarize this:
Python code:
# Function to get a quick summary of the dataset
def data_summary(df):
    print(f"Data Shape: {df.shape}")  # Number of rows and columns
    print("\nColumn Info:")
    df.info()  # Prints data types and non-null counts itself (it returns None)
    print("\nMissing Values:\n", df.isnull().sum())  # Count of missing values per column
Python code:
# Function to check basic statistics of numeric columns
def check_statistics(df):
    print("\nBasic Statistics:\n")
    print(df.describe())  # Descriptive statistics for numeric data
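As a hedged illustration of what these summaries report, the sketch below builds a small made-up DataFrame (the column names and values are hypothetical) containing one missing value and inspects it with the same pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'score'
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [1.0, 2.0, np.nan, 4.0],
})

missing = toy.isnull().sum()   # missing values per column
stats = toy.describe()         # statistics for numeric columns only

print(toy.shape)
print(missing)
print(stats)
```

Note that describe() skips the text column and that its count excludes the missing value.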
These foundational steps will help you gain insights and prepare the data for deeper analysis in
future sections!
Matplotlib is a widely used library in Python for creating static, animated, and interactive
visualizations. It plays a crucial role in data analysis by enabling you to visualize patterns,
trends, and relationships within data. In this section, we'll introduce two common plot types:
scatter plots and histograms.
a. Scatter Plot
A scatter plot is useful when you want to observe the relationship or correlation between two
numerical variables. For example, if we want to see how one variable affects another, we plot
them as points on a graph.
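A numeric companion to the plot is the Pearson correlation coefficient, which pandas exposes as Series.corr. The sketch below uses made-up values (the variable names x and y are hypothetical) in which the second variable is exactly twice the first, so the coefficient should come out at 1 up to rounding.

```python
import pandas as pd

# Hypothetical variables: y is exactly 2 * x, so they are perfectly correlated
x = pd.Series([1, 2, 3, 4, 5])
y = 2 * x

r = x.corr(y)   # Pearson correlation by default
print(round(r, 2))
```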
Python code:
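# Importing necessary libraries
import matplotlib.pyplot as plt
# Function to create a scatter plot
def scatter_plot(x, y, xlabel, ylabel, title):
    plt.scatter(x, y)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
# Example usage
scatter_plot(data['variable1'], data['variable2'], 'Variable 1',
             'Variable 2', 'Scatter Plot of Variable 1 vs Variable 2')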
b. Histogram
A histogram is helpful for understanding the distribution of numerical data, showing how
frequently data points fall within specified ranges.
Python code:
# Function to create a histogram
def histogram(data_column, bins, xlabel, ylabel, title):
    plt.hist(data_column, bins=bins, edgecolor='black')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()
# Example usage
histogram(data['numeric_variable'], bins=50, xlabel='Value',
          ylabel='Frequency', title='Distribution of Numeric Variable')
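To see what the bins argument means numerically, the counts behind a histogram can be computed without drawing anything. This sketch uses numpy.histogram on made-up values; it only illustrates equal-width binning and is not part of the plotting function.

```python
import numpy as np

# Hypothetical values to bin
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range [min, max] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print(counts)   # how many values fall in each bin
print(edges)    # the 5 boundaries that define the 4 bins
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.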
Pandas is a powerful data manipulation library in Python. It provides easy-to-use structures like
Series and DataFrame to manage and analyze data.
a. Comparison Operators
Comparison operators compare two values and return a boolean result (True or False). They
are useful for filtering data, making decisions, or applying conditions.
Python code:
# Example of comparison operators
result = data['column1'] < 5  # Check if values in column1 are less than 5
print(result.head())
b. Boolean Operators
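Boolean operators (and, or, not) evaluate logical expressions and are often used in
conjunction with comparison operators to filter data. When the operands are Pandas Series, use
the element-wise operators &, | and ~ instead, and wrap each comparison in parentheses.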
Python code:
# Example of boolean operators
result = (data['column1'] < 5) & (data['column2'] < 10)  # Checking multiple conditions
print(result.head())
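The following sketch shows the three element-wise operators side by side on a tiny made-up DataFrame (the values are hypothetical; only the column names mirror the example). Each result is a boolean Series with one True or False per row.

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({"column1": [1, 4, 6, 8], "column2": [20, 5, 7, 30]})

both = (df["column1"] < 5) & (df["column2"] < 10)     # element-wise AND
either = (df["column1"] < 5) | (df["column2"] < 10)   # element-wise OR
negated = ~(df["column1"] < 5)                        # element-wise NOT

print(both.tolist())
print(either.tolist())
print(negated.tolist())
```

The parentheses matter because & and | bind more tightly than < in Python.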
c. Series
A Series is a one-dimensional data structure in Pandas. It can hold any data type and is
essentially a column in a DataFrame.
Python code:
# Creating a Pandas Series
series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd',
'e'])
print(series_example)
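As a short sketch of what can be done with such a Series, the example below (the variable names are illustrative) looks up a value by its label, applies arithmetic to every element, and keeps only the elements that satisfy a condition.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

by_label = s["c"]    # access a value by its index label
doubled = s * 2      # arithmetic applies to every element
large = s[s > 3]     # keep only the elements greater than 3

print(by_label)
print(doubled.tolist())
print(large.index.tolist())
```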
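d. DataFrame
A DataFrame is a two-dimensional, tabular data structure in Pandas made up of rows and named
columns, where each column is a Series.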
Python code:
# Creating a DataFrame from a dictionary
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print(df_example)
● DataFrames can be created from various inputs, such as dictionaries, lists, or another
DataFrame.
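As one illustration of another input type, the sketch below builds a DataFrame from a list of row dictionaries (the names rows, people and ages are hypothetical) and shows that selecting a single column yields a Series.

```python
import pandas as pd

# Building a DataFrame from a list of row dictionaries
rows = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
people = pd.DataFrame(rows)

ages = people["Age"]   # selecting one column returns a Series
print(type(ages).__name__)
print(people.shape)
```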
4. Filtering Data
Filtering is a common task when working with data. It allows us to extract specific rows from a
DataFrame based on conditions we set.
You can filter rows by applying conditions to columns, using comparison and boolean operators.
Python code:
# Example: Filtering rows where 'Happiness Score' is greater than 7
# The comparison builds a boolean mask; indexing data with it keeps the matching rows
filtered_data = data[data["Happiness Score"] > 7]
print(filtered_data)
You can apply multiple conditions using logical operators like & and |.
Python code:
import numpy as np
# Combining two conditions with &
Filtered_1 = data[(data["Happiness Score"] > 5) & (data["Freedom"] > 0.35)]
print(Filtered_1.head())
# logical_and() function
Filtered_2 = data[np.logical_and(data["Health (Life Expectancy)"] > 0.94,
                                 data["Happiness Score"] > 7)]
print(Filtered_2.head())
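To make the effect of each operator visible, here is a self-contained sketch on a made-up table. The scores are hypothetical and not taken from any real dataset; only the column names follow the examples in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Score": [7.5, 6.0, 4.8, 7.2],
    "Freedom": [0.60, 0.30, 0.50, 0.20],
})

# Rows meeting both conditions
both = toy[(toy["Happiness Score"] > 5) & (toy["Freedom"] > 0.35)]
# Rows meeting at least one condition
either = toy[(toy["Happiness Score"] > 7) | (toy["Freedom"] > 0.45)]
# np.logical_and builds the same mask as &
same = toy[np.logical_and(toy["Happiness Score"] > 5, toy["Freedom"] > 0.35)]

print(both["Country"].tolist())
print(either["Country"].tolist())
```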
These tools and techniques will form the backbone of your data analysis and visualization
journey in Python!
In Python, loops allow you to iterate over data structures like lists, arrays, and DataFrames. You
can use loops to perform repetitive tasks efficiently. Python provides two types of loops: for
loops and while loops.
a. for Loop
A for loop iterates over the items of an iterable (such as a list or the columns of a DataFrame)
and performs actions on each item.
Python code:
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)
You can also loop through a DataFrame using the for loop to access each column or row.
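A minimal sketch of both cases follows, using a made-up two-row DataFrame (the data and variable names are hypothetical). Iterating over a DataFrame directly yields its column names, while the iterrows() method yields one (index, row) pair per row.

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({"Country": ["A", "B"], "Score": [7.5, 6.0]})

# Looping over the DataFrame itself visits the column names
column_names = []
for col in toy:
    column_names.append(col)

# iterrows() yields (index, row) pairs, where each row is a Series
labels = []
for idx, row in toy.iterrows():
    labels.append(f"{idx}: {row['Country']} scored {row['Score']}")

print(column_names)
print(labels)
```

Row-by-row loops are easy to read but slow on large tables, so column-wise operations are usually preferable when they are available.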
b. while Loop
A while loop keeps running as long as a specified condition remains True. It’s useful when
you don’t know the exact number of iterations beforehand.
Python code:
i = 0
while i < 5:
    print(i)
    i += 1
Loops can be combined with conditionals and data structures to perform complex operations
efficiently.
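As a small sketch of that combination, the example below uses made-up numbers (all names and values are hypothetical): a for loop with an if statement counts the values above a threshold, and a while loop runs for a number of steps that is not known in advance.

```python
# Hypothetical scores for illustration
scores = [7.5, 6.0, 4.8, 7.2]

# for loop with a conditional: count the scores above a threshold
high = 0
for score in scores:
    if score > 7:
        high += 1

# while loop with an unknown number of iterations:
# keep halving until the value drops below 1
value = 20.0
steps = 0
while value >= 1:
    value /= 2
    steps += 1

print(high)
print(steps)
```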
Indexing and slicing are essential techniques to access specific rows, columns, or subsets of
data in a DataFrame.
a. Indexing
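Indexing retrieves a column of the DataFrame by its name.
Python code:
# Example: Selecting a single column by name
country_column = data['Country']
print(country_column)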
You can also use the iloc and loc indexers for more advanced indexing: iloc selects by integer
position, while loc selects by label.
Python code:
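subset_data = data.iloc[0:5, 0:2]  # illustrative choice: first 5 rows, first 2 columns by position
print(subset_data)
subset_data = data.loc[0:4, ['Country', 'Happiness Score']]  # illustrative: assumes the default integer index
print(subset_data)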
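One difference that often surprises beginners is how the two indexers treat the end of a range. The sketch below uses a made-up DataFrame with string labels as its index (the data is hypothetical) to show that iloc excludes the end position while loc includes the end label.

```python
import pandas as pd

# Hypothetical data with string labels as the index
toy = pd.DataFrame(
    {"Score": [7.5, 6.0, 4.8, 7.2]},
    index=["a", "b", "c", "d"],
)

by_position = toy.iloc[0:2]    # positions 0 and 1; the end position is excluded
by_label = toy.loc["a":"c"]    # labels 'a' through 'c'; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```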
b. Slicing
Slicing allows you to extract portions of the DataFrame based on rows or columns.
Python code:
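# Example: Slicing the first five rows of the DataFrame
first_five_rows = data[:5]
print(first_five_rows)
# Example: Slicing specific columns of the DataFrame
Country_and_Region = data[['Country', 'Region']]  # assumes the dataset has 'Country' and 'Region' columns
print(Country_and_Region)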
By using indexing and slicing, you can effectively manage and manipulate different parts of your
DataFrame for analysis. These are powerful tools to extract meaningful subsets of your data
and streamline your workflow.
Summary
These concepts are critical for data manipulation and will help you navigate complex datasets
more easily in Python.