Course: Introduction to Data Science (SD211105)

Session: 2

Topic: Basic Data Science Tutorial

Download the dataset and work through the tutorial below using Python:

https://drive.google.com/file/d/1aO1XLlsV3rft4Z3xRZErTjPAhbi4-uye/view?usp=sharing

Welcome to this tutorial, which introduces you to the world of data science. It assumes no prior
knowledge of the subject, so complete beginners can follow along. Throughout the tutorial, you'll
learn key concepts, techniques, and tools used in data science. By the end of each section, you
will have gained new abilities to work with and analyze data.

To begin, here’s a famous quote from Josh Wills that perfectly encapsulates the role of a data
scientist:

"A data scientist is a person who is better at statistics than any software engineer and
better at software engineering than any statistician."

This definition emphasizes the interdisciplinary nature of data science, requiring a balance of
statistical expertise and programming skills.

1. Import and First Look at the Data

In this section, we will walk through the process of importing data and writing functions that help
us get an initial understanding of it. Data comes in various formats, such as CSV, Excel, or
databases, and it's crucial to understand how to load it into your analysis environment.

Steps:

a. Importing Data

To start working with data, we first need to import it into our environment. In Python, we often
use libraries like pandas to handle data efficiently. Let’s import a dataset and take a quick look
at it.

Python code:
# Import the pandas library
import pandas as pd

# Read a CSV file into a DataFrame
data = pd.read_csv('your_dataset.csv')

# Display the first few rows of the data
print(data.head())

Hint!: your_dataset.csv => 2015.csv

● pandas: A powerful library for data manipulation and analysis.
● pd.read_csv(): Reads a CSV file into a DataFrame, the most common data structure
for handling tabular data in Python.
● data.head(): Displays the first rows of the dataset (five by default) for a quick glance
at the contents.
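
If you want to try pd.read_csv() before downloading the dataset, the sketch below reads CSV text from memory instead of from a file. The column names mirror the tutorial's dataset, but the three rows and their values are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample rows for illustration; the real values live in 2015.csv
csv_text = """Country,Happiness Rank,Happiness Score
Country A,1,7.5
Country B,2,7.1
Country C,3,6.8
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows
print(sample.head(2))
print(sample.shape)
```

Passing a number to head() is handy when five rows are too many or too few for a first look.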

b. Understanding Data Structure

After loading the data, it's important to understand its structure—columns, rows, and data types.
Let’s write a function to summarize this:

Python code:
# Function to get a quick summary of the dataset
def data_summary(df):
    print(f"Data Shape: {df.shape}")  # Number of rows and columns
    print("\nColumn Info:")
    df.info()  # Prints data types and non-null counts itself (it returns None)
    # Count of missing values per column
    print("\nMissing Values:\n", df.isnull().sum())

# Call the function
data_summary(data)

● df.shape: Provides the dimensions of the dataset (rows, columns).
● df.info(): Displays information on the column names, data types, and the number of
non-null values.
● df.isnull().sum(): Identifies missing data in each column.
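
To see what these three tools return, here is a small sketch on a hypothetical three-row DataFrame with one deliberately missing score. The values are invented; only the pandas calls matter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing score
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Happiness Score": [7.5, np.nan, 6.8],
})

rows, cols = df.shape        # shape is a (rows, columns) tuple
missing = df.isnull().sum()  # Series mapping column name to missing count

print(rows, cols)
print(df.dtypes)
print(missing)
```

Because isnull().sum() returns a Series, you can look up a single column with missing["Happiness Score"].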

c. Exploring Data Distributions

To explore data distributions, we can use basic statistical measures:

Python code:
# Function to check basic statistics of numeric columns
def check_statistics(df):
    print("\nBasic Statistics:\n")
    print(df.describe())  # Descriptive statistics for numeric data

# Call the function
check_statistics(data)

● df.describe(): Returns summary statistics for the numeric columns in the dataset: count,
mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median).
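
The result of describe() is itself a DataFrame, so individual statistics can be looked up by row label. The sketch below uses five made-up scores to show this; it is an illustration, not output from the course dataset.

```python
import pandas as pd

# Made-up scores for illustration
scores = pd.DataFrame({"Happiness Score": [4.0, 5.0, 6.0, 7.0, 8.0]})

stats = scores.describe()
print(stats)

# Rows of the summary are labelled count, mean, std, min, 25%, 50%, 75%, max
mean_score = stats.loc["mean", "Happiness Score"]
median_score = stats.loc["50%", "Happiness Score"]  # the 50% row is the median
print(mean_score, median_score)
```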

By the end of this section, you’ve learned how to:

1. Import data using pandas.
2. Understand the structure and basic information about your data.
3. Check for missing values and basic statistics.

These foundational steps will help you gain insights and prepare the data for deeper analysis in
future sections!

2. Matplotlib: Visualization with Python

Matplotlib is a widely-used library in Python for creating static, animated, and interactive
visualizations. It plays a crucial role in data analysis by enabling you to visualize patterns,
trends, and relationships within data. In this section, we'll introduce two common plot types:
scatter plots and histograms.

a. Scatter Plot

A scatter plot is useful when you want to observe the relationship or correlation between two
numerical variables. For example, if we want to see how one variable affects another, we plot
them as points on a graph.

Python code:
# Import the plotting library
import matplotlib.pyplot as plt

# Function to create a scatter plot
def scatter_plot(x, y, xlabel, ylabel, title):
    plt.scatter(x, y)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

# Example usage
scatter_plot(data['variable1'], data['variable2'], 'Variable 1',
             'Variable 2', 'Scatter Plot of Variable 1 vs Variable 2')

Hint!: data['variable1'] => data['Happiness Score'], data['variable2'] => data['Economy (GDP per Capita)']

● plt.scatter(): Plots data points as a scatter plot.
● plt.xlabel(), plt.ylabel(), plt.title(): Add labels and title to the chart.
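
If you run Python as a script or on a machine without a display, plt.show() may not open a window. One possible alternative, sketched below, draws the same kind of scatter plot through Matplotlib's figure and axes objects and saves it to a file. The numbers are made up, and the file name scatter_example.png is only a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Made-up values for illustration only
gdp = [0.3, 0.6, 0.9, 1.2, 1.4]
score = [4.1, 4.9, 5.6, 6.8, 7.3]

fig, ax = plt.subplots()
points = ax.scatter(gdp, score)
ax.set_xlabel("Economy (GDP per Capita)")
ax.set_ylabel("Happiness Score")
ax.set_title("Happiness Score vs GDP per Capita")

# Placeholder output file name
fig.savefig("scatter_example.png")
plt.close(fig)
```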

b. Histogram

A histogram is helpful for understanding the distribution of numerical data, showing how
frequently data points fall within specified ranges.

Python code:
# Function to create a histogram
def histogram(data_column, bins, xlabel, ylabel, title):
    plt.hist(data_column, bins=bins, edgecolor='black')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.show()

# Example usage
histogram(data['numeric_variable'], bins=50, xlabel='Value',
          ylabel='Frequency', title='Distribution of Numeric Variable')

Hint!: data['numeric_variable'] => data['Generosity']

● plt.hist(): Plots the histogram.
● bins: Defines the number of intervals to split the data into.
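
To make the idea of bins concrete, the sketch below counts a small made-up sample into four equal-width intervals with NumPy's histogram function, which is the same kind of counting a histogram plot displays as bars. The sample values are invented for illustration.

```python
import numpy as np

# Made-up sample for illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range 1..9 into 4 equal-width bins and count values in each
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # how many values fall in each bin: 3, 5, 1, 1
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the value 9 is counted in the final bin.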

3. Pandas: Data Manipulation with Python

Pandas is a powerful data manipulation library in Python. It provides easy-to-use structures like
Series and DataFrame to manage and analyze data.

a. Comparison Operators

Comparison operators compare two values and return a boolean result (True or False). They
are useful for filtering data, making decisions, or applying conditions.

Python code:
# Example of comparison operators
# Check if values in column1 are less than 5
result = data['column1'] < 5
print(result.head())

Hint!: data['column1'] => data['Happiness Rank']
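
Comparing a column with a value produces one True or False per row. The sketch below shows this on a short made-up Series of ranks, and also how summing the result counts the rows that match.

```python
import pandas as pd

# Made-up ranks for illustration
ranks = pd.Series([1, 4, 5, 8], name="Happiness Rank")

is_top = ranks < 5       # one boolean per row
print(is_top.tolist())   # [True, True, False, False]

# True counts as 1 and False as 0, so the sum is the number of matches
print(is_top.sum())      # 2
```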

b. Boolean Operators

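Boolean operators evaluate logical expressions and are often used together with comparison
operators to filter data. Python's and, or, and not work on single True/False values; for
element-wise logic on Pandas Series, use &, |, and ~ instead, and wrap each comparison in
parentheses.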

Python code:
# Example of boolean operators
# Checking multiple conditions
result = (data['column1'] < 5) & (data['column2'] < 10)
print(result.head())

Hint!: data['column1'] => data['Happiness Rank'], data['column2'] => data['Happiness Score']

● &: Logical "and"
● |: Logical "or"
● ~: Logical "not"
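
The sketch below applies all three operators to a small made-up DataFrame and also shows what happens if the keyword and is used on two Series by mistake. The values and the 7.3 threshold are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    "Happiness Rank": [1, 4, 7],
    "Happiness Score": [7.5, 7.2, 6.9],
})

top = df["Happiness Rank"] < 5       # True, True, False
high = df["Happiness Score"] > 7.3   # True, False, False

both = top & high      # True only where both are True
either = top | high    # True where at least one is True
not_top = ~top         # flips every value

# The keyword `and` needs a single True/False, so it fails on a Series
try:
    top and high
    used_keyword = True
except ValueError:
    used_keyword = False
    print("Use & instead of `and` for element-wise logic")
```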

c. Series

A Series is a one-dimensional data structure in Pandas. It can hold any data type and is
essentially a column in a DataFrame.

Python code:
# Creating a Pandas Series
series_example = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd',
'e'])
print(series_example)
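
Once a Series exists, you can read values by label or by position and apply arithmetic to every element at once. A minimal sketch using the same five values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s['c']        # value stored under label 'c'
by_position = s.iloc[0]  # first value, regardless of its label
doubled = s * 2          # arithmetic applies to every element
large = s[s > 3]         # keeps only the entries greater than 3

print(by_label, by_position)
print(doubled.tolist())
print(large)
```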

d. DataFrame

A DataFrame is the most commonly used structure in Pandas. It is a two-dimensional table,
similar to an Excel sheet, and can hold multiple types of data.

Python code:
# Creating a DataFrame from a dictionary
df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
})
print(df_example)

● DataFrames can be created from various inputs, such as dictionaries, lists, or another
DataFrame.
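
As one example of another input type, the sketch below builds a DataFrame from a list of dictionaries, where each dictionary is one row. It also shows that picking a single column gives back a Series. The records are illustrative.

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_from_records = pd.DataFrame(records)

ages = df_from_records["Age"]   # a single column is a Series
print(type(ages).__name__)      # Series
print(ages.mean())              # 27.5
```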

4. Filtering Data

Filtering is a common task when working with data. It allows us to extract specific rows from a
DataFrame based on conditions we set.

a. Filtering with Conditions

You can filter rows by applying conditions to columns, using comparison and boolean operators.

Python code:
# Example: Filtering rows where 'Happiness Score' is greater than 7
# The comparison builds a True/False mask; data[...] keeps the rows marked True
filtered_data = data[data["Happiness Score"] > 7]
print(filtered_data)
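
It helps to separate the two steps involved in filtering: the comparison produces a True/False mask, and indexing the DataFrame with that mask returns the matching rows. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Happiness Score": [7.5, 6.9, 7.2],
})

mask = df["Happiness Score"] > 7   # step 1: one boolean per row
happy = df[mask]                   # step 2: keep rows where the mask is True

print(mask.tolist())
print(happy)

# Filtered rows keep their original index labels; reset them if needed
happy_renumbered = happy.reset_index(drop=True)
```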

b. Filtering Multiple Conditions

You can apply multiple conditions using logical operators like & and |.

Python code:
import numpy as np

# Rows where both conditions hold, combined with &
Filtered_1 = data[(data["Happiness Score"] > 5) & (data["Freedom"] > 0.35)]
print(Filtered_1.head())

# The same kind of filter using NumPy's logical_and() function
Filtered_2 = data[np.logical_and(data["Health (Life Expectancy)"] > 0.94,
                                 data["Happiness Score"] > 7)]
print(Filtered_2.head())
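
The & operator and np.logical_and() give the same result when applied to two conditions. The sketch below checks this on a small made-up DataFrame and adds an "or" filter with | for comparison. The thresholds match the ones above, but the data is invented.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    "Happiness Score": [7.5, 5.2, 4.8],
    "Freedom": [0.6, 0.3, 0.2],
})

cond_a = df["Happiness Score"] > 5   # True, True, False
cond_b = df["Freedom"] > 0.35        # True, False, False

with_operator = df[cond_a & cond_b]
with_numpy = df[np.logical_and(cond_a, cond_b)]
either = df[cond_a | cond_b]

print(with_operator.equals(with_numpy))  # True
print(len(with_operator), len(either))   # 1 2
```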

By the end of Sections 2 to 4, you've learned how to:

1. Create scatter plots and histograms using Matplotlib.
2. Manipulate data with Pandas, including comparison operators, boolean operations,
Series, and DataFrames.
3. Filter data based on conditions to extract specific rows for analysis.

These tools and techniques will form the backbone of your data analysis and visualization
journey in Python!

5. Looping through Data Structures: for and while

In Python, loops allow you to iterate over data structures like lists, arrays, and DataFrames. You
can use loops to perform repetitive tasks efficiently. Python provides two types of loops: for
loops and while loops.

a. for Loop

A for loop iterates over items of a sequence (like a list or DataFrame) and performs actions on
each item.

Python code:

# Example: Looping through a list of numbers
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)

You can also loop through a DataFrame using the for loop to access each column or row.

Python code:

# Example: Iterating through DataFrame columns
for column in data.columns:
    print(f"Column name: {column}")
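
Rows can be visited in a similar way. One possible approach, sketched below, uses the itertuples() method, which yields one named tuple per row. The column name Score is a hypothetical stand-in chosen because attribute access needs names without spaces.

```python
import pandas as pd

# Made-up rows; "Score" is a hypothetical column name without spaces
df = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Score": [7.5, 6.9],
})

lines = []
for row in df.itertuples(index=False):
    # Each row is a named tuple, so columns are available as attributes
    lines.append(f"{row.Country}: {row.Score}")

print(lines)
```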

b. while Loop

A while loop keeps running as long as a specified condition remains True. It’s useful when
you don’t know the exact number of iterations beforehand.

Python code:

# Example: Using while loop to print numbers less than 5
i = 0
while i < 5:
    print(i)
    i += 1

Loops can be combined with conditionals and data structures to perform complex operations
efficiently.
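
As a small illustration of a loop whose length is not known in advance, the sketch below walks through a made-up list of scores sorted from high to low and stops at the first score that is not above 7.

```python
# Made-up scores, sorted from high to low
scores = [7.5, 7.2, 6.9, 6.1, 5.4]

i = 0
# Stop at the end of the list or at the first score of 7 or below
while i < len(scores) and scores[i] > 7:
    i += 1

print(f"{i} scores are above 7")
```

Checking i < len(scores) first keeps the loop from reading past the end of the list when every score passes the test.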

6. Indexing and Slicing DataFrames

Indexing and slicing are essential techniques to access specific rows, columns, or subsets of
data in a DataFrame.

a. Indexing

Indexing is used to select rows or columns by their labels or position.


Python code:

# Selecting a column by its name
country_column = data['Country']
print(country_column)

You can also use the iloc and loc indexers for more advanced indexing.

● iloc: Selects rows and columns by integer position (zero-based; the end of a slice is excluded).
● loc: Selects rows and columns by label (the end of a slice is included).

Python code:

# Example: Selecting specific rows and columns using iloc
subset_data = data.iloc[0:5, 1:3]  # Selects rows 0-4 and columns 1-2
print(subset_data)

# Example: Selecting specific rows and columns using loc
# Selects rows 0-4 and the 'Happiness Rank' and 'Country' columns
subset_data = data.loc[0:4, ['Happiness Rank', 'Country']]
print(subset_data)
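
The difference between position and label is easier to see when the row labels are not numbers. The sketch below uses a made-up DataFrame with the hypothetical row labels A to D.

```python
import pandas as pd

# Made-up values with hypothetical row labels
df = pd.DataFrame(
    {"Happiness Rank": [1, 2, 3, 4], "Happiness Score": [7.5, 7.2, 6.9, 6.1]},
    index=["A", "B", "C", "D"],
)

by_position = df.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = df.loc["A":"C"]   # labels A through C; the end label is included
cell = df.loc["B", "Happiness Score"]  # one value by row and column label

print(by_position)
print(by_label)
print(cell)
```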

b. Slicing

Slicing allows you to extract portions of the DataFrame based on rows or columns.

Python code:

# Example: Slicing first 5 rows of the DataFrame
first_five_rows = data[:5]
print(first_five_rows)

# Example: Slicing specific columns of the DataFrame
Country_and_Region = data[['Country', 'Region']]
print(Country_and_Region)
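
Slices also accept negative positions and a step, just as Python lists do, and loc can slice a range of columns by name. A short sketch on made-up data:

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Region": ["X"] * 6,
    "Score": [6, 5, 4, 3, 2, 1],
})

last_two = df[-2:]       # the final two rows
every_other = df[::2]    # rows at positions 0, 2, 4
name_cols = df.loc[:, "Country":"Region"]  # all rows, columns Country to Region

print(last_two)
print(every_other)
print(name_cols.columns.tolist())
```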

By using indexing and slicing, you can effectively manage and manipulate different parts of your
DataFrame for analysis. These are powerful tools to extract meaningful subsets of your data
and streamline your workflow.

Summary

By now, you've learned how to:

1. Loop through data structures using for and while loops.
2. Access specific rows, columns, or subsets of data using indexing and slicing in
DataFrames.

These concepts are critical for data manipulation and will help you navigate complex datasets
more easily in Python.
