NumPy and Pandas Tutorial

Uploaded by

omvati343

NumPy and Pandas for Data Analysis AI ML Training

NumPy Tutorial
Introduction

NumPy (Numerical Python) is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.
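
As a rough illustration of what array support means in practice, the sketch below compares a plain Python list with a NumPy array. The variable names are illustrative only and are not part of the tutorial.

```python
import numpy as np

# A plain Python list needs a loop or comprehension for element-wise math
values = [1.0, 2.0, 3.0, 4.0]
squared_list = [v ** 2 for v in values]

# A NumPy array applies the same operation to every element at once
arr = np.array(values)
squared_arr = arr ** 2

print(squared_list)
print(squared_arr)
print(arr.ndim, arr.shape, arr.dtype)  # number of dimensions, shape, element type
```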

Installation

To install NumPy, use the following command:

pip install numpy

Basic Operations

Importing NumPy

import numpy as np

Creating Arrays

# Create a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)

# Create an array with zeros (3 rows, 4 columns)
zeros_array = np.zeros((3, 4))
print(zeros_array)

# Create an array with ones (2 rows, 3 columns)
ones_array = np.ones((2, 3))
print(ones_array)

# Create an identity matrix
identity_matrix = np.eye(3)
print(identity_matrix)

# Create an array with a range of values (start 10, stop before 20, step 2)
range_array = np.arange(10, 20, 2)
print(range_array)

# Create an array with evenly spaced values (5 values from 0 to 1, both ends included)
linspace_array = np.linspace(0, 1, 5)
print(linspace_array)

Array Operations

# Arithmetic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

LinkedIn: www.linkedin.com/in/nidhi-grover-raheja-904211138

print(a + b) # Addition
print(a - b) # Subtraction
print(a * b) # Element-wise multiplication
print(a / b) # Element-wise division

# Matrix multiplication
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print(np.dot(matrix_a, matrix_b))

# Broadcasting
array_broadcast = np.array([1, 2, 3])
print(array_broadcast + 1) # Adds 1 to each element

# Statistical operations
print(np.mean(a)) # Mean
print(np.median(a)) # Median
print(np.std(a)) # Standard deviation (population, ddof=0 by default)
print(np.sum(a)) # Sum
print(np.min(a)) # Minimum
print(np.max(a)) # Maximum
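
The broadcasting example above adds a scalar to a 1D array. A slightly larger sketch, using illustrative array names that are not in the tutorial, shows the same rule applied between a 2D array and a row or a column, and shows that the @ operator matches np.dot for 2D arrays.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

row_sum = matrix + row   # row is stretched across both rows of matrix
col_sum = matrix + col   # col is stretched across all three columns
print(row_sum)
print(col_sum)

# For 2D arrays, the @ operator gives the same result as np.dot
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
product = matrix_a @ matrix_b
print(product)
```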

Indexing and Slicing

array = np.array([1, 2, 3, 4, 5, 6])

# Indexing
print(array[0]) # First element
print(array[-1]) # Last element

# Slicing
print(array[1:4]) # Elements from index 1 to 3
print(array[:3]) # First three elements
print(array[3:]) # Elements from index 3 to end
print(array[::2]) # Every second element

Reshaping Arrays

array = np.arange(1, 10)
reshaped_array = array.reshape((3, 3))
print(reshaped_array)

# Flattening arrays
flattened_array = reshaped_array.flatten()
print(flattened_array)
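
Two details of reshaping are easy to miss: a dimension given as -1 is inferred from the array size, and flatten() returns a copy while ravel() returns a view when the memory layout allows it. The following is a minimal sketch of both points, with illustrative names.

```python
import numpy as np

# -1 lets NumPy infer the number of columns (12 elements / 3 rows = 4)
grid = np.arange(1, 13).reshape((3, -1))
print(grid.shape)

# flatten() always copies, so editing the result leaves grid unchanged
flat_copy = grid.flatten()
flat_copy[0] = 99

# ravel() returns a view for a contiguous array, so edits write through to grid
flat_view = grid.ravel()
flat_view[0] = -1

print(grid[0, 0])                        # changed through the view
print(np.shares_memory(flat_view, grid))
```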

Pandas Tutorial
Introduction

Pandas is a library providing high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.

Installation

To install Pandas, use the following command:

pip install pandas

Basic Operations

Importing Pandas

import pandas as pd

Creating DataFrames

# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a CSV file
df_from_csv = pd.read_csv('path_to_csv_file.csv')
print(df_from_csv)

Viewing Data

# Display the first few rows
print(df.head())

# Display the last few rows
print(df.tail())

# Display the data types of columns
print(df.dtypes)

# Display the shape of the DataFrame
print(df.shape)

# Display summary statistics
print(df.describe())

Selecting Data

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by integer position (iloc; the slice end is excluded)
print(df.iloc[0]) # First row
print(df.iloc[0:2]) # First two rows

# Select rows by label (loc; the slice end is included)
print(df.loc[0]) # First row
print(df.loc[0:2]) # First three rows (inclusive)

# Conditional selection
print(df[df['Age'] > 30])
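
Conditions can also be combined. The sketch below rebuilds the same sample DataFrame and filters it with & (and), | (or), and isin(); each condition needs its own parentheses because of operator precedence. The result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})

# Both conditions must hold
over_25_not_paris = df[(df['Age'] > 25) & (df['City'] != 'Paris')]

# Either condition may hold
young_or_berlin = df[(df['Age'] < 25) | (df['City'] == 'Berlin')]

# Membership test against a list of values
in_cities = df[df['City'].isin(['Paris', 'London'])]

print(over_25_not_paris)
print(young_or_berlin)
print(in_cities)
```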

Adding and Dropping Columns

# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']
print(df)

# Drop a column
df = df.drop('Country', axis=1)
print(df)

Handling Missing Data

# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London']
}
df_nan = pd.DataFrame(data_with_nan)
print(df_nan)

# Drop rows with missing values
df_dropped_nan = df_nan.dropna()
print(df_dropped_nan)

# Fill missing values
df_filled_nan = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})
print(df_filled_nan)
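
Before dropping or filling, it often helps to see where the gaps are. The sketch below rebuilds the same sample data and counts missing values per column with isna(); the result variable names are illustrative.

```python
import pandas as pd

df_nan = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London']
})

# Count missing values in each column
missing_per_column = df_nan.isna().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = df_nan[df_nan.isna().any(axis=1)]
print(rows_with_missing)
```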

Grouping and Aggregating Data

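# Group by a column and calculate the mean of the numeric 'Age' column
# (pandas 2.x raises an error if mean() is applied to the text column 'Name')
print(df.groupby('City')['Age'].mean())

# Group by multiple columns and calculate sum
print(df.groupby(['City', 'Name']).sum())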

Merging DataFrames

# Create two DataFrames
df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'City': ['Berlin', 'London']})

# Concatenate DataFrames (rows are stacked; columns missing from one frame become NaN)
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)

# Merge DataFrames (df1 and df2 share no names, so this inner merge returns an empty DataFrame)
df_merge = pd.merge(df1, df2, on='Name', how='inner')
print(df_merge)

Exporting Data

# Export DataFrame to CSV
df.to_csv('output.csv', index=False)

# Export DataFrame to Excel (requires an Excel writer engine such as openpyxl)
df.to_excel('output.xlsx', index=False)

Advanced Pandas Tutorial


Handling Time Series Data

Pandas provides robust support for time series data. Here's how to work with it.

Creating Time Series Data

# Create a date range
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
print(date_range)

# Create a DataFrame with time series data
time_series_data = {
    'Date': date_range,
    'Value': np.random.randn(10)
}
df_time_series = pd.DataFrame(time_series_data)
df_time_series.set_index('Date', inplace=True)
print(df_time_series)

Resampling Time Series Data

# Resample to weekly frequency and calculate the mean
df_resampled = df_time_series.resample('W').mean()
print(df_resampled)

# Resample to monthly frequency and calculate the sum
# ('M' means month-end; pandas 2.2 and later warn about 'M' and prefer the alias 'ME')
df_resampled_monthly = df_time_series.resample('M').sum()
print(df_resampled_monthly)
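
Because the values above are random, the grouping is hard to check by eye. The sketch below uses fixed values 1 to 10 (an assumption made only for illustration) so the weekly bins can be verified by hand. With the default 'W' rule, each bin ends on a Sunday, and 2023-01-01 is a Sunday, so the first bin holds a single day.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
df_fixed = pd.DataFrame({'Value': list(range(1, 11))}, index=dates)

# Weekly bins end on Sunday: Jan 1 alone, then Jan 2-8, then Jan 9-10
weekly_mean = df_fixed.resample('W').mean()
print(weekly_mean)
```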

Working with Categorical Data


# Create a DataFrame with categorical data
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Gender': ['Male', 'Female', 'Male', 'Female']
}
df_categorical = pd.DataFrame(data)

# Convert a column to categorical type
df_categorical['Gender'] = df_categorical['Gender'].astype('category')
print(df_categorical)

# Get the categories and codes
print(df_categorical['Gender'].cat.categories)
print(df_categorical['Gender'].cat.codes)

Pivot Tables

# Create a DataFrame
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [150, 200, 130, 210, 170, 220]
}
df_sales = pd.DataFrame(data)

# Create a pivot table
pivot_table = df_sales.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum')
print(pivot_table)
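
The pivot table sorts its column labels alphabetically, so the months come out as Feb, Jan, Mar. One way to restore calendar order, sketched here with illustrative variable names, is to select the columns in the order wanted; a per-person total is added to show a follow-up aggregation.

```python
import pandas as pd

df_sales = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [150, 200, 130, 210, 170, 220]
})

pivot = df_sales.pivot_table(values='Sales', index='Name',
                             columns='Month', aggfunc='sum')

# Reorder the month columns into calendar order
ordered = pivot[['Jan', 'Feb', 'Mar']]
print(ordered)

# Total sales per person across the three months
totals = ordered.sum(axis=1)
print(totals)
```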

Handling Large Datasets


# Read a large CSV file in chunks
chunk_size = 1000
chunks = pd.read_csv('large_dataset.csv', chunksize=chunk_size)

# Process each chunk (each chunk is a DataFrame with up to chunk_size rows)
for chunk in chunks:
    # Perform operations on the chunk
    print(chunk.shape)
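
A common reason to read in chunks is to compute an aggregate without holding the whole file in memory. The sketch below shows that pattern. To stay runnable it reads a small in-memory CSV through io.StringIO instead of a real file, and the column names id and amount are made up for the example.

```python
import io
import pandas as pd

# Small in-memory stand-in for a large CSV file (hypothetical columns)
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total_amount = 0
row_count = 0

# Each chunk is a DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_amount += chunk['amount'].sum()
    row_count += len(chunk)

print(total_amount, row_count)
```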

Applying Functions

Using apply()

# Create a DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}
df = pd.DataFrame(data)

# Define a function
def add_one(x):
    return x + 1

# Apply the function to each element
# (applymap is deprecated since pandas 2.1; newer versions use df.map(add_one))
print(df.applymap(add_one))

# Apply the function to each column
print(df.apply(lambda x: x + 1))

# Apply the function to each row
print(df.apply(lambda x: x + 1, axis=1))
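
Adding 1 gives the same result whether it is applied by column or by row, so the effect of axis is hidden in the calls above. The sketch below uses functions whose output depends on the direction; the lambda bodies and result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Default axis=0: the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min())
print(col_ranges)

# axis=1: the function receives each row as a Series
row_products = df.apply(lambda row: row['A'] * row['B'], axis=1)
print(row_products)
```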

Joining DataFrames

# Create two DataFrames
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [5, 6, 7, 8]
})

# Inner join
inner_joined = pd.merge(df1, df2, on='key', how='inner')
print(inner_joined)

# Left join
left_joined = pd.merge(df1, df2, on='key', how='left')
print(left_joined)

# Right join
right_joined = pd.merge(df1, df2, on='key', how='right')
print(right_joined)

# Outer join
outer_joined = pd.merge(df1, df2, on='key', how='outer')
print(outer_joined)
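
To see which side each row of a join came from, pd.merge accepts indicator=True, which adds a _merge column. The sketch below rebuilds the two frames and also sets explicit suffixes for the overlapping value column; the suffix strings are an arbitrary choice for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# Outer join with a source indicator and readable column suffixes
outer_flagged = pd.merge(df1, df2, on='key', how='outer',
                         suffixes=('_left', '_right'), indicator=True)
print(outer_flagged)

# Map each key to the side it came from
source_by_key = dict(zip(outer_flagged['key'],
                         outer_flagged['_merge'].astype(str)))
print(source_by_key)
```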

Window Functions

# Create a DataFrame with time series data
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': np.random.randn(10)
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Calculate rolling mean
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)

# Calculate expanding sum
expanding_sum = df['Value'].expanding().sum()
print(expanding_sum)

# Calculate exponentially weighted mean
ewm_mean = df['Value'].ewm(span=3).mean()
print(ewm_mean)
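
With random input the window behaviour is hard to verify by hand. The sketch below repeats the rolling and expanding calls on the fixed values 1 to 5 (chosen only for illustration) and shows how min_periods controls the leading NaN values.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3 needs 3 observations, so the first two results are NaN
rolling_default = values.rolling(window=3).mean()
print(rolling_default)

# min_periods=1 averages whatever is available at the start
rolling_partial = values.rolling(window=3, min_periods=1).mean()
print(rolling_partial)

# The expanding sum is a running total
running_total = values.expanding().sum()
print(running_total)
```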

Handling JSON Data


from io import StringIO

# Create a JSON string
json_str = '''
[
    {"Name": "John", "Age": 28, "City": "New York"},
    {"Name": "Anna", "Age": 24, "City": "Paris"},
    {"Name": "Peter", "Age": 35, "City": "Berlin"}
]
'''

# Read JSON string into DataFrame
# (wrapped in StringIO because passing a raw JSON string is deprecated since pandas 2.1)
df_json = pd.read_json(StringIO(json_str))
print(df_json)

# Export DataFrame to JSON (lines=True writes one JSON object per line)
df_json.to_json('output.json', orient='records', lines=True)

Advanced Indexing with MultiIndex


# Create a MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(df_multi)

# Accessing data in MultiIndex DataFrame
print(df_multi.loc['A'])
print(df_multi.loc[('A', 'one')])
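
The loc calls above select on the outer level first. To select on the inner level, or to turn one level into columns, xs() and unstack() are two options. The following is a minimal sketch on the same data, with illustrative result names.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('first', 'second')
)
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# Cross-section: every row whose 'second' level equals 'one'
ones = df_multi.xs('one', level='second')
print(ones)

# Move the 'second' level into columns to get a wide table
wide = df_multi['value'].unstack('second')
print(wide)
```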

Combining DataFrames with concat and append


# Create DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate DataFrames
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)

# Append DataFrames
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result)
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)

Performance Tips

# Use vectorized operations instead of loops
data = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000)
})

# Inefficient way: Using loops
data['C'] = [x + y for x, y in zip(data['A'], data['B'])]

# Efficient way: Using vectorized operations
data['C'] = data['A'] + data['B']
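
To check the difference on a given machine, both versions can be timed with time.perf_counter. The sketch below uses a smaller frame (100,000 rows, an arbitrary choice) and confirms that the two approaches give the same values; the actual timings vary by hardware and pandas version, so none are quoted here.

```python
import time
import pandas as pd

n = 100_000
data = pd.DataFrame({'A': range(n), 'B': range(n)})

# Time the element-by-element Python loop
start = time.perf_counter()
loop_result = [x + y for x, y in zip(data['A'], data['B'])]
loop_seconds = time.perf_counter() - start

# Time the vectorized column addition
start = time.perf_counter()
vector_result = data['A'] + data['B']
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```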
