Data Science With Python_ From
Acknowledgments
This book is the result of countless hours of research, coding, and
collaboration. I would like to express my deep gratitude to the
Python and data science communities, whose contributions have
made it possible to create this comprehensive guide. Their
dedication to open-source software and knowledge sharing is what
makes Python such a powerful tool for data science.
I would also like to thank my family, friends, and colleagues for their
unwavering support throughout the writing process. Their
encouragement and feedback have been invaluable in bringing this
book to life.
Happy learning!
László Bocsó (Microsoft Certified Trainer)
Table of Contents
Introduction
Chapter 1: The Data Science
Landscape
What is Data Science?
Data science is an interdisciplinary field that combines various
aspects of mathematics, statistics, computer science, and domain
expertise to extract meaningful insights and knowledge from data. It
involves the use of advanced analytical techniques, algorithms, and
scientific methods to process and analyze large volumes of
structured and unstructured data.
2. Healthcare:
Disease prediction and early diagnosis
Personalized treatment recommendations
Drug discovery and development
3. E-commerce:
Recommendation systems
Pricing optimization
Supply chain management
Sentiment analysis
Content recommendation
Network analysis and user behavior prediction
Route optimization
Predictive maintenance
Demand forecasting
6. Scalability
8. Jupyter Notebooks
Python has become the primary language for machine learning and
artificial intelligence research and development, with most cutting-
edge algorithms and models being implemented first in Python.
2. Data Collection
5. Data Modeling
6. Model Evaluation
7. Model Deployment
1. Python Distribution
8. Database Connectors
PySpark: The Python API for Apache Spark, a fast and general
engine for large-scale data processing.
Dask: A flexible library for parallel computing in Python.
10. Time Series Analysis
Getting Started
Remember that you don't need to master all of these tools at once.
Begin with the core libraries (NumPy, Pandas, Matplotlib) and
gradually expand your toolkit as you take on more complex projects.
The key is to practice regularly and apply these tools to real-world
problems or datasets that interest you.
How to Use This Book
This book is designed to guide you through the process of becoming
proficient in data science using Python. To get the most out of this
resource, consider the following approach:
Before diving in, define what you want to achieve by the end of the
book. Are you looking to:
Having clear goals will help you focus on the most relevant sections
and tailor your learning experience.
Run the Code: Type out and execute all code examples in your
own Python environment.
Experiment: Modify the code examples to see how changes
affect the output.
Complete Exercises: Attempt all exercises at the end of each
chapter.
Work on Projects: Apply what you've learned to small projects
or datasets that interest you.
9. Stay Updated
if True:
    print("This is indented")
    if True:
        print("This is further indented")
print("This is not indented")
1.3 Lists
1.4 Tuples
person = {
"name": "Alice",
"age": 30,
"city": "New York"
}
print(person["name"]) # Output: Alice
person["job"] = "Engineer"
1.6 Sets
unique_numbers = {1, 2, 3, 4, 5}
unique_numbers.add(6)
print(unique_numbers) # Output: {1, 2, 3, 4, 5, 6}
For loops in Python can iterate over sequences (like lists, tuples,
dictionaries, sets, or strings).
for i in range(5):
    print(i)
2.4 While Loops
count = 0
while count < 5:
    print(count)
    count += 1
x = 10
if x > 0:
    print("Positive")
elif x < 0:
    print("Negative")
else:
    print("Zero")
3.1 NumPy
import numpy as np
# Create a 1D array
arr1 = np.array([1, 2, 3, 4, 5])
# Create a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Array operations
print(arr1 * 2) # Element-wise multiplication
print(np.sum(arr2)) # Sum of all elements
print(np.mean(arr2, axis=0)) # Mean of each column
3.2 Pandas
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 34, 29, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
})
# Basic operations
print(df.head()) # Display first few rows
print(df['Age'].mean()) # Calculate mean age
print(df[df['Age'] > 30]) # Filter data
3.3 Matplotlib
import matplotlib.pyplot as plt
import numpy as np
# Sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Line plot
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
# Scatter plot
plt.scatter(x, y)
plt.show()
# Bar chart
plt.bar(x, y)
plt.show()
jupyter notebook
3. This will open a new tab in your web browser where you can
create and manage notebooks.
Basic Usage:
While Jupyter Notebooks are great for exploratory data analysis and
sharing results, IDEs provide a more comprehensive environment for
larger projects and software development.
1. PyCharm:
3. Spyder:
Open VS Code
Go to the Extensions view (Ctrl+Shift+X)
Search for "Python"
Install the Python extension by Microsoft
On Windows: myenv\Scripts\activate
On macOS and Linux: source myenv/bin/activate
Conclusion
This chapter has provided a refresher on Python basics, including
syntax, data structures, functions, loops, and conditionals. We've
also introduced key libraries for data science: NumPy for numerical
computing, Pandas for data manipulation, and Matplotlib for data
visualization.
CSV is one of the most common formats for storing tabular data.
Python provides built-in support for working with CSV files through
the csv module, but Pandas offers a more convenient and powerful
interface.
import pandas as pd
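A minimal sketch of reading and writing CSV data with Pandas (the file names are placeholders):
# Read a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Write a DataFrame back to CSV without the index
df.to_csv('output.csv', index=False)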
Excel
Excel files are widely used in business and data analysis. Pandas
provides functions to read and write Excel files, but you'll need to
install additional dependencies like openpyxl or xlrd .
import pandas as pd
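A short sketch, assuming openpyxl is installed and using a placeholder file name:
# Read the first sheet of an Excel workbook
df = pd.read_excel('data.xlsx', sheet_name=0)
# Write a DataFrame to Excel
df.to_excel('output.xlsx', index=False)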
JSON is a popular data format for web applications and APIs. Pandas
can read and write JSON data, and Python's built-in json module
provides lower-level JSON handling.
import pandas as pd
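A minimal sketch with a placeholder file name:
# Read JSON into a DataFrame
df = pd.read_json('data.json')
# Write a DataFrame to JSON, one record per row
df.to_json('output.json', orient='records')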
import pandas as pd
from sqlalchemy import create_engine
# Create a database connection
engine = create_engine('sqlite:///database.db')
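A minimal sketch of reading from and writing to the database (my_table is a placeholder table name):
# Read a SQL query into a DataFrame
df = pd.read_sql('SELECT * FROM my_table', engine)
# Write a DataFrame to a table
df.to_sql('my_table', engine, if_exists='replace', index=False)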
import pandas as pd
Series operations:
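The examples below assume a small labeled Series such as:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])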
# Accessing elements
print(s['a'])     # By label
print(s.iloc[0])  # By position
# Slicing
print(s[1:3])
# Operations
print(s + 2)
print(s[s > 3])
# Apply functions
print(s.apply(lambda x: x * 2))
DataFrame
import pandas as pd
DataFrame operations:
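The examples below assume a small DataFrame such as:
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Peter', 'Linda'],
    'age': [28, 34, 29, 32]
})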
# Accessing columns
print(df['name'])
print(df.age)
# Selecting rows
print(df.loc[0])
print(df.iloc[1:3])
# Filtering
print(df[df['age'] > 30])
# Sorting
print(df.sort_values('age', ascending=False))
Basic statistics
import pandas as pd
import numpy as np
# Specific statistics
print(df.mean())
print(df.median())
print(df.std())
print(df.min())
print(df.max())
# Covariance
print(df.cov())
Data summaries
# Memory usage
print(df.memory_usage())
df = pd.DataFrame({
'category': ['A', 'B', 'A', 'B', 'A', 'B'],
'value1': [1, 2, 3, 4, 5, 6],
'value2': [10, 20, 30, 40, 50, 60]
})
# Multiple aggregations
print(df.groupby('category').agg({'value1': 'mean', 'value2': ['min', 'max']}))
# Custom aggregation: range of each group
def range_func(x):
    return x.max() - x.min()
print(df.groupby('category').agg({'value1': range_func, 'value2': 'sum'}))
Visualization
# Histogram
df['value1'].hist()
plt.title('Histogram of value1')
plt.show()
# Box plot
df.boxplot(column=['value1', 'value2'])
plt.title('Box plot of value1 and value2')
plt.show()
# Scatter plot
plt.scatter(df['value1'], df['value2'])
plt.xlabel('value1')
plt.ylabel('value2')
plt.title('Scatter plot of value1 vs value2')
plt.show()
Handling Missing Data and Basic Data
Cleaning
Real-world datasets often contain missing or inconsistent data.
Pandas provides various tools for identifying and handling these
issues.
import pandas as pd
import numpy as np
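# Illustrative DataFrame with missing values (an assumed example)
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, 4]})
# Count missing values per column
print(df.isnull().sum())
# Drop rows that contain any missing value
df_dropped = df.dropna()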
# Fill null values with the next valid value (backward fill)
df_filled = df.fillna(method='bfill')
# Interpolate missing values
df_interpolated = df.interpolate()
# Remove duplicates
df_unique = df.drop_duplicates()
# Rename columns
df_renamed = df.rename(columns={'A': 'col1', 'B': 'col2'})
# String cleaning
df['C'] = df['C'].str.strip()  # Remove leading/trailing whitespace
df['C'] = df['C'].str.lower()  # Convert to lowercase
# Replace values
df['C'] = df['C'].replace({'a': 'apple', 'b': 'banana'})
# Apply a custom cleaning function
def clean_text(text):
    return str(text).strip().lower()
df['C'] = df['C'].apply(clean_text)
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['C'])
import pandas as pd
import numpy as np
# Cross-section using xs
print(df.xs('one', level='second'))
# Concatenate DataFrames
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
# Join DataFrames
df4 = df1.set_index('key').join(df2.set_index('key'),
how='outer')
# Melt DataFrame
melted = pd.melt(df, id_vars=['A', 'B'], value_vars=['D',
'E'])
# Rolling statistics
rolling_mean = df.rolling(window=3).mean()
# Shift data
shifted = df.shift(1)
df_custom = df.applymap(custom_function)
# Apply a function along an axis
df_sum = df.apply(np.sum, axis=0)
Conclusion
This chapter has covered the essentials of working with data in
Python, focusing on the Pandas library. We've explored how to load
and export data from various formats, introduced the fundamental
data structures in Pandas (Series and DataFrame), and discussed
methods for exploring and summarizing data. We've also covered
techniques for handling missing data and performing basic data
cleaning tasks.
1. Visual Methods:
1. Remove Outliers:
2. Transform Data:
3. Winsorization:
Replace extreme values with a specified percentile of the data.
Preserves the data point while reducing its impact.
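A minimal winsorization sketch using SciPy (the 5th/95th percentile limits are an arbitrary choice):
import numpy as np
from scipy.stats.mstats import winsorize
x = np.array([1, 2, 3, 4, 5, 100])
x_winsorized = winsorize(x, limits=[0.05, 0.05])  # cap the lowest and highest 5%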
3. Regression Imputation:
4. Multiple Imputation:
4.3.1 Normalization
Min-Max Normalization
Where:
Pros:
Cons:
Sensitive to outliers
Decimal Scaling
Pros:
Cons:
4.3.2 Standardization
Z-score Standardization
X_standardized = (X - μ) / σ
Where:
Cons:
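A short sketch of both scaling approaches with scikit-learn (X is an assumed numeric array):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_minmax = MinMaxScaler().fit_transform(X)      # scaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # zero mean, unit variance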
1. Label Encoding
Pros:
Cons:
Example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded = le.fit_transform(['red', 'green', 'blue', 'green',
'red'])
# Output: [2, 1, 0, 1, 2]
2. One-Hot Encoding
Pros:
Cons:
Example:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
encoded = ohe.fit_transform(np.array(['red', 'green', 'blue', 'green', 'red']).reshape(-1, 1))
# Output:
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
3. Binary Encoding
Pros:
Cons:
Example:
from category_encoders import BinaryEncoder
be = BinaryEncoder(cols=['category'])
encoded = be.fit_transform(pd.DataFrame({'category': ['A', 'B', 'C', 'D']}))
4. Target Encoding
Target encoding replaces categories with the mean target value for
that category.
Pros:
Cons:
Example:
from category_encoders import TargetEncoder
te = TargetEncoder(cols=['category'])
encoded = te.fit_transform(X['category'], y)
5. Frequency Encoding
Pros:
Cons:
May not be suitable when frequency doesn't correlate with the
target variable
Example:
freq_map = (df['category'].value_counts() /
len(df)).to_dict()
df['category_freq'] = df['category'].map(freq_map)
Conclusion
Data cleaning and preprocessing are fundamental steps in the data
science pipeline. Properly handling outliers, missing data, variable
scaling, and categorical encoding can significantly impact the
performance and reliability of your analyses and models.
Key takeaways:
Merging DataFrames
Types of Merges
1. Inner Merge: Returns only the rows that have matching values
in both DataFrames.
2. Left Merge: Returns all rows from the left DataFrame and the
matched rows from the right DataFrame.
3. Right Merge: Returns all rows from the right DataFrame and
the matched rows from the left DataFrame.
4. Outer Merge: Returns all rows when there is a match in either
DataFrame.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value':
[1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value':
[5, 6, 7, 8]})
# Inner merge
inner_merge = pd.merge(df1, df2, on='key', how='inner')
# Left merge
left_merge = pd.merge(df1, df2, on='key', how='left')
# Right merge
right_merge = pd.merge(df1, df2, on='key', how='right')
# Outer merge
outer_merge = pd.merge(df1, df2, on='key', how='outer')
Joining DataFrames
# Note: df1 and df2 share column names, so suffixes are required when joining on the index
# Left join
left_join = df1.join(df2, how='left', lsuffix='_left', rsuffix='_right')
# Right join
right_join = df1.join(df2, how='right', lsuffix='_left', rsuffix='_right')
# Inner join
inner_join = df1.join(df2, how='inner', lsuffix='_left', rsuffix='_right')
# Outer join
outer_join = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
Concatenation
# Vertical concatenation
vertical_concat = pd.concat([df1, df2], axis=0)
# Horizontal concatenation
horizontal_concat = pd.concat([df1, df2], axis=1)
# Inner join
inner_concat = pd.concat([df1, df2], axis=0, join='inner')
import pandas as pd
import numpy as np
# Sample data
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02',
'2023-01-02'],
'Product': ['A', 'B', 'A', 'B'],
'Sales': [100, 150, 120, 180]
}
df = pd.DataFrame(data)
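A minimal pivot-table sketch for this data:
pivot = df.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')
print(pivot)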
Cross-Tabulation
# Sample data
data = {
'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'Product': ['A', 'B', 'A', 'A', 'B', 'B']
}
df = pd.DataFrame(data)
# Create a crosstab
crosstab = pd.crosstab(df['Gender'], df['Product'])
Advanced Cross-Tabulation
# Multiple variables (this example assumes the DataFrame also contains an 'Age' column)
crosstab_multi = pd.crosstab([df['Gender'], df['Age']], df['Product'])
# Normalization
crosstab_norm = pd.crosstab(df['Gender'], df['Product'], normalize='index')
# Sample data
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
# Group by category
grouped = df.groupby('Category')
data = {
'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'Value': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
Aggregating Data
grouped = df.groupby('Category')
def custom_agg(x):
    return x.max() - x.min()
group_custom = grouped['Value'].agg(custom_agg)
Transforming Data
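A short example (using the grouped object from above) that standardizes values within each group:
group_standardized = grouped['Value'].transform(lambda x: (x - x.mean()) / x.std())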
Filtering Groups
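A minimal sketch that keeps only groups whose total value exceeds a threshold (100 is an arbitrary cutoff):
filtered = df.groupby('Category').filter(lambda g: g['Value'].sum() > 100)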
import pandas as pd
Rolling Windows
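A minimal rolling-window sketch, assuming a numeric time-indexed Series:
import numpy as np
ts = pd.Series(np.random.randn(100), index=pd.date_range('2023-01-01', periods=100))
rolling_mean = ts.rolling(window=7).mean()
rolling_std = ts.rolling(window=7).std()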
Seasonal Decomposition
For time series with seasonal patterns, you can use seasonal
decomposition to separate the trend, seasonal, and residual
components:
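A minimal sketch of the decomposition step itself (ts is an assumed series with at least two full seasonal cycles; period=12 suits monthly data):
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(ts, model='additive', period=12)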
trend = result.trend
seasonal = result.seasonal
residual = result.resid
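The forecast below assumes a fitted statsmodels model; a minimal ARIMA fit might look like this:
from statsmodels.tsa.arima.model import ARIMA
fitted_model = ARIMA(ts, order=(1, 1, 1)).fit()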
# Make predictions
forecast = fitted_model.forecast(steps=12)
Conclusion
This chapter has covered essential techniques for merging,
reshaping, and analyzing data in pandas. We've explored methods
for combining DataFrames through merging, joining, and
concatenation, as well as reshaping data using pivot tables and
cross-tabulation. We've also delved into grouping and aggregating
data for analysis, which is crucial for extracting insights from large
datasets.
1. Mathematical Transformations
import numpy as np
df['log_income'] = np.log(df['income'] + 1)
2. Aggregation Features
# Lag features
df['sales_lag_1'] = df.groupby('store_id')['sales'].shift(1)
df['sales_lag_7'] = df.groupby('store_id')['sales'].shift(7)
# Rolling statistics
df['sales_rolling_mean_7'] = df.groupby('store_id')['sales'].rolling(window=7).mean().reset_index(0, drop=True)
4. Categorical Encoding
# One-hot encoding
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(sparse=False)
X_onehot = onehot.fit_transform(X[['category']])
# Label encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['category_encoded'] = le.fit_transform(X['category'])
# Target encoding
target_means = X.groupby('category')['target'].mean()
X['category_target_encoded'] = X['category'].map(target_means)
# Frequency encoding
category_counts = X['category'].value_counts(normalize=True)
X['category_freq_encoded'] = X['category'].map(category_counts)
5. Text-based Features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Bag of words
cv = CountVectorizer()
X_bow = cv.fit_transform(texts)
# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)
# Average word embeddings (assumes a pre-trained word2vec_model, e.g. loaded with gensim)
def text_to_vec(text):
    words = text.lower().split()
    return np.mean([word2vec_model[w] for w in words if w in word2vec_model], axis=0)
X['text_embedding'] = X['text'].apply(text_to_vec)
# Text statistics
X['word_count'] = X['text'].apply(lambda x: len(x.split()))
X['char_count'] = X['text'].apply(len)
X['avg_word_length'] = X['char_count'] / X['word_count']
6. Domain-specific Features
# RFM features (assumes current_date is defined, e.g. current_date = df['order_date'].max())
rfm = df.groupby('customer_id').agg({
    'order_date': lambda x: (current_date - x.max()).days,  # Recency
    'order_id': 'count',                                    # Frequency
    'order_amount': 'sum'                                   # Monetary
})
1. Filter methods
2. Wrapper methods
3. Embedded methods
1. Filter Methods
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
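The helper called below is not shown here; a minimal correlation-based selector might look like this (illustrative only):
def correlation_feature_selection(X, y, threshold=0.5):
    selected = []
    for col in X.columns:
        corr, _ = pearsonr(X[col], y)
        if abs(corr) >= threshold:
            selected.append(col)
    return selected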
selected_features = correlation_feature_selection(X, y,
threshold=0.5)
Mutual Information
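A minimal mutual-information selector sketch using scikit-learn (it keeps the k highest-scoring features):
from sklearn.feature_selection import mutual_info_classif

def mutual_info_feature_selection(X, y, k=10):
    scores = mutual_info_classif(X, y)
    top_idx = scores.argsort()[::-1][:k]
    return X.columns[top_idx].tolist()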
selected_features = mutual_info_feature_selection(X, y,
k=10)
Chi-squared Test
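A minimal chi-squared selector sketch with SelectKBest (the test requires non-negative features):
from sklearn.feature_selection import SelectKBest, chi2

def chi_squared_feature_selection(X, y, k=10):
    selector = SelectKBest(chi2, k=k)
    selector.fit(X, y)
    return X.columns[selector.get_support()].tolist()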
selected_features = chi_squared_feature_selection(X, y,
k=10)
2. Wrapper Methods
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def recursive_feature_elimination(X, y, n_features_to_select=10):
    model = LogisticRegression()
    rfe = RFE(estimator=model, n_features_to_select=n_features_to_select)
    rfe.fit(X, y)
    selected_features = X.columns[rfe.support_].tolist()
    return selected_features

selected_features = recursive_feature_elimination(X, y, n_features_to_select=10)
from sklearn.model_selection import cross_val_score
def forward_selection(X, y, k=10):  # reconstructed sketch of forward selection
    features, remaining = [], list(X.columns)
    for i in range(k):
        best_score, best_feature = 0, None
        for f in remaining:
            score = cross_val_score(LogisticRegression(), X[features + [f]], y, cv=5).mean()
            if score > best_score:
                best_score, best_feature = score, f
        features.append(best_feature)
        remaining.remove(best_feature)
    return features
3. Embedded Methods
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

def random_forest_feature_selection(X, y, n_features_to_select=10):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    selector = SelectFromModel(rf, max_features=n_features_to_select, prefit=False)
    selector.fit(X, y)
    selected_features = X.columns[selector.get_support()].tolist()
    return selected_features

selected_features = random_forest_feature_selection(X, y, n_features_to_select=10)
df['is_holiday_season'] = df['date'].apply(is_holiday_season)
df['days_to_christmas'] = (pd.to_datetime('2023-12-25') - df['date']).dt.days
customer_aggregations.columns = ['avg_purchase', 'total_spend', 'purchase_count', 'first_purchase', 'last_purchase']
customer_aggregations['customer_lifetime'] = (customer_aggregations['last_purchase'] - customer_aggregations['first_purchase']).dt.days
customer_aggregations['purchase_frequency'] = customer_aggregations['purchase_count'] / customer_aggregations['customer_lifetime']
df = df.merge(customer_aggregations, left_on='customer_id', right_index=True, how='left')
9. Incorporate Domain-specific Thresholds or
Benchmarks
df['credit_score_category'] = df['credit_score'].apply(credit_score_category)
seasonal_patterns = df.groupby(['product_category', 'month'])['sales'].mean().unstack()
df = df.merge(seasonal_patterns, left_on=['product_category', 'month'], right_index=True, suffixes=('', '_seasonal_avg'))
Supports various plot types (line plots, scatter plots, bar plots,
etc.)
Allows for multiple plots in a single figure
Customizable axes, labels, titles, and other plot elements
Supports both object-oriented and pyplot interfaces
Can be used with various GUI toolkits (e.g., Qt, Gtk, Tk)
Seaborn
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(0, 1, 1000)
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black')
plt.title('Histogram of Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Creating a histogram with Seaborn:
import seaborn as sns
# Create histogram
plt.figure(figsize=(10, 6))
sns.histplot(data, kde=True)
plt.title('Histogram of Normal Distribution with KDE')
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
Boxplots
# Create boxplot
plt.figure(figsize=(10, 6))
plt.boxplot(data)
plt.title('Boxplot of Multiple Distributions')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()
# Create boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=data)
plt.title('Boxplot of Multiple Distributions')
plt.ylabel('Value')
plt.show()
Violin Plots
# Calculate KDE
kde = stats.gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
y = kde(x)
Scatter plots are useful for visualizing the relationship between two
continuous variables.
Pair plots
Pair plots create a grid of scatter plots for all pairs of variables in a
dataset, along with histograms or KDE plots on the diagonal.
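A minimal pair-plot sketch, assuming a DataFrame df with a few numeric columns:
sns.pairplot(df)
plt.show()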
Heatmaps
3D scatter plots
Contour plots
# Calculate KDE
xy = np.vstack([x, y])
kde = stats.gaussian_kde(xy)
import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import matplotlib.pyplot as plt
plt.tight_layout()
plt.show()
Interactive visualizations
import plotly.graph_objects as go
import numpy as np
To save plots for later use or inclusion in reports, you can use
Matplotlib's savefig function:
# Save plot
plt.savefig('histogram.png', dpi=300, bbox_inches='tight')
plt.close()
Conclusion
Visualizing data distributions is a crucial skill for any data scientist or
machine learning practitioner. It allows you to gain insights into the
underlying patterns and relationships in your data, which can inform
your analysis and modeling decisions. In this chapter, we've covered
a wide range of visualization techniques using Matplotlib and
Seaborn, from basic histograms and boxplots to more advanced
techniques for visualizing multivariate distributions.
Key takeaways:
import numpy as np
Joint Plots
# Compute the correlation matrix (assumes a numeric DataFrame df)
corr_matrix = df.corr()
# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()
import numpy as np
Bar Plots
# Grouped bar plot (sample scores for illustration)
categories = ['A', 'B', 'C', 'D']
scores1 = [20, 34, 30, 35]
scores2 = [25, 32, 34, 20]
x = np.arange(len(categories))
width = 0.35
fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(x - width/2, scores1, width, label='Group 1')
ax.bar(x + width/2, scores2, width, label='Group 2')
ax.set_xlabel('Groups')
ax.set_ylabel('Scores')
ax.set_title('Grouped Bar Plot')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()
plt.show()
Count Plots
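A minimal count-plot sketch, assuming a DataFrame df with a categorical 'category' column:
sns.countplot(x='category', data=df)
plt.title('Count Plot')
plt.show()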
1. Bubble Plots
# Sample data (marker size encodes a third variable)
x = np.random.rand(50)
y = np.random.rand(50)
sizes = np.random.rand(50) * 1000
plt.figure(figsize=(10, 6))
plt.scatter(x, y, s=sizes, alpha=0.5)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Bubble Plot')
plt.show()
2. Parallel Coordinates
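A minimal parallel-coordinates sketch using pandas' plotting helper, assuming a DataFrame df with numeric columns and a 'class' label column:
from pandas.plotting import parallel_coordinates
parallel_coordinates(df, 'class')
plt.show()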
4. Radar Charts
Radar charts (also known as spider charts) are useful for comparing
multiple quantitative variables.
Conclusion
Understanding relationships in data is a crucial skill for any data
scientist or analyst. The techniques and visualizations discussed in
this chapter provide a solid foundation for exploring and interpreting
connections between variables in your datasets.
Remember that while these tools are powerful, they should be used
in conjunction with domain knowledge and critical thinking. Always
consider the context of your data and be cautious about drawing
causal conclusions from correlational evidence.
Line Charts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Example data
dates = pd.date_range(start='2022-01-01', end='2022-12-31', freq='D')
values = np.random.randn(len(dates)).cumsum()
plt.figure(figsize=(12, 6))
plt.plot(dates, values)
plt.title('Time Series Line Chart')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Line charts are excellent for showing overall trends and patterns in
the data. They can reveal:
Area Charts
Area charts are similar to line charts but fill the area between the
line and the x-axis. They're useful for visualizing cumulative totals or
comparing multiple series.
plt.figure(figsize=(12, 6))
plt.fill_between(dates, values1, alpha=0.5, label='Series
1')
plt.fill_between(dates, values2, alpha=0.5, label='Series
2')
plt.title('Time Series Area Chart')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
Heatmaps
Heatmaps are useful for visualizing time series data with multiple
dimensions, such as hourly data over several days or weeks.
# Create a DataFrame
df = pd.DataFrame(data, index=dates, columns=hours)
# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df, cmap='YlOrRd')
plt.title('Time Series Heatmap')
plt.xlabel('Hour of Day')
plt.ylabel('Date')
plt.show()
Interactive Visualizations
import plotly.graph_objects as go
import pandas as pd
import numpy as np
dates = pd.date_range(start='2022-01-01', end='2022-12-31',
freq='D')
values = np.random.randn(len(dates)).cumsum()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
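# Calculate simple moving averages (sketch: 7- and 30-day windows are assumed)
df['SMA7'] = df['Value'].rolling(window=7).mean()
df['SMA30'] = df['Value'].rolling(window=30).mean()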
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.plot(df.index, df['SMA7'], label='7-day SMA')
plt.plot(df.index, df['SMA30'], label='30-day SMA')
plt.title('Time Series with Simple Moving Averages')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
# Calculate EMAs
df['EMA7'] = df['Value'].ewm(span=7, adjust=False).mean()
df['EMA30'] = df['Value'].ewm(span=30, adjust=False).mean()
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.plot(df.index, df['EMA7'], label='7-day EMA')
plt.plot(df.index, df['EMA30'], label='30-day EMA')
plt.title('Time Series with Exponential Moving Averages')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
EMAs are often used in financial analysis as they react more quickly
to recent price changes.
Trend Lines
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df['Days'] = (df['Date'] - df['Date'].min()).dt.days
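# Fit a linear trend (sketch): regress Value on elapsed days
model = LinearRegression()
model.fit(df[['Days']], df['Value'])
df['Trend'] = model.predict(df[['Days']])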
# Plot
plt.figure(figsize=(12, 6))
plt.scatter(df['Date'], df['Value'], alpha=0.5, label='Original')
plt.plot(df['Date'], df['Trend'], color='red', label='Trend Line')
plt.title('Time Series with Linear Trend Line')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
Additive Decomposition
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
# Perform decomposition
result = seasonal_decompose(df['Value'], model='additive',
period=365)
# Plot
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12,
16))
result.observed.plot(ax=ax1)
ax1.set_title('Observed')
result.trend.plot(ax=ax2)
ax2.set_title('Trend')
result.seasonal.plot(ax=ax3)
ax3.set_title('Seasonal')
result.resid.plot(ax=ax4)
ax4.set_title('Residual')
plt.tight_layout()
plt.show()
Multiplicative Decomposition
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
import numpy as np
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
# Perform decomposition
result = seasonal_decompose(df['Value'],
model='multiplicative', period=365)
# Plot
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(12,
16))
result.observed.plot(ax=ax1)
ax1.set_title('Observed')
result.trend.plot(ax=ax2)
ax2.set_title('Trend')
result.seasonal.plot(ax=ax3)
ax3.set_title('Seasonal')
result.resid.plot(ax=ax4)
ax4.set_title('Residual')
plt.tight_layout()
plt.show()
Seasonal Adjustment
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
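# Sketch: build the series, decompose it, and subtract the seasonal component
# (assumes dates/values as in the earlier examples)
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
result = seasonal_decompose(df['Value'], model='additive', period=365)
df['Seasonally_Adjusted'] = df['Value'] - result.seasonal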
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.plot(df.index, df['Seasonally_Adjusted'],
label='Seasonally Adjusted')
plt.title('Original vs Seasonally Adjusted Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
# Identify anomalies
threshold = 3
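mean = df['Value'].mean()  # sketch: mean and standard deviation of the whole series
std = df['Value'].std()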
df['Anomaly'] = df['Value'].apply(lambda x: abs(x - mean) >
threshold * std)
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.scatter(df[df['Anomaly']].index, df[df['Anomaly']]
['Value'], color='red', label='Anomaly')
plt.title('Time Series with Anomalies')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
This method is simple and works well for normally distributed data,
but it may not be suitable for all types of time series, especially
those with strong trends or seasonality.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
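# Sketch: compare each point to a 7-day moving average
df['MA'] = df['Value'].rolling(window=7, center=True).mean()
df['Diff'] = df['Value'] - df['MA']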
# Identify anomalies
threshold = 3 * df['Diff'].std()
df['Anomaly'] = df['Diff'].abs() > threshold
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.plot(df.index, df['MA'], label='Moving Average')
plt.scatter(df[df['Anomaly']].index, df[df['Anomaly']]['Value'], color='red', label='Anomaly')
plt.title('Time Series with Anomalies (Moving Average Method)')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
This method can adapt to local trends in the data, making it more
robust than the simple statistical method for many types of time
series.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import IsolationForest
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
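# Sketch: fit an Isolation Forest on the values (the contamination rate is an assumption)
iso_forest = IsolationForest(contamination=0.01, random_state=42)
df['Anomaly'] = iso_forest.fit_predict(df[['Value']])  # -1 marks anomalies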
# Plot
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], label='Original')
plt.scatter(df[df['Anomaly'] == -1].index, df[df['Anomaly'] == -1]['Value'], color='red', label='Anomaly')
plt.title('Time Series with Anomalies (Isolation Forest Method)')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
For time series with strong seasonal patterns, we can use seasonal
decomposition to identify anomalies. After decomposing the series,
we can look for anomalies in the residual component.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
# Create DataFrame
df = pd.DataFrame({'Date': dates, 'Value': values})
df.set_index('Date', inplace=True)
# Perform decomposition
result = seasonal_decompose(df['Value'], model='additive',
period=365)
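# Sketch: flag points whose residual exceeds 3 standard deviations of the residuals
residuals = result.resid.dropna()
threshold = 3 * residuals.std()
df['Anomaly'] = result.resid.abs() > threshold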
# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))
result.observed.plot(ax=ax1)
ax1.scatter(df[df['Anomaly']].index, df[df['Anomaly']]
['Value'], color='red', label='Anomaly')
ax1.set_title('Original Time Series with Anomalies')
ax1.legend()
result.resid.plot(ax=ax2)
ax2.set_title('Residuals')
ax2.axhline(y=threshold, color='r', linestyle='--',
label='Threshold')
ax2.axhline(y=-threshold, color='r', linestyle='--')
ax2.legend()
plt.tight_layout()
plt.show()
1. Bernoulli Distribution
Applications:
Applications:
3. Poisson Distribution
Applications:
Applications:
2. Exponential Distribution
Applications:
3. Uniform Distribution
Applications:
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
Sampling Methods
Advantages:
Easy to implement
Unbiased
Disadvantages:
Example in Python:
import numpy as np
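# Sketch: draw 100 units at random, without replacement, from a population of 10,000
population = np.arange(10000)
sample = np.random.choice(population, size=100, replace=False)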
2. Stratified Sampling
Advantages:
Disadvantages:
Example in Python:
import pandas as pd
import numpy as np
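# Sketch: sample 10% from each stratum of an illustrative DataFrame
df = pd.DataFrame({'Group': np.random.choice(['A', 'B', 'C'], 1000), 'Value': np.random.randn(1000)})
stratified_sample = df.groupby('Group', group_keys=False).apply(lambda g: g.sample(frac=0.1))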
3. Cluster Sampling
Advantages:
Disadvantages:
Example in Python:
import numpy as np
import pandas as pd
# Create a sample dataset with clusters
clusters = pd.DataFrame({
'Cluster_ID': range(100),
'Region': np.random.choice(['North', 'South', 'East',
'West'], 100)
})
individuals = pd.DataFrame({
'ID': range(10000),
'Cluster_ID': np.random.randint(0, 100, 10000),
'Age': np.random.randint(18, 80, 10000)
})
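# Sketch: randomly select 10 whole clusters, then keep every individual in them
sampled_ids = np.random.choice(clusters['Cluster_ID'], size=10, replace=False)
cluster_sample = individuals[individuals['Cluster_ID'].isin(sampled_ids)]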
Key points:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
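# Sketch: draw repeated samples from a skewed population and record their means
population = np.random.exponential(scale=2, size=100_000)
size = 30  # sample size (an illustrative choice)
sample_means = [np.mean(np.random.choice(population, size)) for _ in range(1000)]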
plt.figure(figsize=(10, 4))
plt.hist(sample_means, bins=50, density=True, alpha=0.7)
plt.title(f"Distribution of Sample Means (n = {size})")
plt.xlabel("Sample Mean")
plt.ylabel("Density")
plt.show()
1. One-sample t-test
Example in Python:
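A minimal sketch with illustrative data (testing against a hypothesized mean of 50):
import numpy as np
from scipy import stats
sample = np.random.normal(loc=52, scale=10, size=100)
t_statistic, p_value = stats.ttest_1samp(sample, popmean=50)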
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Null hypothesis (H₀): The means of the two groups are equal.
Example in Python:
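A minimal sketch of the independent two-sample test on illustrative data:
group1 = np.random.normal(loc=50, scale=10, size=100)
group2 = np.random.normal(loc=53, scale=10, size=100)
t_statistic, p_value = stats.ttest_ind(group1, group2)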
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Example in Python:
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Chi-square Tests
Chi-square tests are used to analyze categorical data and test for
independence or goodness of fit.
Example in Python:
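A minimal sketch of a chi-square test of independence on an illustrative contingency table:
import numpy as np
from scipy import stats
observed = np.array([[30, 10], [20, 40]])
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(observed)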
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Example in Python:
# Interpret results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
Confidence Intervals
Key points:
1. The width of the interval indicates the precision of the estimate.
2. The confidence level (e.g., 95%) represents the long-run
frequency with which the interval will contain the true
parameter if the sampling process is repeated.
CI = x̄ ± (z * (σ / √n))
Where:
CI = x̄ ± (t * (s / √n))
Where:
Example in Python:
import numpy as np
from scipy import stats
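# Sketch: 95% confidence interval for a sample mean using the t-distribution
sample = np.random.normal(loc=100, scale=15, size=50)  # illustrative data
mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")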
P-values
Key points:
1. A small p-value (typically < 0.05) suggests strong evidence
against the null hypothesis.
2. P-values do not measure the probability that the null hypothesis
is true or false.
3. P-values should be interpreted in conjunction with effect sizes
and practical significance.
Interpreting P-values
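A minimal two-sample test whose p-value is interpreted below (the data are illustrative):
import numpy as np
from scipy import stats
group1 = np.random.normal(50, 10, 100)
group2 = np.random.normal(54, 10, 100)
t_statistic, p_value = stats.ttest_ind(group1, group2)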
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
# Interpret p-value
alpha = 0.05
if p_value < alpha:
print("Reject the null hypothesis")
print("There is strong evidence of a significant
difference between the two groups")
else:
print("Fail to reject the null hypothesis")
print("There is not enough evidence to conclude a
significant difference between the two groups")
Confidence level = 1 - α
import numpy as np
from scipy import stats
confidence_level = 0.95
sample_size = 30  # illustrative sample size (assumed)
degrees_of_freedom = sample_size - 1
t_value = stats.t.ppf((1 + confidence_level) / 2, degrees_of_freedom)
Y = β₀ + β₁X + ε
Where:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
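# Sketch: illustrative data and model fit for the evaluation code below
from sklearn.metrics import r2_score, mean_squared_error
X = np.random.rand(100, 1) * 10
Y = 2.5 * X.ravel() + np.random.randn(100) * 2
model = LinearRegression()
model.fit(X, Y)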
# Make predictions
Y_pred = model.predict(X)
# Calculate R-squared
r2 = r2_score(Y, Y_pred)
# Calculate MSE
mse = mean_squared_error(Y, Y_pred)
# Calculate RMSE
rmse = np.sqrt(mse)
print(f"R-squared: {r2}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
Where:
print(f"R-squared: {r2}")
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
# Print the coefficients
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
plt.scatter(Y_pred, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()
Q-Q Plot
fig, ax = plt.subplots()
stats.probplot(residuals, dist="norm", plot=ax)
ax.set_title("Q-Q Plot")
plt.show()
Scale-Location Plot
import numpy as np
plt.scatter(Y_pred, sqrt_standardized_residuals)
plt.xlabel('Predicted Values')
plt.ylabel('√|Standardized Residuals|')
plt.title('Scale-Location Plot')
plt.show()
If the assumption is met, there should be no clear pattern in the
plot.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
plt.tight_layout()
plt.show()
Overfitting occurs when a model learns the training data too well,
including its noise and fluctuations, leading to poor generalization on
new, unseen data. Polynomial regression is particularly susceptible
to overfitting when using high-degree polynomials.
To address overfitting:
from sklearn.model_selection import cross_val_score
max_degree = 15
mean_scores = []
std_scores = []
for degree in range(1, max_degree + 1):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    mean_scores.append(scores.mean())
    std_scores.append(scores.std())
optimal_degree = np.argmin(mean_scores) + 1
print(f"Optimal polynomial degree: {optimal_degree}")
This code uses cross-validation to find the optimal polynomial degree
that minimizes the mean squared error.
Regularization Techniques
Regularization is a technique used to prevent overfitting by adding a
penalty term to the loss function. The two most common
regularization methods for linear regression are:
Ridge Regression
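A minimal ridge fit (alpha=1.0 is an arbitrary regularization strength; X, y are assumed training data):
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)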
# Print coefficients
print("Ridge coefficients:")
for i, coef in enumerate(ridge.coef_):
    print(f"Feature {i}: {coef}")
Lasso Regression
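A matching lasso sketch under the same assumptions:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)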
# Print coefficients
print("Lasso coefficients:")
for i, coef in enumerate(lasso.coef_):
    print(f"Feature {i}: {coef}")
Elastic Net
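A short elastic net sketch (l1_ratio=0.5 is an arbitrary mix of the two penalties):
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)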
# Print coefficients
print("Elastic Net coefficients:")
for i, coef in enumerate(elastic_net.coef_):
    print(f"Feature {i}: {coef}")
Conclusion
Regression analysis is a powerful tool in data science for
understanding relationships between variables and making
predictions. This chapter covered the basics of linear regression,
multiple linear regression, diagnostic plots, polynomial regression,
and regularization techniques. By mastering these concepts and their
implementation in Python, data scientists can effectively model and
analyze various types of data, making informed decisions and
predictions based on their findings.
σ(z) = 1 / (1 + e^(-z))
Where:
Where b0 is the bias term, and b1 to bn are the coefficients for the
input features x1 to xn.
The cost function used in logistic regression is the log loss (also
known as cross-entropy loss):
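In the notation above, for m training examples with true labels y and predicted probabilities ŷ:
J = -(1/m) Σ [y·log(ŷ) + (1 - y)·log(1 - ŷ)]
A minimal scikit-learn fit for the prediction step that follows (X_train, y_train, X_test are assumed splits):
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)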
# Make predictions
y_pred = model.predict(X_test)
ROC Curve
Where:
Limitations of ROC-AUC
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Assuming y_true are the true labels and y_scores are the predicted probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
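# Sketch: plot the ROC curve together with its AUC
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()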
                  Predicted
                  Pos   Neg
Actual   Pos      TP    FN
         Neg      FP    TN
Where:
from sklearn.metrics import confusion_matrix
# Assuming y_true are the true labels and y_pred are the predicted labels
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
2. One-vs-One (OvO):
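A minimal one-vs-one sketch (X_train, y_train, X_test are assumed splits):
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
classifier = OneVsOneClassifier(SVC())
classifier.fit(X_train, y_train)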
# Make predictions
y_pred = classifier.predict(X_test)
Conclusion
The core idea behind machine learning is to create models that can
recognize patterns in data and use these patterns to make
predictions or decisions. This approach is particularly useful when
dealing with complex problems where traditional rule-based
programming would be impractical or impossible.
Supervised Learning
1. Linear Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Support Vector Machines (SVM)
6. Neural Networks
Unsupervised Learning
1. K-means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis (PCA)
4. t-SNE (t-Distributed Stochastic Neighbor Embedding)
5. Autoencoders
6. Gaussian Mixture Models
Semi-Supervised Learning
Speech analysis
Internet content classification
Protein sequence classification
Definitions
The Tradeoff
The goal in machine learning is to find the sweet spot that minimizes
both bias and variance. However, there's often a tradeoff between
the two:
As model complexity increases, bias tends to decrease (the
model can fit the data better), but variance tends to increase
(the model becomes more sensitive to changes in the training
data).
As model complexity decreases, bias tends to increase (the
model may be too simple to capture the true relationship), but
variance tends to decrease (the model is more stable across
different training sets).
Practical Implications
Linear Regression
Use Cases:
Logistic Regression
Key Characteristics:
Use Cases:
Decision Trees
Decision trees are versatile algorithms that can be used for both
classification and regression tasks. They make decisions based on a
series of questions about the input features.
Key Characteristics:
Use Cases:
Random Forests
Key Characteristics:
Use Cases:
Image classification
Predicting stock market trends
SVMs are powerful algorithms that find the hyperplane that best
separates different classes in high-dimensional space.
Key Characteristics:
Use Cases:
Text classification
Image recognition
Key Characteristics:
Use Cases:
Recommendation systems
Pattern recognition
Naive Bayes
Use Cases:
Spam detection
Sentiment analysis
Key Characteristics:
Use Cases:
Use Cases:
K-means Clustering
Key Characteristics:
Use Cases:
Customer segmentation
Image compression
Use Cases:
Feature extraction
Noise reduction in data
Conclusion
This chapter has provided an introduction to the fundamental
concepts of machine learning, including the distinction between
supervised and unsupervised learning, the importance of
understanding the bias-variance tradeoff, and an overview of popular
machine learning algorithms.
As you delve deeper into the field of data science and machine
learning, you'll encounter these concepts and algorithms repeatedly.
Understanding their strengths, weaknesses, and appropriate use
cases will be crucial in selecting the right approach for any given
problem.
Remember that while algorithms are important, they are just one
part of the machine learning pipeline. Equally important are data
preparation, feature engineering, model evaluation, and
interpretation of results. As you continue your journey in data
science, focus on developing a holistic understanding of the entire
machine learning process, from problem formulation to deployment
and monitoring of models in production environments.
In the following chapters, we'll dive deeper into specific algorithms,
exploring their mathematical foundations, implementation details,
and practical applications in real-world scenarios. We'll also cover
advanced topics such as ensemble methods, neural network
architectures, and techniques for handling large-scale data and
complex problems.
Chapter 14: Supervised
Learning with Scikit-Learn
Introduction
Supervised learning is a fundamental concept in machine learning
where models are trained on labeled data to make predictions or
classifications on new, unseen data. Scikit-Learn, a popular Python
library for machine learning, provides a wide range of tools and
algorithms for implementing supervised learning models. In this
chapter, we'll explore various supervised learning techniques using
Scikit-Learn, focusing on regression models, decision trees, random
forests, and support vector machines. We'll also cover essential
model evaluation techniques such as cross-validation and grid
search.
Linear Regression
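A minimal sketch of the split and fit that precede the prediction step below (X, y are assumed features and target):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)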
# Make predictions
y_pred = model.predict(X_test)
Polynomial Regression
Ridge Regression
Lasso Regression
Decision Trees
Random Forests
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
SVM for Regression
Cross-Validation
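A short cross-validation sketch (model, X, and y are assumed from the earlier examples):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV score: {scores.mean():.3f} (+/- {scores.std():.3f})")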
Grid Search
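A minimal grid-search sketch over a couple of random-forest hyperparameters (the grid values are illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)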
model = RandomForestRegressor(n_estimators=100,
random_state=42)
model.fit(X_train_selected, y_train)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
Class Weighting
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
Ensemble Methods
Ensemble methods combine multiple models to create a more robust
and accurate predictor. Scikit-Learn offers several ensemble methods
in addition to Random Forests.
Gradient Boosting
Voting Classifier
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)
Model Interpretation
Understanding how a model makes predictions is crucial for building
trust in the model and gaining insights from the data. Several
techniques can be used to interpret machine learning models.
import shap
# Assuming 'model' is your trained model and 'X_test' is your test data
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
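# Sketch: visualize overall feature importance from the SHAP values
shap.summary_plot(shap_values, X_test)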
Conclusion
This chapter has covered a wide range of supervised learning
techniques using Scikit-Learn, from basic regression models to
advanced ensemble methods. We've explored various model
evaluation techniques, methods for handling imbalanced datasets,
and approaches for model interpretation. By mastering these tools
and techniques, you'll be well-equipped to tackle a variety of
machine learning problems and extract valuable insights from your
data.
K-Means Clustering
Algorithm Steps:
Advantages of K-Means:
Disadvantages of K-Means:
Implementation in Python:
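A minimal K-Means sketch (X is an assumed feature matrix; three clusters is an arbitrary choice):
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)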
Hierarchical Clustering
Implementation in Python:
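A minimal setup for the dendrogram below (Ward linkage on an assumed feature matrix X):
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X, method='ward')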
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Algorithm Steps:
Advantages of PCA:
Disadvantages of PCA:
Algorithm Steps:
Advantages of t-SNE:
Disadvantages of t-SNE:
Isolation Forest
Algorithm Steps:
Implementation in Python:
Algorithm Steps:
Advantages of LOF:
Disadvantages of LOF:
One-Class SVM
Algorithm Steps:
Implementation in Python:
# Clustering: K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)
# Plotting functions
def plot_clusters(X, labels, title):
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.colorbar(scatter)
    plt.title(title)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.show()

def plot_anomalies(X, labels, title):
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    plt.colorbar(scatter)
    plt.title(title)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.show()
# Plot results
plot_clusters(X_pca, kmeans_labels, 'K-Means Clustering (PCA)')
plot_clusters(X_pca, hc_labels, 'Hierarchical Clustering (PCA)')
plot_clusters(X_tsne, kmeans_labels, 'K-Means Clustering (t-SNE)')
plot_clusters(X_tsne, hc_labels, 'Hierarchical Clustering (t-SNE)')
3. Dimensionality Reduction:
4. Anomaly Detection:
5. Visualization:
The clustering results are visualized using both PCA and t-SNE
projections.
Anomaly detection results are plotted using the PCA projection.
Conclusion
Unsupervised learning techniques offer powerful tools for exploring
and analyzing data without the need for labeled examples. This
chapter covered key concepts and algorithms in clustering,
dimensionality reduction, and anomaly detection, along with their
implementation in Python using scikit-learn.
Key takeaways:
plt.legend()
plt.title('Custom Color Palette Example')
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(x, y)
plt.title(r'$\sin(x)$ Function', fontsize=16)
plt.xlabel(r'$x$ (radians)', fontsize=14)
plt.ylabel(r'$\sin(x)$', fontsize=14)
plt.text(0, 0.5, r'$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$',
         fontsize=16, bbox=dict(facecolor='white', alpha=0.8))
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(x, y1, linestyle='--', marker='o', markersize=8,
markerfacecolor='white', markeredgecolor='blue',
markeredgewidth=2, label='sin(x)')
plt.plot(x, y2, linestyle='-.', marker='^', markersize=8,
markerfacecolor='red', markeredgecolor='black',
markeredgewidth=2, label='cos(x)')
plt.legend()
plt.title('Custom Markers and Line Styles')
plt.show()
# Create FacetGrid
g = sns.FacetGrid(data, col='category', row='subcategory',
height=4, aspect=1.2)
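# Sketch: draw a plot on each facet (assumes numeric columns 'x' and 'y' in the data)
g.map(sns.scatterplot, 'x', 'y')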
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np
# Create sample data
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.tan(x)
fig.add_trace(go.Scatterpolar(
r=np.abs(y3),
theta=x * 180 / np.pi,
mode='markers',
name="tan(x)"
), row=1, col=2)
fig.add_trace(go.Scatter(
x=y1, y=y2,
mode='markers',
marker=dict(
size=8,
color=x,
colorscale='Viridis',
showscale=True
),
name="sin(x) vs cos(x)"
), row=2, col=1)
# Update layout
fig.update_layout(height=800, width=800, title_text="Complex
Plotly Layout")
fig.show()
2. Customizing Interactivity
import plotly.graph_objects as go
import pandas as pd
import numpy as np
# Update layout
fig.update_layout(
title='Interactive Scatter Plot with Custom Hover',
xaxis_title='X Axis',
yaxis_title='Y Axis'
)
fig.show()
3. Animated Visualizations
import plotly.graph_objects as go
import pandas as pd
import numpy as np
fig.show()
import folium
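# Sketch: create the base map, centered on London (the zoom level is arbitrary)
m = folium.Map(location=[51.509865, -0.118092], zoom_start=10)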
# Add a marker
folium.Marker(
location=[51.509865, -0.118092],
popup='London',
tooltip='Click for more info'
).add_to(m)
import folium
import pandas as pd
# Sample data
data = pd.DataFrame({
'name': ['London', 'Paris', 'Berlin', 'Rome'],
'lat': [51.509865, 48.856614, 52.520007, 41.902783],
'lon': [-0.118092, 2.352222, 13.404954, 12.496366],
'population': [8982000, 2140526, 3669491, 4342212]
})
3. Choropleth Maps
import folium
import pandas as pd
# Sample data
state_data = pd.DataFrame({
'State': ['California', 'Texas', 'New York', 'Florida'],
'Unemployment': [4.3, 3.8, 4.1, 3.3]
})
# Read a shapefile
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
plt.show()
Always consider your data type and the message you want to convey
when choosing a chart type.
plt.legend()
plt.grid(True, linestyle=':', alpha=0.7)
plt.tight_layout()
plt.show()
6. Be Mindful of Scale
8. Ensure Accessibility
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd
# Main Header
## Subheader
### Sub-subheader
**Bold Text**
*Italic Text*
- Bullet point 1
- Bullet point 2
1. Numbered item 1
2. Numbered item 2
[Link Text](https://www.example.com)

You are a data scientist at an online retailer, and you've been tasked
with analyzing customer behavior to improve sales and customer
retention. You've conducted a thorough analysis of purchase
patterns, customer demographics, and website usage data. Now,
you need to present your findings to the executive team and provide
actionable recommendations.
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd
app = dash.Dash(__name__)
app.layout = html.Div([
html.H1('Customer Segmentation'),
dcc.Dropdown(
id='x-axis',
options=[{'label': i, 'value': i} for i in ['Age',
'Income', 'Purchase Frequency']],
value='Age'
),
dcc.Dropdown(
id='y-axis',
options=[{'label': i, 'value': i} for i in ['Age',
'Income', 'Purchase Frequency']],
value='Income'
),
dcc.Graph(id='segmentation-plot')
])
@app.callback(
    Output('segmentation-plot', 'figure'),
    Input('x-axis', 'value'),
    Input('y-axis', 'value')
)
def update_graph(x_axis, y_axis):
    fig = px.scatter(df, x=x_axis, y=y_axis, color='Segment',
                     hover_data=['CustomerID', 'TotalSpend'])
    return fig

if __name__ == '__main__':
    app.run_server(debug=True)
## Executive Summary
[Brief overview of analysis and key findings]
## 1. Introduction
### 1.1 Background
### 1.2 Objectives
### 1.3 Data Sources
## 2. Data Preparation
### 2.1 Data Cleaning
### 2.2 Feature Engineering
## 4. Customer Segmentation
### 4.1 Methodology
### 4.2 Segment Profiles
### 4.3 Segment Performance
## 7. Recommendations
### 7.1 Marketing Strategies
### 7.2 Product Recommendations
### 7.3 Customer Experience Improvements
## 8. Conclusion
## 9. Appendix
### 9.1 Statistical Tests
### 9.2 Additional Visualizations
### 9.3 Data Dictionary
Step 4: Present Your Findings
1. Start with the Big Picture: Begin with the most important
insights and their potential impact on the business.
2. Use Your Dashboard: Demonstrate key points using the
interactive dashboard you created.
3. Tell a Story: Connect your insights into a coherent narrative
about customer behavior and business opportunities.
4. Focus on Action: Emphasize your recommendations and how
they can be implemented.
5. Be Prepared for Questions: Anticipate potential questions
and have supporting data ready.
6. Provide Next Steps: Outline a clear plan for implementing
your recommendations and measuring their impact.
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv('your_data.csv')
# Add a title
st.title('My First Streamlit App')
This will launch a local server and open the app in your default web
browser.
@st.cache
def load_data():
    # Your data loading logic here
    return data

progress_bar = st.progress(0)
for i in range(100):
    # Do some work
    progress_bar.progress(i + 1)
Before integrating a model into your app, you need to train and save
it. Here's an example using scikit-learn:
from sklearn.ensemble import RandomForestClassifier
import joblib

model = RandomForestClassifier()
model.fit(X_train, y_train)
# Save the model
joblib.dump(model, 'random_forest_model.joblib')
Once your model is saved, you can load it in your Streamlit app and
use it for predictions:
import streamlit as st
import joblib
import pandas as pd
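# Sketch: load the saved model and collect inputs with sliders
# (feature names and ranges are illustrative)
model = joblib.load('random_forest_model.joblib')
feature1 = st.slider('Feature 1', 0.0, 10.0, 5.0)
feature2 = st.slider('Feature 2', 0.0, 10.0, 5.0)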
# Make prediction
if st.button('Predict'):
    input_data = pd.DataFrame([[feature1, feature2]], columns=['feature1', 'feature2'])
    prediction = model.predict(input_data)
    st.write(f'The predicted class is: {prediction[0]}')
This example creates a simple interface where users can adjust
feature values using sliders and get predictions from the model.
Streamlit Sharing
Heroku Deployment
For more control over your deployment, you can use platforms like
Heroku:
Heroku will build and deploy your app, making it accessible via a
unique URL.
1. Create a Dockerfile:
FROM python:3.8-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8501
CMD streamlit run app.py
1. Security: Ensure that sensitive data and API keys are not
exposed in your code. Use environment variables for
configuration.
2. Performance: Optimize your app for performance, especially if
dealing with large datasets or complex computations.
3. Monitoring: Implement logging and monitoring to track usage
and identify issues.
4. Updates: Establish a process for updating your deployed app
as you make improvements or fix bugs.
5. Cost Management: Be aware of the costs associated with
hosting and data transfer, especially for apps with high traffic or
resource-intensive operations.
import yfinance as yf
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def train_arima_model(data):
    model = ARIMA(data['Close'], order=(5,1,0))
    model_fit = model.fit()
    return model_fit
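The helper functions referenced in the app below might look like this (a sketch; the names match the calls that follow):
def get_stock_data(ticker, start_date, end_date):
    return yf.download(ticker, start=start_date, end=end_date)

def forecast_prices(model_fit, steps=30):
    return model_fit.forecast(steps=steps)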
import streamlit as st
import plotly.graph_objects as go
from datetime import datetime, timedelta
# User inputs
ticker = st.text_input('Enter Stock Ticker (e.g., AAPL)',
'AAPL')
start_date = st.date_input('Start Date', datetime.now() -
timedelta(days=365))
end_date = st.date_input('End Date', datetime.now())
if st.button('Analyze'):
    # Fetch data
    data = get_stock_data(ticker, start_date, end_date)
    # Train model
    model = train_arima_model(data)
    # Make prediction
    forecast = forecast_prices(model)
    # Visualize results
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=data.index, y=data['Close'], name='Historical'))
    fig.add_trace(go.Scatter(x=pd.date_range(start=end_date, periods=30), y=forecast, name='Forecast'))
    fig.update_layout(title=f'{ticker} Stock Price', xaxis_title='Date', yaxis_title='Price')
    st.plotly_chart(fig)
    # Display metrics
    st.metric('Current Price', f"${data['Close'].iloc[-1]:.2f}")
    st.metric('Predicted Price (30 days)', f"${forecast.iloc[-1]:.2f}")
Step 4: Deployment
Conclusion
Building data applications is a powerful way to make data science
and machine learning accessible to a wider audience. By leveraging
tools like Streamlit, you can rapidly develop interactive apps that
showcase your data analysis and predictive models. The ability to
deploy these applications to the web further extends their reach and
impact.
Hadoop
Spark
Both Hadoop and Spark have their strengths and are often used
together in big data architectures. While Hadoop excels at batch
processing and storing vast amounts of data economically, Spark
shines in scenarios requiring fast processing and real-time analytics.
import dask.dataframe as dd
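A minimal Dask DataFrame sketch (the file pattern and column names are placeholders; Dask reads the files lazily in partitions):
df = dd.read_csv('data/*.csv')
result = df.groupby('column_name')['value'].mean().compute()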
Dask Arrays
import dask.array as da
# Create a large array
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Perform computations
result = x.mean().compute()
Dask Bags
import dask.bag as db
# Create a bag from text files (the path is a placeholder)
b = db.read_text('data/*.txt')
# Count words
word_counts = b.str.split().flatten().frequencies().compute()
Dask Delayed
from dask import delayed

@delayed
def process_chunk(chunk):
    # Some time-consuming operation (placeholder: sum the chunk)
    result = chunk.sum()
    return result
Setting Up PySpark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApplication") \
    .getOrCreate()
RDDs are the low-level API in Spark. While DataFrames are now
more commonly used, understanding RDDs is still valuable:
# Create an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Perform transformations
squared_rdd = rdd.map(lambda x: x**2)
# Perform actions
sum_of_squares = squared_rdd.reduce(lambda x, y: x + y)
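# Sketch: create a DataFrame from a file (the path and column names are placeholders)
df = spark.read.csv('data.csv', header=True, inferSchema=True)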
# Perform operations
result = df.groupBy("column_name").agg({"value": "mean"})
# Show results
result.show()
SQL Operations
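A minimal SQL sketch: register the DataFrame as a temporary view and query it:
df.createOrReplaceTempView("my_table")
sql_result = spark.sql("SELECT column_name, AVG(value) AS avg_value FROM my_table GROUP BY column_name")
sql_result.show()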
# Prepare data
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
data = assembler.transform(df)
# Split data
train, test = data.randomSplit([0.7, 0.3])
# Make predictions
predictions = model.transform(test)
Best Practices for PySpark
Problem Statement
Dataset
Our dataset is a CSV file containing several years of flight data, with
each row representing a single flight. The file size is over 10GB,
making it too large to process efficiently with traditional Python
libraries.
Setup
spark = SparkSession.builder \
.appName("FlightDataAnalysis") \
.getOrCreate()
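# Sketch: load the flight data (the path is a placeholder) and import the functions used below
from pyspark.sql.functions import col, when, avg
df = spark.read.csv('flights.csv', header=True, inferSchema=True)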
busiest_airports = df.groupBy("Origin") \
.count() \
.orderBy(col("count").desc()) \
.limit(10)
busiest_airports.show()
avg_delay_by_airline = df.groupBy("Airline") \
.agg(avg("ArrDelay").alias("AvgDelay")) \
.orderBy(col("AvgDelay").desc())
avg_delay_by_airline.show()
3. Weather Impact on Flight Delays
weather_impact = df.select(
when(col("WeatherDelay") > 0, "Weather Delay")
.otherwise("No Weather Delay")
.alias("WeatherCondition"),
"ArrDelay"
)
weather_impact_avg = weather_impact.groupBy("WeatherCondition") \
    .agg(avg("ArrDelay").alias("AvgDelay"))
weather_impact_avg.show()
Visualization
import matplotlib.pyplot as plt

# Busiest Airports
busiest_airports_pd = busiest_airports.toPandas()
plt.figure(figsize=(12, 6))
plt.bar(busiest_airports_pd["Origin"], busiest_airports_pd["count"])
plt.title("Top 10 Busiest Airports")
plt.xlabel("Airport")
plt.ylabel("Number of Departures")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Weather Impact
weather_impact_avg_pd = weather_impact_avg.toPandas()
plt.figure(figsize=(8, 6))
plt.bar(weather_impact_avg_pd["WeatherCondition"], weather_impact_avg_pd["AvgDelay"])
plt.title("Impact of Weather on Flight Delays")
plt.xlabel("Weather Condition")
plt.ylabel("Average Delay (minutes)")
plt.tight_layout()
plt.show()
Interpretation of Results
Conclusion
The field of big data is rapidly evolving, and the tools and techniques
for handling large datasets are continually improving. As data
scientists, it's crucial to stay updated with these advancements and
understand when and how to apply different big data tools.
Hadoop, Spark, Dask, and PySpark each have their strengths and are
suited for different scenarios:
Hadoop excels at economical batch processing and distributed storage
of very large datasets.
Spark (and its Python API, PySpark) is well suited to fast, in-memory
processing and near real-time analytics.
Dask scales familiar NumPy and Pandas workflows across cores or a
cluster with minimal code changes.
The choice of tool often depends on factors such as the size and
nature of your data, the type of analysis you need to perform, the
existing infrastructure, and the skills of your team.
As we've seen in the case study, these tools can unlock insights from
datasets that would be impractical to analyze with traditional
methods. They allow us to ask bigger questions and derive more
comprehensive answers from our data.
However, it's important to remember that big data tools are not
always necessary. For many data science tasks, traditional Python
libraries like Pandas and NumPy are sufficient and often more
straightforward to use. The key is to understand the limitations of
these tools and recognize when it's time to scale up to big data
solutions.
Remember, the goal of using big data tools is not just to handle
large volumes of data, but to derive meaningful insights that can
drive decision-making and create value. As you continue your
journey in data science, keep exploring these tools and their
applications, and always strive to ask the right questions of your
data, regardless of its size.
Chapter 20: Time Series
Forecasting
Introduction to Time Series Forecasting
Time series forecasting is a crucial aspect of data science and
predictive analytics, focusing on analyzing and predicting data points
collected over time. This technique is widely used in various fields,
including finance, economics, weather forecasting, and business
planning. Time series data is characterized by its sequential nature,
where observations are recorded at regular intervals, such as daily,
weekly, monthly, or yearly.
ARIMA Model
SARIMA Model
Advantages of SARIMA
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# ARIMA example
model = ARIMA(data, order=(1,1,1))
results = model.fit()

# SARIMA example
model = SARIMAX(data, order=(1,1,1), seasonal_order=(1,1,1,12))
results = model.fit()

# Make predictions
forecast = results.forecast(steps=10)
Prophet for Time Series Prediction
Facebook's Prophet is a powerful and user-friendly tool for time
series forecasting, especially suited for business forecasting tasks.
from prophet import Prophet

# Prepare data (Prophet expects columns named 'ds' and 'y')
df = pd.DataFrame({'ds': date_series, 'y': value_series})

# Fit the model and make predictions
model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
Limitations of Prophet
Problem Statement
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
This step helps us visualize the data and understand its components
(trend, seasonality, and residuals).
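As a sketch of this step (assuming the passenger data has already been loaded into a DataFrame df with a 'Passengers' column and a monthly DatetimeIndex), the decomposition could look like this:

# Decompose the series into trend, seasonal, and residual components
decomposition = seasonal_decompose(df['Passengers'], model='multiplicative', period=12)

# Plot all components in one figure
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()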
Step 2: SARIMA Model
# Fit a SARIMA model on the training data (the order values are illustrative,
# and the train/test split is assumed from the previous step)
model = SARIMAX(train['Passengers'], order=(1,1,1), seasonal_order=(1,1,1,12))
results = model.fit()

# Make predictions
predictions = results.forecast(steps=12)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test['Passengers'], predictions))
print(f'RMSE: {rmse}')
# Plot results
plt.figure(figsize=(12,6))
plt.plot(train, label='Training Data')
plt.plot(test, label='Actual Data')
plt.plot(predictions, label='Predictions')
plt.legend()
plt.title('SARIMA Model Forecast')
plt.show()
# Make predictions
future = model.make_future_dataframe(periods=12, freq='M')
forecast = model.predict(future)

# Calculate RMSE
predictions_prophet = forecast['yhat'][-12:].values
rmse_prophet = np.sqrt(mean_squared_error(test_prophet['y'], predictions_prophet))
print(f'RMSE: {rmse_prophet}')
# Plot results
fig = model.plot(forecast)
plt.title('Prophet Model Forecast')
plt.show()
Conclusion
To get started with TensorFlow and Keras, you'll need to install them
in your Python environment:
# Install first with: pip install tensorflow
import tensorflow as tf
from tensorflow import keras
Building a Simple Neural Network
1. Data cleaning
2. Feature scaling
3. Encoding categorical variables
4. Splitting the data into training, validation, and test sets
Here's an example of preparing a dataset for a binary classification
task:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
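Before compiling, the data must be split and scaled and a network defined. A minimal sketch (the file name and column names are purely illustrative) might look like this:

# Load and split the (hypothetical) dataset
data = pd.read_csv('customer_data.csv')
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define a simple feed-forward network for binary classification
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train_scaled.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])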
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
Regularization Techniques
from tensorflow.keras import regularizers

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],),
                       kernel_regularizer=regularizers.l2(0.01)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu',
                       kernel_regularizer=regularizers.l2(0.01)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(16, activation='relu',
                       kernel_regularizer=regularizers.l2(0.01)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid')
])
Hyperparameter Tuning
# 'model' here is assumed to be a scikit-learn-compatible wrapper around
# the Keras network (for example, KerasClassifier from scikeras)
param_dist = {
    'neurons': [32, 64, 128],
    'dropout_rate': [0.2, 0.3, 0.4],
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'epochs': [50, 100, 150]
}

random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                                   n_iter=10, cv=3, n_jobs=-1)
random_search_result = random_search.fit(X_train_scaled, y_train)
CNN Architecture
import tensorflow as tf
from tensorflow import keras
import numpy as np
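As an illustrative sketch, a small CNN for 28x28 grayscale images (for example, MNIST digits) could be defined as follows:

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

The convolution and pooling layers extract spatial features, which the dense layers then map to class probabilities.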
Data Augmentation
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.1
)
Transfer Learning
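The core idea of transfer learning is to reuse a network pretrained on a large dataset as a feature extractor. A hedged sketch using MobileNetV2 with frozen ImageNet weights (the input size and classifier head are illustrative choices):

base_model = keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                            include_top=False,   # drop the ImageNet classifier head
                                            weights='imagenet')
base_model.trainable = False  # freeze the pretrained convolutional base

model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(1, activation='sigmoid')  # new head for a binary task
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])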
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
print('Confusion Matrix:')
print(metrics.confusion_matrix(y_test, y_pred_classes))
# Compute the ROC curve (y_pred is assumed to hold the model's predicted probabilities)
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred)
roc_auc = metrics.auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
# Plot training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()
Conclusion
Deep learning has become an essential tool in the data scientist's
toolkit, enabling the development of highly accurate predictive
models for complex problems. In this chapter, we've explored the
fundamentals of neural networks, implemented deep learning
models for data classification, delved into convolutional neural
networks for image data, and applied these techniques to a real-
world case study on customer churn prediction.
As you continue to develop your skills in deep learning for data
science, remember that successful implementation requires not only
technical proficiency but also a deep understanding of the problem
domain and the ability to interpret and communicate results
effectively. By combining these skills, you'll be well-equipped to
tackle a wide range of data science challenges using deep learning
techniques.
Chapter 22: The Future of
Data Science with Python
Emerging Trends in Data Science and AI
The field of data science and artificial intelligence is rapidly evolving,
with new technologies and methodologies emerging at an
unprecedented pace. As we look towards the future, several key
trends are shaping the landscape of data science with Python:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
2. Explainable AI (XAI)
import shap
import xgboost as xgb
import tensorflow as tf
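As a minimal sketch of how SHAP can explain a tree-based model (the iris dataset is used purely as a stand-in):

from sklearn.datasets import load_iris

# Train a small XGBoost classifier on a stand-in dataset
iris = load_iris()
model = xgb.XGBClassifier(n_estimators=50).fit(iris.data, iris.target)

# Compute SHAP values and summarize feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(iris.data)
shap.summary_plot(shap_values, iris.data, feature_names=iris.feature_names)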
4. Reinforcement Learning
import gym
from stable_baselines3 import PPO
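A tiny illustrative example of training a PPO agent on the classic CartPole environment (default hyperparameters, arbitrary timestep budget):

# Create a standard benchmark environment and train a PPO agent
env = gym.make('CartPole-v1')
model = PPO('MlpPolicy', env, verbose=0)
model.learn(total_timesteps=10_000)

# Evaluate the trained policy
from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
print(f"Mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")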
for i in range(100):
    params = opt.step(quantum_circuit, params)
    if (i + 1) % 20 == 0:
        print(f"Step {i+1}: cost = {quantum_circuit(params)}")
6. Federated Learning
Andrew Ng (@AndrewYNg)
Yann LeCun (@ylecun)
Fei-Fei Li (@drfeifei)
Sebastian Raschka (@rasbt)
François Chollet (@fchollet)
OpenAI (@OpenAI)
Google AI (@GoogleAI)
DeepMind (@DeepMind)
MIT Technology Review (@techreview)
PyData
PyCon
NeurIPS (Conference on Neural Information Processing Systems)
ICML (International Conference on Machine Learning)
KDD (Knowledge Discovery and Data Mining)
Regularly try out new Python libraries and tools. Set up a virtual
environment and experiment with emerging technologies:
1. GitHub Repository
mkdir data_science_portfolio
cd data_science_portfolio
git init
touch README.md
# Add your projects
git add .
git commit -m "Initial commit"
git remote add origin https://github.com/yourusername/data_science_portfolio.git
git push -u origin master
2. Diverse Projects
3. Kaggle Competitions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
## Data Preprocessing
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('customer_churn.csv')
X = data.drop('Churn', axis=1)
y = data['Churn']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Model Training
Now, let's train a logistic regression model:
model = LogisticRegression()
model.fit(X_train, y_train)
...
### 5. Interactive Dashboards
```python
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd
app = dash.Dash(__name__)
app.layout = html.Div([
dcc.Dropdown(
id='feature-dropdown',
options=[{'label': i, 'value': i} for i in
df.columns],
value='default_feature'
),
dcc.Graph(id='feature-histogram')
])
@app.callback(
Output('feature-histogram', 'figure'),
Input('feature-dropdown', 'value')
)
def update_graph(selected_feature):
fig = px.histogram(df, x=selected_feature)
return fig
if __name__ == '__main__':
app.run_server(debug=True)
from flask import Flask, render_template
import markdown2

app = Flask(__name__)
@app.route('/')
def home():
return render_template('home.html')
@app.route('/blog/<post_id>')
def blog_post(post_id):
with open(f'posts/{post_id}.md', 'r') as f:
content = f.read()
html_content = markdown2.markdown(content)
return render_template('blog_post.html',
content=html_content)
if __name__ == '__main__':
app.run(debug=True)
1. Specialize in a Domain
Healthcare Analytics
Financial Data Science
Marketing Analytics
Environmental Data Science
plt.figure(figsize=(12, 6))
sns.barplot(x=readmission_rates.index,
y=readmission_rates.values)
plt.title('Readmission Rates by Diagnosis Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Ensemble Methods
Bayesian Methods
Generative Adversarial Networks (GANs)
Reinforcement Learning
import tensorflow as tf

def make_generator_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(7*7*256, use_bias=False, input_shape=(100,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(),
        tf.keras.layers.Reshape((7, 7, 256)),
        tf.keras.layers.Conv2DTranspose(128, (5, 5), strides=(1, 1), padding='same', use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(),
        # ... additional Conv2DTranspose layers would follow to reach the final image shape ...
    ])
    return model

generator = make_generator_model()
noise = tf.random.normal([1, 100])
generated_image = generator(noise, training=False)
# Prepare features
feature_columns = ["feature1", "feature2", "feature3"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data_assembled = assembler.transform(data)

# Make predictions with a previously trained MLlib model
predictions = model.transform(data_assembled)
predictions.select("prediction", "target", "features").show()
# Calculate metrics
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{'gender': 0}],
                                  privileged_groups=[{'gender': 1}])
Data Storytelling
Project Management
Team Collaboration
Business Acumen
import random

# Example greeting and topic lists (illustrative values)
greetings = ["Hello!", "Hi there!", "Welcome!"]
topics = ["machine learning", "statistics", "visualization", "deep learning"]

def chatbot():
    print(random.choice(greetings) + " Welcome to the Data Science Community!")
    while True:
        user_input = input("What data science topic would you like to discuss? ").lower()
        if user_input in topics:
            print(f"Great choice! {user_input.capitalize()} is a fascinating area. What specific aspect interests you?")
        elif user_input == "bye":
            print("Thank you for chatting. Goodbye!")
            break
        else:
            print(f"I'm not sure about that topic. How about we discuss one of these: {', '.join(topics)}?")

chatbot()
Remember, the most successful data scientists are those who can
bridge the gap between technical expertise and real-world
application. As you advance in your career, focus on not just building
models, but on solving problems and creating value. Stay
passionate, stay ethical, and keep pushing the boundaries of what's
possible with data science and Python.
Conclusion
Appendix A: Python Data
Science Cheatsheet
Data Science with Python: From Data
Wrangling to Visualization
This comprehensive cheatsheet covers essential Python libraries and
techniques for data science, including data manipulation, analysis,
visualization, and machine learning. It serves as a quick reference
guide for data scientists, analysts, and developers working with
Python for data-related tasks.
Table of Contents
1. Data Manipulation with Pandas
2. Data Visualization with Matplotlib and Seaborn
3. Statistical Analysis with SciPy and StatsModels
4. Machine Learning with Scikit-learn
5. Deep Learning with TensorFlow and Keras
6. Natural Language Processing with NLTK
7. Big Data Processing with PySpark
8. Time Series Analysis with Statsmodels
9. Geospatial Analysis with GeoPandas
10. Web Scraping with BeautifulSoup and Scrapy
import pandas as pd
Creating DataFrames
# From a dictionary
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
# Select a column
df['column_name']
# Drop a column
df = df.drop('column_name', axis=1)
# Rename columns
df = df.rename(columns={'old_name': 'new_name'})
Data Cleaning
# Remove duplicates
df = df.drop_duplicates()
# Handle missing values
df = df.dropna() # Drop rows with any missing values
df = df.fillna(0) # Fill missing values with 0
# Replace values
df['column'] = df['column'].replace('old_value', 'new_value')

# Multiple aggregations
df.groupby('category').agg({'value': ['mean', 'sum', 'count']})
Importing Libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Line plot
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot')
plt.show()
# Scatter plot
plt.scatter(x, y)
plt.show()
# Bar plot
plt.bar(categories, values)
plt.show()
# Histogram
plt.hist(data, bins=20)
plt.show()
# Box plot
plt.boxplot(data)
plt.show()
Matplotlib: Customizing Plots
# Adding a legend
plt.plot(x1, y1, label='Line 1')
plt.plot(x2, y2, label='Line 2')
plt.legend()
# Box plot
sns.boxplot(x='category', y='value', data=df)
# Violin plot
sns.violinplot(x='category', y='value', data=df)
# Heatmap
sns.heatmap(correlation_matrix, annot=True)
# Pair plot
sns.pairplot(df)
# Distribution plot (histplot replaces the deprecated distplot)
sns.histplot(df['column'])
# Set style
sns.set_style("whitegrid")
Importing Libraries
import numpy as np
from scipy import stats
import statsmodels.api as sm
Descriptive Statistics
# Percentiles
np.percentile(data, [25, 50, 75])
Hypothesis Testing
# T-test
t_statistic, p_value = stats.ttest_ind(group1, group2)
# ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Chi-square test (returns statistic, p-value, degrees of freedom, expected frequencies)
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)
# Pearson correlation
r, p_value = stats.pearsonr(x, y)
# Linear regression
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())
4. Machine Learning with Scikit-learn
Scikit-learn is a comprehensive library for machine learning in
Python, offering various algorithms for classification, regression,
clustering, and more.
Importing Scikit-learn
from sklearn import preprocessing, metrics, model_selection
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
Data Preprocessing
# Standardize features
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Encode categorical variables
encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(y)
Classification
# Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Decision Tree
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Random Forest
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
Clustering
# K-Means Clustering
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.predict(X)
Model Evaluation
# Classification metrics
accuracy = metrics.accuracy_score(y_test, y_pred)
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
classification_report = metrics.classification_report(y_test, y_pred)
# Regression metrics
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
# Cross-validation
cv_scores = model_selection.cross_val_score(model, X, y, cv=5)
Importing Libraries
import tensorflow as tf
from tensorflow import keras
Building a Neural Network
model = keras.Sequential([
keras.layers.Dense(64, activation='relu', input_shape=
(input_dim,)),
keras.layers.Dense(32, activation='relu'),
keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy',
metrics=['accuracy'])
model = keras.Sequential([
keras.layers.Conv2D(32, (3, 3), activation='relu',
input_shape=(28, 28, 1)),
keras.layers.MaxPooling2D((2, 2)),
keras.layers.Flatten(),
keras.layers.Dense(64, activation='relu'),
keras.layers.Dense(10, activation='softmax')
])
model = keras.Sequential([
keras.layers.LSTM(64, input_shape=(sequence_length,
features)),
keras.layers.Dense(1)
])
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
Text Preprocessing
# Tokenization
tokens = word_tokenize(text)
# Lowercasing
tokens = [token.lower() for token in tokens]
# Removing stopwords
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in
stop_words]
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token
in tokens]
Part-of-Speech Tagging
pos_tags = nltk.pos_tag(tokens)
nltk.download('maxent_ne_chunker')
nltk.download('words')
named_entities = nltk.ne_chunk(pos_tags)
Sentiment Analysis
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
sentiment_scores = sia.polarity_scores(text)
Creating a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()
Reading Data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
Data Manipulation
# Select columns
df_selected = df.select("col1", "col2")
# Filter rows
df_filtered = df.filter(df.age > 30)
# Join DataFrames
df_joined = df1.join(df2, on="key_column")
# Prepare features
assembler = VectorAssembler(inputCols=["feature1",
"feature2"], outputCol="features")
# Make predictions
predictions = model.transform(test_data)
8. Time Series Analysis with Statsmodels
Statsmodels provides classes and functions for statistical models,
including time series analysis.
Importing Statsmodels
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
decomposition = sm.tsa.seasonal_decompose(ts,
model='additive')
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
ARIMA Model
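A minimal fit-and-forecast sketch (the (1,1,1) order is chosen purely for illustration) using the time series ts from above:

# Fit an ARIMA(1,1,1) model
model = ARIMA(ts, order=(1, 1, 1))
results = model.fit()
print(results.summary())

# Forecast the next 10 periods
forecast = results.forecast(steps=10)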
Importing GeoPandas
import geopandas as gpd

gdf = gpd.read_file("shapefile.shp")
Basic Operations
# Plot geometries
gdf.plot()
# Calculate area
gdf['area'] = gdf.geometry.area
# Calculate centroid
gdf['centroid'] = gdf.geometry.centroid
# Spatial join
joined = gpd.sjoin(gdf1, gdf2, how="inner", op="intersects")
# Check CRS
print(gdf.crs)
# Reproject to a different CRS
gdf = gdf.to_crs("EPSG:4326")
Spatial Analysis
# Buffer
gdf['buffer'] = gdf.geometry.buffer(1000)
# Intersection
intersection = gpd.overlay(gdf1, gdf2, how='intersection')
# Union
union = gpd.overlay(gdf1, gdf2, how='union')
BeautifulSoup
import requests
from bs4 import BeautifulSoup

# Fetch webpage
url = "http://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find elements
title = soup.find('title').text
paragraphs = soup.find_all('p')
Scrapy
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Yield the page title from each crawled page
        yield {'title': response.css('title::text').get()}
1. NumPy
Key Features:
Use Cases:
Example:
import numpy as np
# Create a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6]])
2. Pandas
Key Features:
Use Cases:
Example:
import pandas as pd
# Filter data
filtered_df = df[df['column_name'] > 5]
3. Matplotlib
Wide variety of plots and charts (line plots, scatter plots, bar
charts, histograms, etc.)
Fine-grained control over plot elements
Support for multiple output formats
Customizable styles and layouts
Use Cases:
Example:
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
4. Seaborn
Key Features:
Use Cases:
Example:
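A representative Seaborn call (assuming a tidy DataFrame df with 'value1', 'value2', and 'category' columns):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot colored by a categorical column
sns.scatterplot(x='value1', y='value2', hue='category', data=df)
plt.show()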
5. Scikit-learn
Key Features:
Use Cases:
Example:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
6. TensorFlow
Key Features:
Use Cases:
Example:
import tensorflow as tf
from tensorflow.keras import layers

# A minimal model so compile() has something to act on
model = tf.keras.Sequential([layers.Dense(10, activation='relu', input_shape=(4,)),
                             layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')
7. PyTorch
Key Features:
Use Cases:
Example:
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    # Minimal one-layer network so the example is self-contained
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)
    def forward(self, x):
        return self.fc(x)

model = SimpleNet()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
8. SciPy
Key Features:
Use Cases:
Example:
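A short illustrative example combining SciPy's statistics and optimization modules:

import numpy as np
from scipy import stats, optimize

# Two-sample t-test on synthetic data
group1 = np.random.normal(0.0, 1.0, 100)
group2 = np.random.normal(0.5, 1.0, 100)
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Minimize a simple one-dimensional function
res = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(res.x)  # approximately 3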
9. Statsmodels
Key Features:
Use Cases:
Econometric analysis
Time series forecasting
Statistical hypothesis testing
Regression analysis
Example:
import statsmodels.api as sm

# Fit an OLS regression (x and y are assumed to be numeric arrays)
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())
10. Plotly
Key Features:
Interactive plots
Wide variety of chart types
Customizable layouts and styles
Support for both online and offline plotting
Use Cases:
Example:
import plotly.graph_objects as go

# Simple interactive line chart
fig = go.Figure(data=go.Scatter(x=[1, 2, 3, 4], y=[10, 15, 13, 17], mode='lines+markers'))
fig.update_layout(title='Interactive Line Chart')
fig.show()
11. Bokeh
Key Features:
Use Cases:
Example:
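A minimal Bokeh example (writes an HTML file and opens it in the browser):

from bokeh.plotting import figure, show, output_file

output_file("line_plot.html")

p = figure(title="Simple Line Plot", x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
show(p)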
12. NLTK
Key Features:
Use Cases:
Example:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Tokenize a sample sentence
text = "Natural language processing with NLTK is straightforward."
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)
13. Gensim
Key Features:
Use Cases:
Topic modeling
Document indexing
Similarity retrieval with large corpora
Natural language processing
Example:
# Sample documents
documents = [
"Human machine interface for lab abc computer
applications",
"A survey of user opinion of computer system response
time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error
measurement"
]
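Continuing the example, a sketch of turning these documents into a bag-of-words corpus and fitting a small LDA topic model (the number of topics is arbitrary):

from gensim import corpora, models

# Naive whitespace tokenization, then build a dictionary and corpus
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a small LDA topic model
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)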
14. XGBoost
Key Features:
Use Cases:
Regression problems
Classification tasks
Ranking problems
User-defined prediction tasks
Example:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Set parameters
params = {
    'max_depth': 3,
    'eta': 0.1,
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse'
}

# Train on DMatrix data (X_train, X_test, y_train, y_test are assumed to exist)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
model = xgb.train(params, dtrain, num_boost_round=100)

# Make predictions
preds = model.predict(dtest)
rmse = mean_squared_error(y_test, preds) ** 0.5
print(f"RMSE: {rmse:.3f}")
15. Dask
Key Features:
Use Cases:
Example:
import dask.dataframe as dd

# Read a large CSV lazily (the path is illustrative)
df = dd.read_csv('large_dataset.csv')

# Perform operations
result = df.groupby('column').mean().compute()
print(result)
16. Streamlit
Key Features:
Use Cases:
Example:
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt
# Load data
@st.cache
def load_data():
return pd.read_csv('data.csv')
data = load_data()
# Create a title
st.title('Simple Data Explorer')
# Create a plot
fig, ax = plt.subplots()
data.plot(kind='scatter', x='column1', y='column2', ax=ax)
st.pyplot(fig)
17. Flask
Key Features:
Use Cases:
Example:
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model.pkl')  # path to a previously trained model (illustrative)

@app.route('/predict', methods=['POST'])
def predict():
    # Get data from request
    data = request.json
    # Make prediction and return it as JSON
    prediction = model.predict(data)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
18. PySpark
Key Features:
Use Cases:
Example:
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("Example").getOrCreate()

# Load data
data = spark.read.csv("data.csv", header=True, inferSchema=True)
# Prepare features
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["feature1",
"feature2"], outputCol="features")
data = assembler.transform(data)
# Split data
train, test = data.randomSplit([0.7, 0.3], seed=42)
# Train model
lr = LogisticRegression(featuresCol="features",
labelCol="label")
model = lr.fit(train)
# Make predictions
predictions = model.transform(test)
predictions.select("label", "prediction").show()
19. NetworkX
Key Features:
Example:
import networkx as nx
import matplotlib.pyplot as plt
# Create a graph
G = nx.Graph()
# Add edges
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# Draw the graph
nx.draw(G, with_labels=True)
plt.show()
Key Features:
Use Cases:
Example:
chart.show()
Introduction
Data science is a complex field that involves numerous steps, from
data collection to model deployment. Throughout this process, data
scientists encounter various challenges and problems that can hinder
their progress or affect the quality of their results. This appendix
aims to provide a comprehensive guide to troubleshooting common
data science problems, offering practical solutions and best practices
for each stage of the data science lifecycle.
Solutions:
Example:
Solutions:
Example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Solutions:
Example:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
print("Original data:")
print(data)
# KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
data_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(data), columns=data.columns)
print("\nKNN imputation:")
print(data_knn_imputed)
Solutions:
Example:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from scipy import stats
# Z-score method
z_scores = np.abs(stats.zscore(data))
z_score_outliers = np.where(z_scores > 3)[0]
# IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]
Solutions:
Example:
import pandas as pd
import numpy as np
from fuzzywuzzy import process
print("Original data:")
print(data)
data['Age'] = data['Age'].apply(clean_age)
def validate_email(email):
return '@' in email and '.' in email.split('@')[1]
data['Valid_Email'] = data['Email'].apply(validate_email)
unique_names = data['Name'].unique()
data['Name_Cleaned'] = data['Name'].apply(lambda x:
find_best_match(x, unique_names))
print("\nCleaned data:")
print(data)
Solutions:
Example:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing  # load_boston was removed from recent scikit-learn releases
from sklearn.feature_selection import SelectKBest,
f_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
Solutions:
Example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from category_encoders import TargetEncoder
print("Original data:")
print(data)
print("\nEncoded data:")
print(encoded_data)
Overfitting occurs when a model learns the training data too well,
including noise, while underfitting happens when a model is too
simple to capture the underlying patterns in the data.
Solutions:
Example:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge,
Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
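Using the imports above, one way to see the bias/variance trade-off in action is to compare an unregularized linear model against regularized and tree-based alternatives with cross-validation. The data below is synthetic, so the exact scores are only illustrative:

# Synthetic regression data
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

models = {
    'LinearRegression': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=1.0),
    'Lasso (L1)': Lasso(alpha=0.1),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Higher mean R^2 with a small spread suggests a better bias/variance balance
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: R^2 = {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")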
Solutions:
Example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedRandomForestClassifier
# SMOTE resampling
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
rf_smote = RandomForestClassifier(random_state=42)
rf_smote_scores = cross_val_score(rf_smote, X_resampled, y_resampled, cv=5, scoring='f1')
print(f"Random Forest with SMOTE F1 scores: {rf_smote_scores.mean():.3f} "
      f"(+/- {rf_smote_scores.std() * 2:.3f})")
Solutions:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,
GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from scipy.stats import randint
# Grid Search (the search space below is illustrative)
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
y_pred_grid = grid_search.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred_grid):.3f}")

# Random Search (the distributions are illustrative)
param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 5, 10, 20]}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist, n_iter=20, cv=5,
                                   n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
Solutions:
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Create a DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
# PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(10, 5))
plt.subplot(121)
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=y,
palette='viridis')
plt.title('PCA Visualization')
# t-SNE visualization
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
plt.subplot(122)
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=y,
palette='viridis')
plt.title('t-SNE Visualization')
plt.tight_layout()
plt.show()
Solutions:
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance
import shap
# Feature Importance
feature_importance = rf.feature_importances_
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10, 6))
plt.barh(pos, feature_importance[sorted_idx],
align='center')
plt.yticks(pos, np.array(feature_names)[sorted_idx])
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
# Permutation Importance
perm_importance = permutation_importance(rf, X_test, y_test,
n_repeats=10, random_state=42)
sorted_idx = perm_importance.importances_mean.argsort()
plt.figure(figsize=(10, 6))
plt.boxplot(perm_importance.importances[sorted_idx].T,
vert=False, labels=np.array(feature_names)[sorted_idx])
plt.title("Permutation Importance")
plt.tight_layout()
plt.show()
# SHAP Values
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X_test, plot_type="bar")
plt.tight_layout()
plt.show()
# Partial Dependence Plots (PartialDependenceDisplay supersedes plot_partial_dependence)
from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(12, 6))
PartialDependenceDisplay.from_estimator(rf, X_train, features=[0, 1, 2],
                                        feature_names=feature_names, ax=ax)
plt.tight_layout()
plt.show()
Solutions:
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from joblib import Parallel, delayed
# Helper that fits a single tree (defined here so the snippet is self-contained)
def train_tree(tree, X, y):
    return tree.fit(X, y)

start_time = time.time()
trees = [RandomForestRegressor(n_estimators=1, random_state=i) for i in range(100)]
fitted_trees = Parallel(n_jobs=-1)(delayed(train_tree)(tree, X_train, y_train) for tree in trees)
custom_parallel_time = time.time() - start_time

# standard_time is assumed to come from an earlier, sequential baseline run
print(f"Custom parallel implementation training time: {custom_parallel_time:.2f} seconds")
print(f"Speedup: {standard_time / custom_parallel_time:.2f}x")
Solutions:
Example:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
if i % 100000 == 0:
print(f"Processed {i} samples")
Solutions:
# Prepare features
feature_cols = [f"feature_{i}" for i in range(20)]
assembler = VectorAssembler(inputCols=feature_cols,
outputCol="features")
data = assembler.transform(data)
# Make predictions
predictions = model.transform(test_data)
Solutions:
import joblib
from flask import Flask, request, jsonify
import numpy as np
app = Flask(__name__)
if __name__ == '__main__':
app.run(debug=True)
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
Solutions:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib
import time
# Make predictions
y_pred = current_model.predict(X_new)
# Calculate accuracy
accuracy = accuracy_score(y_new, y_pred)
print(f"Current model accuracy: {accuracy:.4f}")
Solutions:
# Create DataFrame
df = pd.DataFrame({
'age': age,
'gender': gender,
'income': income,
'high_income': high_income
})
# Split data
X = df[['age', 'gender']]
y = df['high_income']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(pd.get_dummies(X_train), y_train)
# Make predictions
y_pred = model.predict(pd.get_dummies(X_test))
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{protected_attribute: 'female'}],
                                  privileged_groups=[{protected_attribute: 'male'}])
print("\nBefore mitigation:")
check_bias(X_test, y_test, y_pred)
dataset_train = BinaryLabelDataset(df=pd.concat([X_train, y_train], axis=1),
                                   label_names=['high_income'],
                                   protected_attribute_names=['gender'])
dataset_train_transformed = rw.fit_transform(dataset_train)
print("\nAfter mitigation:")
check_bias(X_test, y_test, y_pred_fair)
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
data = pd.DataFrame({
'age': np.random.randint(18, 80, n_samples),
'income': np.random.normal(50000, 15000, n_samples),
'zipcode': np.random.randint(10000, 99999, n_samples),
'sensitive_attribute': np.random.choice(['A', 'B', 'C'],
n_samples)
})
print("Original data:")
print(data.head())
# Anonymization techniques
print("\nAnonymized data:")
print(anonymized_data.head())
# Check k-anonymity
k = anonymized_data.groupby(list(anonymized_data.columns)).size().min()
print(f"\nk-anonymity: {k}")
Solutions:
1. Version Control: Use Git for version control of code and small
datasets.
2. Data Version Control: Implement tools like DVC (Data Version
Control) for managing large datasets and ML models.
3. Containerization: Use Docker to create reproducible
development environments.
4. Notebook Version Control: Use tools like Jupytext to version
control Jupyter notebooks effectively.
5. Collaborative Platforms: Utilize platforms like Databricks or
Google Colab for collaborative data science work.
# Initialize DVC
dvc init
This example demonstrates how to use DVC alongside Git for version
control of large datasets and models:
Conclusion
Troubleshooting in data science is an essential skill that requires a
combination of technical knowledge, problem-solving abilities, and
domain expertise. By understanding common challenges and their
solutions across various stages of the data science lifecycle,
practitioners can more effectively navigate the complexities of real-
world projects.