
Data Science

https://yashnote.notion.site/Data-Science-1180e70e8a0f80bbbfa2fdee5d1f1d85?pvs=4
Unit 1
Introduction to Data Science
Difference among AI, Machine Learning, and Data Science
Comparison of AI, ML, and Data Science:
Basic Introduction of Python
Key Features of Python:
Common Use Cases of Python:
Python for Data Science
1. Pandas
2. NumPy
3. Scikit-learn
4. Data Visualization
5. Advanced Python Concepts for Data Science
Introduction to Google Colab
Key Features of Google Colab:
Use Cases of Google Colab:
Popular Dataset Repositories
Discussion on Some Datasets:
Data Pre-processing
Python Example: Data Cleaning (Handling Missing Values)
Data Scales
Python Example: Encoding Ordinal Data
Similarity and Dissimilarity Measures
Python Example: Cosine and Euclidean Similarity
Sampling and Quantization of Data
Sampling:
Quantization:
Python Example: Random Sampling and Quantization
Filtering
Python Example: Moving Average and Median Filter
Data Transformation
Python Example: Data Normalization and Log Transformation

Data Merging
Python Example: Merging DataFrames
Data Visualization
Python Example: Basic Data Visualization using matplotlib
Principal Component Analysis (PCA)
Python Example: PCA in Python
Correlation
Python Example: Calculating Correlation
Chi-Square Test
Python Example: Chi-Square Test
Summary
Unit 2
Regression Analysis
Linear Regression
Python Example: Simple Linear Regression
Generalized Linear Models (GLM)
Python Example: Logistic Regression
Regularized Regression
Python Example: Ridge and Lasso Regression
Summary of Key Concepts
Cross-Validation
Types of Cross-Validation:
Python Example: K-Fold Cross-Validation
Training and Testing Data Set
Python Example: Train-Test Split
Overview of Nonlinear Regression
Python Example: Nonlinear Regression (Polynomial Regression)
Overview of Ridge Regression
Advantages:
Python Example: Ridge Regression
Summary of Key Concepts
Latent Variables
Examples:
Structural Equation Modeling (SEM)
Key Components of SEM:
Python Libraries for SEM:
Python Example: Factor Analysis (Latent Variable)
Factor Analysis Example (Latent Variables Extraction)

SEM Example Using semopy
Structural Equation Model Example:
Explanation:
Summary of Key Concepts

Unit 1
Introduction to Data Science
Data Science is a multidisciplinary field that combines statistics, computer
science, mathematics, and domain-specific knowledge to extract insights and
knowledge from structured and unstructured data. Data Science applies scientific
methods, processes, algorithms, and systems to analyze vast amounts of data
and generate actionable insights. In today's world, where data is generated in
massive volumes from various sources such as social media, business
transactions, IoT devices, etc., Data Science plays a critical role in making sense
of that data.
Key Aspects of Data Science:

1. Data Collection: Gathering data from various sources (web scraping, APIs, surveys, sensors, etc.).

2. Data Cleaning: Data often contains noise, missing values, and inconsistencies, which need to be addressed through data cleaning techniques.

3. Data Exploration and Analysis: Exploratory Data Analysis (EDA) involves visualizing and summarizing the key properties of the data.

4. Statistical Analysis: Using statistics and probability to interpret data patterns, trends, and relationships.

5. Data Modeling: Applying algorithms and machine learning models to make predictions or discover insights from the data.

6. Data Visualization: Presenting data in visual formats (graphs, charts, etc.) to communicate findings to stakeholders.

Skills Required for Data Science:

1. Mathematics and Statistics: Understanding of concepts like probability, distributions, hypothesis testing, linear algebra, etc.

2. Programming: Expertise in programming languages like Python, R, and SQL for data manipulation and analysis.

3. Machine Learning: Knowledge of machine learning algorithms like linear regression, decision trees, clustering, etc.

4. Data Wrangling and Cleaning: Ability to preprocess data, handle missing data, and deal with data inconsistencies.

5. Data Visualization: Familiarity with tools like Matplotlib, Seaborn, Tableau, or Power BI to create meaningful visualizations.

Applications of Data Science:

Healthcare: Predictive modeling for disease outbreaks, personalized medicine, medical image analysis.

Finance: Fraud detection, algorithmic trading, risk management.

Retail: Recommendation engines, inventory management, market analysis.

Entertainment: Recommendation systems in streaming services, content analysis.

Transportation: Route optimization, self-driving cars, traffic prediction.

Data Science is essentially a combination of statistics, domain expertise, and computer science to interpret large-scale data. It is vital in decision-making processes in various sectors such as business, healthcare, finance, and government. With advancements in big data technologies and AI, Data Science is a field with immense growth potential.

Difference among AI, Machine Learning, and Data Science
Artificial Intelligence (AI):
AI is a broader concept that refers to machines or systems that mimic human intelligence to perform tasks. It involves creating systems that can perceive their environment and take actions to achieve specific goals. AI encompasses various subfields like Natural Language Processing (NLP), computer vision, robotics, and more.

Key Points:

Goal of AI: To simulate human intelligence in machines.

Techniques in AI: Search algorithms, expert systems, neural networks, etc.

Types of AI:

Narrow AI: AI systems designed for specific tasks (e.g., Siri, Alexa,
recommendation engines).

General AI: A theoretical concept where machines would possess the ability to perform any cognitive task that a human can.

Super AI: A future concept where AI surpasses human intelligence.

Examples:

Self-driving cars (AI-driven vehicles).

Image recognition software (AI-based vision systems).

Machine Learning (ML):


Machine Learning is a subset of AI that involves the development of algorithms
that allow computers to learn patterns and make decisions without being explicitly
programmed. ML systems improve their performance over time by learning from
data.

Key Points:

Goal of ML: To enable machines to learn from data and improve with
experience.

Techniques in ML: Supervised learning, unsupervised learning, reinforcement learning, etc.

Types of ML:

Supervised Learning: The algorithm is trained on labeled data (e.g., classification, regression).

Unsupervised Learning: The algorithm is used to find hidden patterns in unlabeled data (e.g., clustering, association).

Reinforcement Learning: The model learns through trial and error to maximize rewards.

Examples:

Spam detection in emails (Supervised ML).

Customer segmentation (Unsupervised ML).

Data Science:
Data Science is a more comprehensive field that integrates AI, ML, and other tools
to work with data in various forms. It focuses on extracting insights and
knowledge from data using a mix of statistics, algorithms, and domain knowledge.
While AI and ML are tools used in Data Science, Data Science is concerned with
the entire data lifecycle from collection to insight generation.

Key Points:

Goal of Data Science: To extract actionable insights from large datasets using
a mix of techniques.

Techniques in Data Science: Data wrangling, data visualization, machine learning, and statistical analysis.

Scope: Data Science includes AI, ML, and various other techniques like data mining and business intelligence.

Examples:

Analyzing sales data to predict future trends (Data Science using ML algorithms).

Building recommendation engines for e-commerce platforms (Data Science using AI and ML).

Comparison of AI, ML, and Data Science:


| Aspect | Artificial Intelligence | Machine Learning | Data Science |
| --- | --- | --- | --- |
| Definition | Field of creating intelligent machines | Subfield of AI focused on learning | Broad field focusing on data insights |
| Scope | Very broad, includes ML and more | Narrower, focused on learning from data | Comprehensive, includes ML, AI, and more |
| Objective | Simulate human intelligence | Learn patterns from data | Extract insights from data |
| Techniques | Neural networks, NLP, etc. | Supervised, unsupervised learning | Data wrangling, visualization, ML |
| Applications | Robotics, game playing, virtual assistants | Spam filters, recommendation systems | Market analysis, fraud detection |

In conclusion, AI is the overarching concept that aims to create intelligent systems, Machine Learning is a subset of AI that focuses on algorithms capable of learning from data, and Data Science is a broader discipline that leverages AI and ML, along with statistics and other techniques, to extract insights from data.

Basic Introduction of Python


Python is a high-level, interpreted, general-purpose programming language. It
was created by Guido van Rossum and first released in 1991. Python emphasizes
code readability and simplicity, making it an ideal choice for beginners and
professionals alike. Its extensive libraries and frameworks make it highly versatile,
used across various domains, including web development, data analysis, artificial
intelligence, and scientific computing.

Key Features of Python:


1. Easy Syntax: Python has a clear and straightforward syntax that resembles
plain English, making it easier to learn and write code.

2. Interpreted Language: Python code is executed line by line, which allows for
interactive debugging.

3. Dynamically Typed: Variables in Python don’t need an explicit declaration, as the type is inferred during execution.

4. Cross-Platform: Python is available on multiple platforms like Windows, Linux, and macOS.

5. Extensive Libraries: Python has a vast collection of standard libraries and external packages for different tasks (e.g., NumPy for numerical computations, Pandas for data manipulation, Matplotlib for data visualization, etc.).

6. Object-Oriented and Functional: Python supports both object-oriented programming (OOP) and functional programming paradigms.

7. Community Support: Python has a large, active community that continually contributes to its development and provides support via forums and documentation.

Common Use Cases of Python:


Web Development: Using frameworks like Django, Flask.

Data Science and Machine Learning: With libraries like Pandas, NumPy, Scikit-learn, TensorFlow.

Automation/Scripting: Writing scripts for automating tasks like file management, data scraping, etc.

Scientific Computing: Python is widely used in academic and research settings for simulations, data analysis, and scientific computation.

Python for Data Science


1. Pandas
Pandas is a powerful library for data manipulation and analysis in Python.

1.1 Advanced DataFrame Operations

Grouping and Aggregation

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': [1, 2, 3, 4, 5, 6],
                   'C': [2.0, 5., 8., 1., 2., 9.]})
grouped = df.groupby('A').agg({'B': 'sum', 'C': 'mean'})

Pivot Tables

pivoted = df.pivot_table(values='B', index='A', columns='C', aggfunc='sum')

Merging and Joining

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
merged = pd.merge(df1, df2, on='key')

1.2 Time Series Analysis

dates = pd.date_range('20230101', periods=6)
ts = pd.Series(range(6), index=dates)
resampled = ts.resample('2D').sum()

1.3 Handling Missing Data

import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': [9, 10, 11, 12]})
filled = df.fillna(method='ffill')
interpolated = df.interpolate()

2. NumPy
NumPy is fundamental for numerical computing in Python.

2.1 Advanced Array Operations

import numpy as np

# Broadcasting
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
result = a + b # b is broadcast to match a's shape

# Fancy indexing
x = np.arange(10)
indices = [2, 5, 8]
selected = x[indices]

2.2 Vectorization

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)
y = sigmoid(x)  # Vectorized operation

3. Scikit-learn
Scikit-learn is a machine learning library for Python.

3.1 Pipeline Creation

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Assumes X_train, y_train, X_test are already defined
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

3.2 Cross-Validation

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
scores = cross_val_score(rf, X, y, cv=5)

4. Data Visualization
4.1 Matplotlib

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'r-', label='Data')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()

4.2 Seaborn

import seaborn as sns

sns.set_style("whitegrid")
tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips)
plt.show()

5. Advanced Python Concepts for Data Science
5.1 List Comprehensions and Generator Expressions

# List comprehension
squares = [x**2 for x in range(10)]

# Generator expression
sum_of_squares = sum(x**2 for x in range(1000000))

5.2 Lambda Functions

df['new_column'] = df['old_column'].apply(lambda x: x*2 if x > 0 else x)

5.3 Map, Filter, and Reduce

from functools import reduce

numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))
product = reduce(lambda x, y: x * y, numbers)

These concepts and libraries form the core of Python's data science ecosystem,
providing powerful tools for data manipulation, analysis, and visualization.

Introduction to Google Colab


Google Colab (Colaboratory) is a free, cloud-based Jupyter notebook
environment that allows users to write and execute Python code in their browsers.
Colab is particularly useful for data science and machine learning projects due to
its ability to leverage powerful hardware like GPUs (Graphics Processing Units)
and TPUs (Tensor Processing Units) for computation.

Key Features of Google Colab:

1. Cloud-Based: No installation is required. Notebooks are stored in Google Drive, and you can access them from anywhere.

2. Free GPU/TPU Access: Colab provides free access to GPUs and TPUs, which are vital for high-performance tasks like deep learning.

3. Pre-installed Libraries: Colab comes with many popular libraries like TensorFlow, PyTorch, Pandas, NumPy, and Scikit-learn already installed.

4. Jupyter Notebook Interface: Colab uses the familiar Jupyter Notebook interface, allowing you to write, visualize, and execute Python code interactively.

5. Integration with Google Drive: You can save and load datasets and notebooks directly to and from Google Drive (a short mounting snippet follows the use cases below).

6. Collaboration: Similar to Google Docs, Colab supports real-time collaboration, enabling multiple users to work on the same notebook simultaneously.

7. Markdown and LaTeX Support: Colab allows for the inclusion of Markdown and LaTeX (for writing mathematical equations) alongside code.

Use Cases of Google Colab:


Data Science and Machine Learning: Due to its GPU and TPU support, Colab is commonly used for training machine learning models.

Collaborative Research: Colab’s real-time collaboration feature makes it suitable for teamwork and academic projects.

Educational Purposes: It's widely used by students and educators for learning Python and machine learning without the need for local installation.

Prototyping and Experimentation: Researchers and developers use Colab to quickly prototype and test machine learning models.
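As a small illustration of the Google Drive integration listed above, the snippet below mounts Drive inside a Colab notebook and reads a file from it; the CSV path is a hypothetical example.

from google.colab import drive
import pandas as pd

# Mount Google Drive into the Colab runtime (prompts for authorization)
drive.mount('/content/drive')

# Load a dataset stored in Drive (hypothetical file path)
df = pd.read_csv('/content/drive/MyDrive/datasets/example.csv')
print(df.head())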

Popular Dataset Repositories


Datasets are crucial for training, testing, and evaluating models in machine
learning and data science projects. Numerous repositories provide free access to
diverse datasets across various domains, such as healthcare, finance, image
recognition, and more. Here are some popular dataset repositories:

1. Kaggle Datasets:

Website: https://www.kaggle.com/datasets

Kaggle is one of the largest platforms for data science competitions and
also hosts a wide range of datasets. Users can search for datasets by
category, size, or application domain.

Popular Datasets:

Titanic Survival Dataset: A well-known dataset for learning data analysis and machine learning, focused on predicting the survival of passengers on the Titanic.

MNIST Dataset: A large dataset of handwritten digits commonly used for image classification.

COVID-19 Dataset: Datasets on COVID-19 cases and trends across countries, regions, and demographics.

2. UCI Machine Learning Repository:

Website: https://archive.ics.uci.edu/ml/index.php

The UCI Machine Learning Repository is a popular destination for publicly available datasets, widely used in machine learning research and education.

Popular Datasets:

Iris Dataset: A classic dataset in machine learning, used for classification problems involving flower species.

Wine Quality Dataset: Contains features related to wine composition and helps predict wine quality.

Adult Dataset: Used for income classification based on demographic attributes.

3. Google Dataset Search:

Website: https://datasetsearch.research.google.com/

Google’s Dataset Search allows users to find datasets across the web on different platforms. It indexes datasets from a variety of sources such as academic journals, governmental agencies, and open data platforms.

4. Data.gov:

Website: https://www.data.gov/

Data.gov is a U.S. government website that provides access to open datasets across various sectors such as agriculture, education, health, and public safety.

Popular Datasets:

US Census Data: Comprehensive demographic data about the U.S. population.

Crime Data: Data related to crimes across various U.S. cities and states.

Environmental Data: Contains data on climate change, water quality, and air pollution.

5. AWS Open Data Registry:

Website: https://registry.opendata.aws/

Amazon Web Services (AWS) hosts numerous open datasets for public
use, including datasets for satellite imagery, genomics, and machine
learning models.

Popular Datasets:

Amazon Reviews: A collection of product reviews from Amazon, useful for NLP tasks.

NOAA Weather Data: Weather-related datasets that include historical data and real-time monitoring.

SpaceNet Dataset: Satellite imagery datasets used for training models in geospatial analysis and computer vision.

Discussion on Some Datasets:


1. Titanic Dataset:

Description: The Titanic dataset contains information on passengers who
were aboard the Titanic when it sank. It includes features such as age,
sex, class, fare, and whether they survived.

Use Case: It is commonly used to teach binary classification algorithms. The goal is to predict whether a passenger survived or not based on the given features.

Analysis: With techniques like logistic regression or decision trees, you can predict passenger survival probability, visualizing patterns in demographics and survival.

2. MNIST Dataset:

Description: MNIST is a collection of 70,000 images of handwritten digits, where each image is labeled with the corresponding digit.

Use Case: It is a benchmark dataset for image classification algorithms, particularly in deep learning. It is widely used to test convolutional neural networks (CNNs).

Analysis: The dataset is preprocessed and allows researchers to focus on experimenting with different machine learning models. CNNs usually achieve high accuracy rates on this dataset.

3. Iris Dataset:

Description: The Iris dataset includes features such as petal length, petal
width, sepal length, and sepal width for three species of Iris flowers.

Use Case: It is widely used for supervised learning tasks like classification. The goal is to predict the species of the Iris flower based on its features.

Analysis: With this dataset, techniques like k-Nearest Neighbors (k-NN) or Support Vector Machines (SVM) can be applied to classify the flower species.

4. Wine Quality Dataset:

Description: The dataset contains chemical features of different wine samples, such as acidity, sugar content, and pH, along with their quality score (ranging from 0 to 10).

Use Case: It is used for regression problems, where the goal is to predict
the wine quality based on its features.

Analysis: Various regression techniques such as linear regression or decision trees can be applied to predict wine quality and study how different factors contribute to the overall quality.

In summary, Python and Google Colab are essential tools for data scientists,
offering powerful features for data analysis, machine learning, and scientific
computing. Popular dataset repositories like Kaggle, UCI, and Data.gov provide
valuable datasets that are commonly used for academic, research, and
commercial purposes. Understanding and analyzing these datasets is a critical
skill in data science.

Data Pre-processing
Data pre-processing is a critical step in the data analysis and machine learning
pipeline. It involves preparing raw data to make it suitable for further analysis or
model training. The quality of the data can significantly influence the performance
of machine learning models. Data pre-processing helps in handling missing
values, removing noise, scaling, transforming, and integrating data from multiple
sources.
Key steps in data pre-processing include:

1. Data Cleaning: Handling missing data, noise, and inconsistencies.

2. Data Integration: Combining data from multiple sources into a unified dataset.

3. Data Transformation: Scaling, normalizing, and converting data types to ensure uniformity.

4. Data Reduction: Reducing the volume of data to make analysis more efficient without losing important information.

5. Data Discretization: Converting continuous data into discrete intervals for certain algorithms that require categorical data (a short discretization sketch follows the data-cleaning example below).

Example: If you have a dataset with missing values, you can fill them using the
mean, median, or mode of the available data (imputation). Alternatively, rows with
missing values can be removed if they are not critical.

Python Example: Data Cleaning (Handling Missing Values)

import pandas as pd
import numpy as np

# Sample dataset
data = {'Age': [25, 30, np.nan, 22, np.nan],
        'Salary': [50000, 54000, np.nan, 42000, 60000]}
df = pd.DataFrame(data)

# Fill missing values with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)
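To illustrate the data discretization step listed earlier, the sketch below bins a continuous Age column into labeled intervals with pandas; the bin edges and labels are illustrative choices.

import pandas as pd

ages = pd.Series([25, 30, 27, 22, 41, 55, 63])

# Discretize continuous ages into three labeled bins (illustrative edges)
age_groups = pd.cut(ages, bins=[0, 30, 50, 100], labels=['Young', 'Middle-aged', 'Senior'])
print(age_groups)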

Data Scales
Data can exist on different scales, which determine the type of statistical analysis
and machine learning techniques applicable to it. Understanding data scales is
vital for selecting the right methods for data processing.

1. Nominal Scale:

This is a categorical scale where values represent categories without any order or ranking.

Example: Gender (Male, Female), colors (Red, Blue, Green).

Operations: Count, Mode.

2. Ordinal Scale:

This scale represents categories that have a meaningful order but no precise difference between values.

Example: Ratings (Excellent, Good, Fair, Poor), ranking in a race (1st, 2nd,
3rd).

Operations: Median, Percentile.

3. Interval Scale:

In this scale, the intervals between values are meaningful, but there is no
true zero point. Differences are consistent.

Example: Temperature in Celsius or Fahrenheit, dates on a calendar.

Operations: Addition, Subtraction, Mean, Standard Deviation.

4. Ratio Scale:

This scale has all the characteristics of the interval scale, with a true zero
point that indicates the absence of the quantity being measured.

Example: Height, weight, age, income.

Operations: Multiplication, Division.

Python Example: Encoding Ordinal Data

from sklearn.preprocessing import OrdinalEncoder

# Example of ordinal data: education levels
education_levels = [['High School'], ['Bachelor'], ['Master'], ['PhD']]

# Ordinal encoding
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded_education = encoder.fit_transform(education_levels)
print(encoded_education)
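For nominal data (categories with no inherent order, such as the color example above), ordinal codes would impose a spurious ranking, so one-hot encoding is commonly used instead. A minimal sketch with pandas, using made-up values:

import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue']})

# One-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(colors, columns=['Color'])
print(one_hot)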

Similarity and Dissimilarity Measures


Similarity and dissimilarity measures are used to quantify how similar or different
two data points (or sets of data) are. These measures are critical for tasks such as
clustering, classification, and recommendation systems.

Python Example: Cosine and Euclidean Similarity

from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

# Example vectors
vector_a = [1, 0, -1]
vector_b = [0, 1, 0]

# Cosine similarity
cos_sim = cosine_similarity([vector_a], [vector_b])
print("Cosine Similarity:", cos_sim)

# Euclidean distance
euc_dist = euclidean(vector_a, vector_b)
print("Euclidean Distance:", euc_dist)

Sampling and Quantization of Data


Sampling:
Sampling refers to the process of selecting a subset of data from a larger dataset.
It’s particularly important when working with large datasets, as it allows for faster
computation and analysis.

1. Random Sampling: Each data point has an equal probability of being selected.

2. Stratified Sampling: The population is divided into homogeneous subgroups (strata), and samples are taken from each subgroup proportionally (see the sketch right after this list).

3. Systematic Sampling: Data points are selected at regular intervals from the dataset.
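A minimal sketch of the stratified sampling idea, using scikit-learn's train_test_split with the stratify argument; the imbalanced labels are made up for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with an imbalanced binary label (8 zeros, 2 ones)
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Stratified sampling: the 50% sample keeps roughly the same class proportions as y
_, X_sample, _, y_sample = train_test_split(X, y, test_size=0.5, stratify=y, random_state=42)
print("Sampled labels:", y_sample)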

Quantization:
Quantization involves converting continuous data into discrete values or levels.

1. Scalar Quantization: Converts continuous variables into discrete values by mapping them to quantization intervals.

Python Example: Random Sampling and Quantization

import numpy as np

# Random sampling
data = np.arange(1, 101)
sample = np.random.choice(data, size=10, replace=False)
print("Random Sample:", sample)

# Quantization (Bin data into 5 levels)
quantized_data = np.digitize(data, bins=[20, 40, 60, 80])
print("Quantized Data:", quantized_data)

Filtering
Filtering is a technique used to remove or reduce noise from a dataset. It is an
essential step in data pre-processing, especially in signal processing and time-
series data. The goal is to smooth the data or remove outliers that can skew the
results of your analysis.

1. Moving Average Filter: Averages the data points over a sliding window,
helping to smooth out short-term fluctuations.

2. Median Filter: Replaces each data point with the median of neighboring
points, often used for outlier removal.

Python Example: Moving Average and Median Filter

import numpy as np
import pandas as pd
from scipy.ndimage import median_filter

# Sample time-series data
data = pd.Series([10, 12, 11, 13, 20, 15, 14, 13, 15, 18, 19, 25])

# Moving average filter (window size = 3)
moving_avg = data.rolling(window=3).mean()
print("Moving Average Filter:\n", moving_avg)

# Median filter
median_filt = pd.Series(median_filter(data, size=3))
print("Median Filter:\n", median_filt)

Data Transformation
Data transformation is the process of converting data into a format suitable for
analysis. This can involve scaling, normalizing, encoding categorical data, or
transforming features to reduce skewness.

1. Normalization: Rescaling data to a range of [0, 1].

2. Standardization: Scaling data so that it has a mean of 0 and a standard deviation of 1 (a short sketch follows the example below).

3. Logarithmic Transformation: Useful for handling skewed data by applying a logarithmic function.

Python Example: Data Normalization and Log Transformation

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Normalization (Min-Max scaling)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print("Normalized Data:\n", normalized_data)

# Log transformation
log_transformed = np.log(data + 1)
print("Log Transformed Data:\n", log_transformed)

Data Merging
Data merging involves combining two or more datasets into a single dataset based
on a common attribute or key. Common merging operations include:

1. Concatenation: Appending datasets along rows or columns (a short concatenation sketch follows the merge example below).

2. Joining: Merging datasets based on a key (like SQL joins: inner, left, right, and outer).

Python Example: Merging DataFrames

import pandas as pd

# Sample data
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 75]})

# Merge (Inner Join)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("Merged Data (Inner Join):\n", merged_df)

Data Visualization
Data visualization is a key aspect of data analysis as it helps to understand
patterns, trends, and relationships in the data. Common visualization techniques
include:

1. Line Plot: Useful for time-series data.

2. Bar Plot: Displays categorical data.

3. Histogram: Shows the distribution of continuous data.

4. Scatter Plot: Shows relationships between two variables.

Python Example: Basic Data Visualization using matplotlib

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Height': [150, 160, 170, 180, 190],
    'Weight': [50, 60, 70, 80, 90]
})

# Scatter plot for Height vs. Weight
plt.scatter(data['Height'], data['Weight'])
plt.title('Height vs Weight')
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()

# Histogram for Weight distribution
plt.hist(data['Weight'], bins=5)
plt.title('Weight Distribution')
plt.xlabel('Weight')
plt.ylabel('Frequency')
plt.show()

Principal Component Analysis (PCA)


PCA is a dimensionality reduction technique used to reduce the number of
variables in a dataset while retaining most of the variation in the data. PCA
transforms the data into a new set of orthogonal components that capture the
variance of the data.
Steps in PCA:

1. Standardize the data.

2. Compute the covariance matrix.

3. Compute the eigenvectors and eigenvalues of the covariance matrix.

4. Project the data onto the principal components.

Python Example: PCA in Python

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Applying PCA
pca = PCA(n_components=1)  # Reducing to 1 principal component
data_pca = pca.fit_transform(data_scaled)
print("PCA Transformed Data:\n", data_pca)

Correlation
Correlation measures the strength and direction of a linear relationship between
two variables. It ranges from -1 to 1:

+1: Perfect positive correlation

0: No correlation

-1: Perfect negative correlation

Common correlation coefficients:

1. Pearson Correlation: Measures linear correlation between continuous variables.

2. Spearman Correlation: Measures monotonic relationships (used for ordinal data); a short sketch follows the Pearson example below.

Python Example: Calculating Correlation

import pandas as pd

# Sample data

data = pd.DataFrame({
'X': [1, 2, 3, 4, 5],
'Y': [2, 4, 6, 8, 10]
})

# Pearson correlation
correlation = data.corr(method='pearson')
print("Pearson Correlation:\\n", correlation)

Chi-Square Test
The Chi-Square test is used to determine if there is a significant association
between two categorical variables. It compares the observed frequencies with the
expected frequencies to test for independence.

Python Example: Chi-Square Test

import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table for two categorical variables
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
    'Purchased': ['Yes', 'No', 'Yes', 'Yes', 'No']
})

# Create contingency table
contingency_table = pd.crosstab(data['Gender'], data['Purchased'])

# Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2}, p-value: {p}")

Summary
Filtering: Smooths and cleans data using techniques like moving average and
median filters.

Data Transformation: Rescales, normalizes, or logs data to prepare it for analysis.

Data Merging: Combines datasets using joins or concatenation.

Data Visualization: Visualizes data trends using plots like scatter plots,
histograms, and bar charts.

PCA: Reduces dimensionality by projecting data onto principal components.

Correlation: Measures the linear relationship between variables.

Chi-Square Test: Tests the association between two categorical variables.

All these concepts are critical to understanding how to process, analyze, and draw
insights from data, and Python provides powerful libraries like pandas, numpy, and matplotlib to handle these tasks.

Unit 2
Regression Analysis
Regression analysis is a statistical technique used to model and analyze the relationship between a dependent variable (target) and one or more independent variables (features). The goal of regression is to predict or explain the dependent variable based on the given independent variables.
Types of regression analysis:

1. Linear Regression: Models a linear relationship between the dependent and independent variables.

2. Generalized Linear Models (GLM): Extends linear regression to model non-normal data (e.g., logistic regression for binary outcomes).

3. Regularized Regression: Enhances linear regression by adding penalty terms to control overfitting, such as Ridge and Lasso regression.

Linear Regression

The objective of linear regression is to minimize the sum of squared residuals between the actual and predicted values of y.
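Written out for a single feature (standard notation, stated here for completeness), the model and the least-squares objective are:

\[
y = \beta_0 + \beta_1 x + \varepsilon, \qquad \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2
\]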

Python Example: Simple Linear Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample data (simple linear relationship)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predicting on test data
y_pred = model.predict(X_test)

# Plotting the regression line
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title('Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

# Model parameters
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Generalized Linear Models (GLM)


Generalized Linear Models (GLMs) extend linear regression to handle non-normal
response distributions. In GLMs, the relationship between the independent
variables \( X \) and the dependent variable \( y \) is modeled through a link
function, which connects the linear predictor to the mean of the distribution.
Common types of GLMs:

1. Logistic Regression: For binary classification outcomes, using the logit link function.

2. Poisson Regression: For count data, using the log link function (a short sketch follows the logistic regression example below).

Python Example: Logistic Regression

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data for binary classification
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])  # Binary outcomes

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predicting on test data
y_pred = log_reg.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Probability of class 1
print("Predicted probabilities:\n", log_reg.predict_proba(X_test))
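The example above covers logistic regression; for the Poisson case listed earlier, here is a minimal sketch using scikit-learn's PoissonRegressor (the count data is made up for illustration):

import numpy as np
from sklearn.linear_model import PoissonRegressor

# Illustrative count data: y holds non-negative event counts
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 2, 4, 6])

# Poisson regression uses a log link, so predicted counts are always non-negative
poisson_reg = PoissonRegressor()  # includes a small L2 penalty by default
poisson_reg.fit(X, y)
print("Predicted counts:", poisson_reg.predict(X))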

Regularized Regression
Regularized regression methods help prevent overfitting by adding a penalty term
to the loss function in the linear regression model. The most common forms of
regularized regression are:

1. Ridge Regression: Adds an L2 penalty (the sum of squared coefficients) to the loss function.

2. Lasso Regression: Adds an L1 penalty (the sum of absolute coefficient values), which promotes sparsity.

3. Elastic Net Regression: A combination of L1 (Lasso) and L2 (Ridge) penalties (a short sketch follows the Ridge and Lasso example below).

Python Example: Ridge and Lasso Regression

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ridge Regression (L2 regularization)
ridge_reg = Ridge(alpha=1.0)  # Alpha is the regularization strength
ridge_reg.fit(X_train, y_train)
y_pred_ridge = ridge_reg.predict(X_test)
print("Ridge Regression Predictions:", y_pred_ridge)

# Lasso Regression (L1 regularization)
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)
y_pred_lasso = lasso_reg.predict(X_test)
print("Lasso Regression Predictions:", y_pred_lasso)

Summary of Key Concepts


1. Linear Regression:

Assumes a linear relationship between the dependent and independent variables.

Useful for predicting continuous outcomes.

2. Generalized Linear Models (GLMs):

Extends linear regression to non-normal data distributions.

Common types include logistic regression (for binary classification) and
Poisson regression (for count data).

3. Regularized Regression:

Helps prevent overfitting by adding penalty terms to the loss function.

Ridge (L2) adds squared coefficients as a penalty.

Lasso (L1) adds absolute values of coefficients as a penalty, promoting sparsity.

Elastic Net combines both L1 and L2 regularization.

These techniques are fundamental in machine learning and statistical modeling for
solving various prediction and classification problems.

Cross-Validation
Cross-validation is a model evaluation technique that helps assess how well a
machine learning model will generalize to unseen data. Instead of splitting the
dataset into just training and testing sets, cross-validation divides the data into
multiple subsets (folds) and trains the model multiple times, each time using a
different subset for validation and the rest for training.

Types of Cross-Validation:
1. K-Fold Cross-Validation: The data is split into k equal-sized subsets (folds).
The model is trained k times, each time using k-1 folds for training and the
remaining fold for validation. The final result is the average of the results from
the k iterations.

2. Stratified K-Fold: Similar to K-Fold, but ensures each fold has a representative
proportion of classes for classification tasks.

3. Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold where k is equal to the number of samples, i.e., each sample gets used as a validation set once.

Python Example: K-Fold Cross-Validation

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# KFold Cross-Validation
kf = KFold(n_splits=3)
model = LinearRegression()

# Cross-validation scores (R-squared)
scores = cross_val_score(model, X, y, cv=kf)
print("Cross-validation scores:", scores)
print("Average R-squared score:", np.mean(scores))

Training and Testing Data Set


In machine learning, it is crucial to evaluate model performance on data that was
not used during the training phase. The dataset is typically divided into two parts:

1. Training Set: Used to train the machine learning model. The model learns the
relationships between the input features and the target variable.

2. Testing Set: Used to evaluate the model's performance on unseen data. The
testing set is used to assess how well the model generalizes to new, unseen
examples.

Splitting the dataset is typically done in a ratio, such as 70% for training and 30%
for testing. In cases where the dataset is large, an additional validation set may
also be used for hyperparameter tuning.

Python Example: Train-Test Split

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Split data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction and evaluation on the test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error on Test Set:", mse)

Overview of Nonlinear Regression


Nonlinear regression is used when the relationship between the dependent
variable and one or more independent variables is not linear. Unlike linear
regression, nonlinear regression fits a nonlinear function (e.g., polynomial,
exponential, logarithmic) to the data.

Python Example: Nonlinear Regression (Polynomial Regression)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Sample nonlinear data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])  # Quadratic relationship

# Polynomial transformation (degree = 2)
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

# Predict on new data
y_pred = poly_model.predict(X)

# Plotting the results
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.title('Nonlinear Regression (Polynomial)')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

print("Predicted values:\n", y_pred)

Overview of Ridge Regression


Ridge regression is a type of regularized linear regression that adds an L2 penalty
term to the cost function. This penalty term helps to shrink the coefficients and
prevents overfitting by discouraging the model from fitting the training data too
closely.
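In formula form (standard notation), the ridge objective adds the squared coefficients, scaled by the regularization strength \( \alpha \), to the least-squares loss:

\[
\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^\top \beta \right)^2 + \alpha \sum_{j=1}^{p} \beta_j^2
\]

Here \( \alpha \) corresponds to the alpha parameter used in the Ridge example below.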

Advantages:
Reduces model complexity and prevents overfitting.

Can handle multicollinearity (when independent variables are highly correlated).

Python Example: Ridge Regression

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Ridge Regression (alpha = regularization strength)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)

# Predict on the same data
y_pred = ridge_reg.predict(X)
mse_ridge = mean_squared_error(y, y_pred)

# Results
print("Ridge Regression Predictions:", y_pred)
print("Mean Squared Error:", mse_ridge)

Summary of Key Concepts


1. Cross-Validation:

Helps assess the model's performance by splitting the dataset into multiple subsets.

K-Fold Cross-Validation is a popular method where the data is divided into k folds and the model is trained k times.

2. Training and Testing Data Set:

Data is typically split into training and testing sets.

The training set is used to train the model, and the test set is used to
evaluate performance on unseen data.

3. Nonlinear Regression:

Used when the relationship between the dependent and independent variables is not linear.

Polynomial regression is a common example of nonlinear regression.

4. Ridge Regression:

A type of regularized linear regression that adds an L2 penalty term.

Helps reduce overfitting by shrinking the coefficients.

By understanding and implementing these regression techniques, you can better
model complex data relationships and create more robust predictive models.

Latent Variables
Latent variables are variables that are not directly observed but are inferred or
estimated from other observed variables. They are commonly used in fields such
as psychology, social sciences, and econometrics to represent abstract concepts
like intelligence, socioeconomic status, or customer satisfaction, which are not
directly measurable.

Examples:
Customer Satisfaction: Latent variables might include satisfaction or loyalty,
which are inferred from responses to survey questions.

Intelligence: Inferred from various measurable cognitive tests, but intelligence itself is a latent variable.

Latent variables are often modeled using factor analysis or structural equation
modeling (SEM).

Structural Equation Modeling (SEM)


Structural Equation Modeling (SEM) is a statistical technique that combines
elements of factor analysis and multiple regression to examine complex
relationships between observed and latent variables. SEM allows researchers to
model relationships between:

1. Observed Variables: Measured directly (e.g., responses to a questionnaire).

2. Latent Variables: Inferred from observed variables (e.g., abstract traits like
"satisfaction").

3. Structural Relations: The cause-and-effect relationships between variables.

Key Components of SEM:


1. Measurement Model: Specifies how latent variables are measured by the
observed variables (similar to factor analysis).

2. Structural Model: Specifies the relationships between latent variables (similar
to regression).

SEM is represented visually using path diagrams, where:

Squares represent observed variables.

Circles represent latent variables.

Arrows represent the relationships between variables (unidirectional arrows for causal effects and bidirectional for correlations).

Python Libraries for SEM:


1. semopy : A Python library used to build and estimate SEM models.

2. statsmodels : For factor analysis.

Python Example: Factor Analysis (Latent Variable)


Factor analysis can be used to extract latent variables from a dataset of observed
variables. Below is an example of how to perform factor analysis to extract latent
factors using the factor_analyzer library.

Factor Analysis Example (Latent Variables Extraction)

import pandas as pd
from factor_analyzer import FactorAnalyzer

# Example dataset (observed variables)
data = {
    'Q1': [4, 5, 6, 7, 8],
    'Q2': [2, 4, 5, 6, 7],
    'Q3': [3, 5, 6, 7, 8],
    'Q4': [1, 3, 4, 6, 7]
}
df = pd.DataFrame(data)

# Perform factor analysis to extract latent variables
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(df)

# Get the factor loadings (how much each observed variable contributes to the latent variable)
factor_loadings = fa.loadings_
print("Factor Loadings:\n", factor_loadings)

# Get the estimated latent variable scores for each observation
latent_variable_scores = fa.transform(df)
print("Latent Variable Scores:\n", latent_variable_scores)

In this example, we assume that the observed variables (e.g., survey questions Q1
to Q4) are used to estimate a single latent factor.

SEM Example Using semopy


To perform SEM in Python, we can use the semopy library, which provides tools for
estimating structural equation models.

Structural Equation Model Example:


In this example, we'll specify an SEM model where latent variable "Satisfaction" is
inferred from observed variables (survey questions), and it impacts another latent
variable "Loyalty".

# Install semopy library first if not installed:
# pip install semopy

import pandas as pd
from semopy import Model

# Example dataset (observed variables)
data = {
    'Q1': [4, 5, 6, 7, 8],
    'Q2': [2, 4, 5, 6, 7],
    'Q3': [3, 5, 6, 7, 8],
    'L1': [3, 4, 5, 6, 7],
    'L2': [4, 5, 6, 6, 8],
    'L3': [5, 6, 7, 7, 9]
}
df = pd.DataFrame(data)

# Define the SEM model
model_desc = """
# Latent variables
Satisfaction =~ Q1 + Q2 + Q3
Loyalty =~ L1 + L2 + L3

# Structural paths
Loyalty ~ Satisfaction
"""

# Build and fit the SEM model (semopy's current API fits directly from the Model object)
model = Model(model_desc)
model.fit(df)

# Print model parameters (factor loadings, path coefficients)
print(model.inspect())

Explanation:
Satisfaction =~ Q1 + Q2 + Q3 : This line specifies that the latent variable
"Satisfaction" is inferred from the observed variables Q1, Q2, and Q3.

Loyalty =~ L1 + L2 + L3 : Similarly, the latent variable "Loyalty" is inferred from L1, L2, and L3.

Loyalty ~ Satisfaction : This defines a structural path where "Loyalty" is influenced by "Satisfaction".

Summary of Key Concepts

1. Latent Variables: These are abstract variables that are not directly observed
but are inferred from other measured variables. Latent variables are commonly
used to represent unobservable constructs like intelligence, satisfaction, or
economic status.

2. Structural Equation Modeling (SEM): A powerful statistical method for examining relationships between observed and latent variables. SEM combines factor analysis and regression to test complex relationships between variables in a single model.

3. Python Code Examples:

Factor analysis can be used to extract latent variables.

semopy is a Python library that facilitates building and optimizing structural equation models.

By using SEM and latent variables, we can model complex relationships in datasets that involve unobservable concepts, leading to better understanding and prediction in fields such as social sciences, marketing, and psychology.
