Data Science
https://yashnote.notion.site/Data-Science-1180e70e8a0f80bbbfa2fdee5d1f1d85?pvs=4
Unit 1
Introduction to Data Science
Difference among AI, Machine Learning, and Data Science
Comparison of AI, ML, and Data Science:
Basic Introduction of Python
Key Features of Python:
Common Use Cases of Python:
Python for Data Science
1. Pandas
2. NumPy
3. Scikit-learn
4. Data Visualization
5. Advanced Python Concepts for Data Science
Introduction to Google Colab
Key Features of Google Colab:
Use Cases of Google Colab:
Popular Dataset Repositories
Discussion on Some Datasets:
Data Pre-processing
Python Example: Data Cleaning (Handling Missing Values)
Data Scales
Python Example: Encoding Ordinal Data
Similarity and Dissimilarity Measures
Python Example: Cosine and Euclidean Similarity
Sampling and Quantization of Data
Sampling:
Quantization:
Python Example: Random Sampling and Quantization
Filtering
Python Example: Moving Average and Median Filter
Data Transformation
Python Example: Data Normalization and Log Transformation
Data Merging
Python Example: Merging DataFrames
Data Visualization
Python Example: Basic Data Visualization using matplotlib
Principal Component Analysis (PCA)
Python Example: PCA in Python
Correlation
Python Example: Calculating Correlation
Chi-Square Test
Python Example: Chi-Square Test
Summary
Unit 2
Regression Analysis
Linear Regression
Python Example: Simple Linear Regression
Generalized Linear Models (GLM)
Python Example: Logistic Regression
Regularized Regression
Python Example: Ridge and Lasso Regression
Summary of Key Concepts
Cross-Validation
Types of Cross-Validation:
Python Example: K-Fold Cross-Validation
Training and Testing Data Set
Python Example: Train-Test Split
Overview of Nonlinear Regression
Python Example: Nonlinear Regression (Polynomial Regression)
Overview of Ridge Regression
Advantages:
Python Example: Ridge Regression
Summary of Key Concepts
Latent Variables
Examples:
Structural Equation Modeling (SEM)
Key Components of SEM:
Python Libraries for SEM:
Python Example: Factor Analysis (Latent Variable)
Factor Analysis Example (Latent Variables Extraction)
SEM Example Using semopy
Structural Equation Model Example:
Explanation:
Summary of Key Concepts
Unit 1
Introduction to Data Science
Data Science is a multidisciplinary field that combines statistics, computer
science, mathematics, and domain-specific knowledge to extract insights and
knowledge from structured and unstructured data. Data Science applies scientific
methods, processes, algorithms, and systems to analyze vast amounts of data
and generate actionable insights. In today's world, where data is generated in
massive volumes from various sources such as social media, business
transactions, IoT devices, etc., Data Science plays a critical role in making sense
of that data.
Key Aspects of Data Science:
1. Data Collection: Gathering data from various sources (web scraping, APIs,
surveys, sensors, etc.).
1. Mathematics and Statistics: Understanding of concepts like probability,
distributions, hypothesis testing, linear algebra, etc.
4. Data Wrangling and Cleaning: Ability to preprocess data, handle missing data,
and deal with data inconsistencies.
subfields like Natural Language Processing (NLP), computer vision, robotics, and
more.
Key Points:
Types of AI:
Narrow AI: AI systems designed for specific tasks (e.g., Siri, Alexa,
recommendation engines).
Examples:
Key Points:
Goal of ML: To enable machines to learn from data and improve with
experience.
Types of ML:
Reinforcement Learning: The model learns through trial and error to
maximize rewards.
Examples:
Data Science:
Data Science is a more comprehensive field that integrates AI, ML, and other tools
to work with data in various forms. It focuses on extracting insights and
knowledge from data using a mix of statistics, algorithms, and domain knowledge.
While AI and ML are tools used in Data Science, Data Science is concerned with
the entire data lifecycle from collection to insight generation.
Key Points:
Goal of Data Science: To extract actionable insights from large datasets using
a mix of techniques.
Scope: Data Science includes AI, ML, and various other techniques like data
mining and business intelligence.
Examples:
2. Interpreted Language: Python code is executed line by line, which allows for
interactive debugging.
Extensive Libraries: Python offers a rich ecosystem of libraries (NumPy for numerical
computations, Pandas for data manipulation, Matplotlib for data visualization, etc.).
Common Use Cases of Python:
Data Science and Machine Learning: With libraries like Pandas, NumPy,
Scikit-learn, TensorFlow.
Python for Data Science
1. Pandas
import pandas as pd

# Sample DataFrame: group by 'A', then sum 'B' and average 'C'
df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3], 'C': [4.0, 5.0, 6.0]})
grouped = df.groupby('A').agg({'B': 'sum', 'C': 'mean'})
Pivot Tables
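A quick sketch of a pivot table, reusing the illustrative df from the group-by example above: one column is summarized for each combination of two keys.

# Pivot table: rows = values of 'A', columns = values of 'B', cells = mean of 'C'
pivot = df.pivot_table(values='C', index='A', columns='B', aggfunc='mean')
print(pivot)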
2. NumPy
NumPy is fundamental for numerical computing in Python.
import numpy as np
# Broadcasting
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
result = a + b # b is broadcast to match a's shape
# Fancy indexing
x = np.arange(10)
indices = [2, 5, 8]
selected = x[indices]
2.2 Vectorization
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
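A quick check of the vectorized call on a NumPy array (values chosen for illustration): the function is applied element-wise without an explicit Python loop.

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # element-wise sigmoid of all three values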
3. Scikit-learn
Scikit-learn is a machine learning library for Python.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X_train, y_train, X_test are assumed to come from an earlier train/test split
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
3.2 Cross-Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y are assumed to be defined (features and labels)
rf = RandomForestClassifier()
scores = cross_val_score(rf, X, y, cv=5)
4. Data Visualization
4.1 Matplotlib
import numpy as np
import matplotlib.pyplot as plt

# Sample data for the plot
x = np.linspace(0, 10, 50)
y = np.sin(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y, 'r-', label='Data')
plt.title('Sample Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()
4.2 Seaborn
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("whitegrid")
tips = sns.load_dataset("tips")
sns.scatterplot(x="total_bill", y="tip", hue="time", data=tips)
plt.show()
5. Advanced Python Concepts for Data Science
5.1 List Comprehensions and Generator Expressions
# List comprehension
squares = [x**2 for x in range(10)]
# Generator expression
sum_of_squares = sum(x**2 for x in range(1000000))
from functools import reduce

numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))         # [1, 4, 9, 16, 25]
evens = list(filter(lambda x: x % 2 == 0, numbers))  # [2, 4]
product = reduce(lambda x, y: x * y, numbers)        # 120
These concepts and libraries form the core of Python's data science ecosystem,
providing powerful tools for data manipulation, analysis, and visualization.
Introduction to Google Colab
Key Features of Google Colab:
1. Cloud-Based: No installation is required. Notebooks are stored in Google
Drive, and you can access them from anywhere.
2. Free GPU/TPU Access: Colab provides free access to GPUs and TPUs, which
are vital for high-performance tasks like deep learning.
5. Integration with Google Drive: You can save and load datasets and notebooks
directly to and from Google Drive.
7. Markdown and LaTeX Support: Colab allows for the inclusion of Markdown
and LaTeX (for writing mathematical equations) alongside code.
Use Cases of Google Colab:
Educational Purposes: It's widely used by students and educators for learning
Python and machine learning without the need for local installation.
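The Google Drive integration mentioned in the feature list above is used from a notebook with the standard mounting snippet (the mount path below is the conventional one):

from google.colab import drive

# Makes your Drive files available under /content/drive inside the Colab VM
drive.mount('/content/drive')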
Popular Dataset Repositories
1. Kaggle Datasets:
Website: https://www.kaggle.com/datasets
Kaggle is one of the largest platforms for data science competitions and
also hosts a wide range of datasets. Users can search for datasets by
category, size, or application domain.
Popular Datasets:
2. UCI Machine Learning Repository:
Website: https://archive.ics.uci.edu/ml/index.php
Popular Datasets:
3. Google Dataset Search:
Website: https://datasetsearch.research.google.com/
Google’s Dataset Search allows users to find datasets across the web on
different platforms. It indexes datasets from a variety of sources such as
academic journals, governmental agencies, and open data platforms.
4. Data.gov:
Website: https://www.data.gov/
Popular Datasets:
Crime Data: Data related to crimes across various U.S. cities and
states.
5. Registry of Open Data on AWS:
Website: https://registry.opendata.aws/
Amazon Web Services (AWS) hosts numerous open datasets for public
use, including datasets for satellite imagery, genomics, and machine
learning models.
Popular Datasets:
1. Titanic Dataset:
Description: The Titanic dataset contains information on passengers who
were aboard the Titanic when it sank. It includes features such as age,
sex, class, fare, and whether they survived.
2. MNIST Dataset:
Description: The MNIST dataset contains 70,000 grayscale images (28×28 pixels) of
handwritten digits from 0 to 9.
Use Case: It is commonly used for image classification tasks.
3. Iris Dataset:
Description: The Iris dataset includes features such as petal length, petal
width, sepal length, and sepal width for three species of Iris flowers.
4. Wine Quality Dataset:
Use Case: It is used for regression problems, where the goal is to predict
the wine quality based on its features.
In summary, Python and Google Colab are essential tools for data scientists,
offering powerful features for data analysis, machine learning, and scientific
computing. Popular dataset repositories like Kaggle, UCI, and Data.gov provide
valuable datasets that are commonly used for academic, research, and
commercial purposes. Understanding and analyzing these datasets is a critical
skill in data science.
Data Pre-processing
Data pre-processing is a critical step in the data analysis and machine learning
pipeline. It involves preparing raw data to make it suitable for further analysis or
model training. The quality of the data can significantly influence the performance
of machine learning models. Data pre-processing helps in handling missing
values, removing noise, scaling, transforming, and integrating data from multiple
sources.
Key steps in data pre-processing include:
2. Data Integration: Combining data from multiple sources into a unified dataset.
4. Data Reduction: Reducing the volume of data to make analysis more efficient
without losing important information.
Example: If you have a dataset with missing values, you can fill them using the
mean, median, or mode of the available data (imputation). Alternatively, rows with
missing values can be removed if they are not critical.
Python Example: Data Cleaning (Handling Missing Values)
import pandas as pd
import numpy as np

# Sample dataset with missing values
data = {'Age': [25, 30, np.nan, 22, np.nan],
        'Salary': [50000, 54000, np.nan, 42000, 60000]}
df = pd.DataFrame(data)

# Impute missing values with the column mean
df_filled = df.fillna(df.mean())
print(df_filled)
Data Scales
Data can exist on different scales, which determine the type of statistical analysis
and machine learning techniques applicable to it. Understanding data scales is
vital for selecting the right methods for data processing.
1. Nominal Scale:
2. Ordinal Scale:
Example: Ratings (Excellent, Good, Fair, Poor), ranking in a race (1st, 2nd,
3rd).
3. Interval Scale:
In this scale, the intervals between values are meaningful, but there is no
true zero point. Differences are consistent.
4. Ratio Scale:
This scale has all the characteristics of the interval scale, with a true zero
point that indicates the absence of the quantity being measured.
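As a small assumed sketch, nominal data (categories with no inherent order) is typically one-hot encoded, in contrast to the ordinal encoding shown next:

import pandas as pd

# Nominal data: categories without any order
colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
one_hot = pd.get_dummies(colors, columns=['Color'])
print(one_hot)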
Python Example: Encoding Ordinal Data
from sklearn.preprocessing import OrdinalEncoder

# Ordinal data: education levels with a natural order (illustrative values)
education_levels = [['Bachelor'], ['High School'], ['PhD'], ['Master']]

# Ordinal encoding
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded_education = encoder.fit_transform(education_levels)
print(encoded_education)
Similarity and Dissimilarity Measures
Similarity measures quantify how alike two data objects are (e.g., cosine similarity),
while dissimilarity measures quantify how different they are (e.g., Euclidean distance).
Python Example: Cosine and Euclidean Similarity
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

# Example vectors
vector_a = [1, 0, -1]
vector_b = [0, 1, 0]

# Cosine similarity
cos_sim = cosine_similarity([vector_a], [vector_b])
print("Cosine Similarity:", cos_sim)

# Euclidean distance
euc_dist = euclidean(vector_a, vector_b)
print("Euclidean Distance:", euc_dist)
Sampling and Quantization of Data
Sampling:
Sampling selects a subset of the data for analysis. Common strategies include:
1. Random Sampling: Each data point has an equal probability of being selected.
3. Systematic Sampling: Data points are selected at regular intervals from the
dataset.
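A minimal sketch of systematic sampling (step size chosen for illustration), which simply slices the data at regular intervals:

import numpy as np

data = np.arange(1, 101)
# Systematic sample: every 10th element starting from the first
systematic_sample = data[::10]
print("Systematic Sample:", systematic_sample)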
Quantization:
Quantization involves converting continuous data into discrete values or levels.
Python Example: Random Sampling and Quantization
import numpy as np

# Random sampling
data = np.arange(1, 101)
sample = np.random.choice(data, size=10, replace=False)
print("Random Sample:", sample)

# Quantization (bin data into 5 levels)
quantized_data = np.digitize(data, bins=[20, 40, 60, 80])
print("Quantized Data:", quantized_data)
Filtering
Filtering is a technique used to remove or reduce noise from a dataset. It is an
essential step in data pre-processing, especially in signal processing and time-
series data. The goal is to smooth the data or remove outliers that can skew the
results of your analysis.
1. Moving Average Filter: Averages the data points over a sliding window,
helping to smooth out short-term fluctuations.
2. Median Filter: Replaces each data point with the median of neighboring
points, often used for outlier removal.
Python Example: Moving Average and Median Filter
import numpy as np
import pandas as pd
from scipy.ndimage import median_filter

# Noisy sample data (illustrative)
data = np.array([1, 2, 2, 50, 3, 3, 4, 60, 4, 5])

# Moving average filter (window of 3)
moving_avg = pd.Series(data).rolling(window=3).mean()
print("Moving Average:\n", moving_avg)

# Median filter
median_filt = pd.Series(median_filter(data, size=3))
print("Median Filter:\n", median_filt)
Data Transformation
Data transformation is the process of converting data into a format suitable for
analysis. This can involve scaling, normalizing, encoding categorical data, or
transforming features to reduce skewness.
Python Example: Data Normalization and Log Transformation
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Min-max normalization (each feature scaled to [0, 1])
normalized = MinMaxScaler().fit_transform(data)
print("Normalized Data:\n", normalized)

# Log transformation (add 1 to avoid log(0))
log_transformed = np.log(data + 1)
print("Log Transformed Data:\n", log_transformed)
Data Merging
Data merging involves combining two or more datasets into a single dataset based
on a common attribute or key. Common merging operations include:
2. Joining: Merging datasets based on a key (like SQL joins: inner, left, right, and
outer).
Python Example: Merging DataFrames
import pandas as pd

# Sample data
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 75]})

# Inner join on the common key 'ID'
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)
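The join type is controlled by the how argument. As a quick sketch using the same df1 and df2, an outer join keeps the unmatched IDs from both frames:

# Outer join: rows with no match in the other frame get NaN values
outer = pd.merge(df1, df2, on='ID', how='outer')
print(outer)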
Data Visualization
Data visualization is a key aspect of data analysis as it helps to understand
patterns, trends, and relationships in the data. Common visualization techniques
include:
Python Example: Basic Data Visualization using matplotlib
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = pd.DataFrame({
    'Height': [150, 160, 170, 180, 190],
    'Weight': [50, 60, 70, 80, 90]
})

# Scatter plot of Weight against Height
plt.scatter(data['Height'], data['Weight'])
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that projects the data onto a small number of
uncorrelated components (the principal components) that capture most of the variance.
Python Example: PCA in Python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Sample data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Standardize the features before applying PCA
data_scaled = StandardScaler().fit_transform(data)

# Applying PCA
pca = PCA(n_components=1)  # reducing to 1 principal component
data_pca = pca.fit_transform(data_scaled)
print("PCA Transformed Data:\n", data_pca)
print("Explained variance ratio:", pca.explained_variance_ratio_)
Correlation
Correlation measures the strength and direction of a linear relationship between
two variables. It ranges from -1 to 1:
+1: Perfect positive linear relationship
-1: Perfect negative linear relationship
0: No correlation
Python Example: Calculating Correlation
import pandas as pd

# Sample data
data = pd.DataFrame({
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 4, 6, 8, 10]
})

# Pearson correlation
correlation = data.corr(method='pearson')
print("Pearson Correlation:\n", correlation)
Chi-Square Test
The Chi-Square test is used to determine if there is a significant association
between two categorical variables. It compares the observed frequencies with the
expected frequencies to test for independence.
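In standard notation, the statistic is χ² = Σ (O − E)² / E, where O is the observed frequency and E the expected frequency for each cell of the contingency table; a p-value below the chosen significance level (commonly 0.05) indicates a significant association.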
Python Example: Chi-Square Test
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of observed frequencies (illustrative values)
data = pd.DataFrame({
    'Likes': [20, 30],
    'Dislikes': [25, 25]
})

chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square statistic:", chi2)
print("p-value:", p)
Summary
Filtering: Smooths and cleans data using techniques like moving average and
median filters.
Data Visualization: Visualizes data trends using plots like scatter plots,
histograms, and bar charts.
All these concepts are critical to understanding how to process, analyze, and draw
insights from data, and Python provides powerful libraries like pandas , numpy , and
matplotlib to handle these tasks.
Unit 2
Regression Analysis
Regression analysis is a statistical technique used to model and analyze the
relationship between a dependent variable (target) and one or more independent
variables (features). The goal of regression is to predict or explain the dependent
variable based on the given independent variables.
Types of regression analysis:
Linear Regression
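Simple linear regression fits a straight line of the standard form y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ the slope (coefficient), and ε the error term; the example below estimates β₀ and β₁ with scikit-learn.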
Python Example: Simple Linear Regression
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (simple linear relationship)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Fit the model
model = LinearRegression()
model.fit(X, y)

# Model parameters
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
Generalized Linear Models (GLM)
GLMs extend linear regression to response variables with non-normal distributions by
modeling the response through a link function. Common types include:
1. Logistic Regression: For binary outcomes, using the logit link function.
2. Poisson Regression: For count data, using the log link function.
Python Example: Logistic Regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample binary-classification data (illustrative)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the model and predict on the test set
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Predicted class probabilities (second column is the probability of class 1)
print("Predicted probabilities:\n", log_reg.predict_proba(X_test))
Regularized Regression
Regularized regression methods help prevent overfitting by adding a penalty term
to the loss function in the linear regression model. The most common forms of
regularized regression are:
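Ridge Regression (L2 penalty): adds λ Σ βⱼ² to the least-squares loss, shrinking coefficients toward zero.
Lasso Regression (L1 penalty): adds λ Σ |βⱼ|, which can drive some coefficients exactly to zero and thus performs feature selection.
In both cases λ (called alpha in scikit-learn) controls the strength of the penalty.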
Python Example: Ridge and Lasso Regression
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Fit both models (alpha = regularization strength)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge coefficient:", ridge.coef_)
print("Lasso coefficient:", lasso.coef_)
Summary of Key Concepts
Common types include logistic regression (for binary classification) and
Poisson regression (for count data).
3. Regularized Regression:
These techniques are fundamental in machine learning and statistical modeling for
solving various prediction and classification problems.
Cross-Validation
Cross-validation is a model evaluation technique that helps assess how well a
machine learning model will generalize to unseen data. Instead of splitting the
dataset into just training and testing sets, cross-validation divides the data into
multiple subsets (folds) and trains the model multiple times, each time using a
different subset for validation and the rest for training.
Types of Cross-Validation:
1. K-Fold Cross-Validation: The data is split into k equal-sized subsets (folds).
The model is trained k times, each time using k-1 folds for training and the
remaining fold for validation. The final result is the average of the results from
the k iterations.
2. Stratified K-Fold: Similar to K-Fold, but ensures each fold has a representative
proportion of classes for classification tasks.
Python Example: K-Fold Cross-Validation
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# KFold Cross-Validation with 3 folds
kf = KFold(n_splits=3)
model = LinearRegression()

# Negative MSE is used because some folds contain a single test point
scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
print("Scores per fold:", scores)
print("Mean score:", scores.mean())
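For classification tasks, a minimal sketch of the stratified variant described above (with small illustrative data) looks like this:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# Illustrative classification data with a 50/50 class balance
X_cls = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each fold preserves the class proportions of the full dataset
skf = StratifiedKFold(n_splits=2)
scores = cross_val_score(LogisticRegression(), X_cls, y_cls, cv=skf)
print("Stratified CV accuracy per fold:", scores)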
Training and Testing Data Set
1. Training Set: Used to train the machine learning model. The model learns the
relationships between the input features and the target variable.
2. Testing Set: Used to evaluate the model's performance on unseen data. The
testing set is used to assess how well the model generalizes to new, unseen
examples.
Splitting the dataset is typically done in a ratio, such as 70% for training and 30%
for testing. In cases where the dataset is large, an additional validation set may
also be used for hyperparameter tuning.
Python Example: Train-Test Split
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# 70/30 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
Overview of Nonlinear Regression
Nonlinear regression models relationships that a straight line cannot capture; a common
approach is polynomial regression, which adds powers of the input features to a linear model.
Python Example: Nonlinear Regression (Polynomial Regression)
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Sample nonlinear data (illustrative, roughly quadratic)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])

# Degree-2 polynomial regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
y_pred = poly_model.predict(X)

print("Predicted values:\n", y_pred)
Overview of Ridge Regression
Advantages:
Reduces model complexity and prevents overfitting.
Python Example: Ridge Regression
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Ridge Regression (alpha = regularization strength)
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X, y)
y_pred = ridge_reg.predict(X)
mse_ridge = mean_squared_error(y, y_pred)

# Results
print("Ridge Regression Predictions:", y_pred)
print("Mean Squared Error:", mse_ridge)
Summary of Key Concepts
The training set is used to train the model, and the test set is used to
evaluate performance on unseen data.
3. Nonlinear Regression:
4. Ridge Regression:
By understanding and implementing these regression techniques, you can better
model complex data relationships and create more robust predictive models.
Latent Variables
Latent variables are variables that are not directly observed but are inferred or
estimated from other observed variables. They are commonly used in fields such
as psychology, social sciences, and econometrics to represent abstract concepts
like intelligence, socioeconomic status, or customer satisfaction, which are not
directly measurable.
Examples:
Customer Satisfaction: Latent variables might include satisfaction or loyalty,
which are inferred from responses to survey questions.
Latent variables are often modeled using factor analysis or structural equation
modeling (SEM).
2. Latent Variables: Inferred from observed variables (e.g., abstract traits like
"satisfaction").
Structural Equation Modeling (SEM)
Key Components of SEM:
1. Measurement Model: Specifies how the observed variables (indicators) measure the
latent variables.
2. Structural Model: Specifies the relationships between latent variables (similar
to regression).
Python Example: Factor Analysis (Latent Variable)
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Illustrative responses to survey items Q1-Q4 (assumed values)
df = pd.DataFrame({'Q1': [3, 4, 5, 4, 5], 'Q2': [2, 4, 5, 3, 5],
                   'Q3': [3, 5, 4, 4, 5], 'Q4': [2, 3, 5, 4, 4]})

# Extract a single latent factor from the four observed items
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(df)
print("Factor loadings:\n", fa.loadings_)
In this example, we assume that the observed variables (e.g., survey questions Q1
to Q4) are used to estimate a single latent factor.
SEM Example Using semopy
import pandas as pd
from semopy import Model

# Illustrative data: Q1-Q3 measure Satisfaction, L1-L3 measure Loyalty
# (a real SEM needs far more observations than this toy sample)
data = {'Q1': [3, 4, 5, 5, 6], 'Q2': [2, 4, 5, 6, 7], 'Q3': [3, 5, 6, 6, 7],
        'L1': [3, 4, 5, 6, 7], 'L2': [4, 5, 6, 6, 8], 'L3': [5, 6, 7, 7, 9]}
df = pd.DataFrame(data)

model_desc = """
# Measurement model
Satisfaction =~ Q1 + Q2 + Q3
Loyalty =~ L1 + L2 + L3
# Structural paths
Loyalty ~ Satisfaction
"""

model = Model(model_desc)
model.fit(df)            # semopy 2.x interface
print(model.inspect())   # parameter estimates
Explanation:
Satisfaction =~ Q1 + Q2 + Q3 : This line specifies that the latent variable
"Satisfaction" is inferred from the observed variables Q1, Q2, and Q3.
Loyalty ~ Satisfaction : This line specifies that the latent variable "Loyalty" is
influenced by "Satisfaction".
Summary of Key Concepts
1. Latent Variables: These are abstract variables that are not directly observed
but are inferred from other measured variables. Latent variables are commonly
used to represent unobservable constructs like intelligence, satisfaction, or
economic status.