Data Science Unit 1
Types of Data
Structured Data: Data organized in rows and columns, typically stored in relational
databases. Examples include customer records, sales data, and financial transactions.
Unstructured Data: Data that does not follow a predefined format. Examples include text,
images, videos, and social media posts.
Semi-Structured Data: Data with some organizational properties but no fixed schema,
such as JSON, XML, or CSV files.
Sources of Data
Primary Data: Data collected firsthand for a specific purpose, such as surveys, experiments,
or interviews.
Secondary Data: Data collected by others and repurposed for analysis, such as publicly
available datasets, research studies, or government reports.
Real-Time Data: Data that is generated and processed instantly, often used in IoT systems,
stock trading platforms, or monitoring systems.
Dimensions of Data
Volume: Refers to the amount of data generated, often measured in terabytes or petabytes.
Variety: Indicates the different types and formats of data, such as images, text, or numerical
data.
Velocity: The speed at which data is generated and processed, critical for real-time
analytics.
Veracity: The quality and reliability of data, considering inaccuracies, biases, or
inconsistencies.
Value: The usefulness of the data for generating insights and driving decisions.
Characteristics of Data
Granularity: The level of detail in the data. High granularity means detailed data, while low
granularity refers to summarized or aggregated data.
Sparsity: The ratio of missing or zero values in a dataset. Sparse data often needs
specialized techniques for analysis.
Lifecycle of Data
Collection: Gathering raw data from various sources, such as sensors, web scraping, or
manual entry.
Storage: Organizing and storing data in databases, data lakes, or cloud platforms.
Processing: Cleaning and transforming data to make it suitable for analysis.
Analysis: Applying statistical and machine learning techniques to uncover patterns and
insights.
Visualization: Communicating insights through charts, graphs, and dashboards.
Decision-Making: Using insights to guide business or research decisions.
Ethical Considerations
Privacy: Ensuring data is used responsibly and in compliance with laws like GDPR or CCPA.
Bias: Identifying and mitigating biases in data collection or analysis.
Security: Protecting sensitive data from breaches or unauthorized access.
Understanding the facets of data is critical in data science, as it lays the groundwork for
effective analysis and insight generation. By considering aspects like data type, source,
dimension, and ethical use, data scientists can harness the power of data to address real-
world challenges and drive innovation.
Problem Definition
Objective Setting: Clearly define the problem or question to be answered. Understand the
business or research goals and identify the key performance indicators (KPIs).
Stakeholder Involvement: Collaborate with domain experts to align expectations and
ensure that the analysis addresses practical needs.
Outcome Specification: Determine the expected outputs and their implications for
decision-making.
Data Collection
Identify Data Sources: Gather data from relevant sources, such as databases, APIs, surveys,
or external datasets.
Primary and Secondary Data: Determine whether new data collection is needed or if
existing datasets suffice.
Data Acquisition Tools: Use tools like web scraping, SQL queries, or cloud platforms to
extract data.
Data Preparation
Data Cleaning: Handle missing values, duplicate records, and outliers. Standardize formats
and resolve inconsistencies.
Data Transformation: Convert raw data into a usable format through normalization,
scaling, or encoding categorical variables.
Feature Engineering: Create new variables or features that enhance the model’s predictive
power.
Data Exploration and Analysis
Exploratory Data Analysis (EDA): Use statistical and visualization techniques to
understand data distribution, relationships, and trends.
Hypothesis Testing: Formulate and test hypotheses to validate assumptions about the
data.
Statistical Analysis: Apply statistical methods to identify patterns, correlations, and
anomalies.
Dimensionality Reduction: Reduce the number of features while retaining important
information (e.g., using PCA or t-SNE).
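As an illustration of the dimensionality-reduction step, here is a minimal sketch using scikit-learn's PCA on synthetic data (the dataset and parameter choices are assumptions for demonstration, not part of these notes):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 features (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Project onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component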
Modeling
Model Selection: Choose appropriate models based on the problem type (e.g., regression,
classification, clustering, or time-series analysis).
Model Training: Train the model on the prepared dataset using algorithms like linear
regression, decision trees, or neural networks.
Model Evaluation: Assess the model’s performance using metrics like accuracy, precision,
recall, F1-score, or mean squared error (MSE).
Hyperparameter Tuning: Optimize the model by adjusting hyperparameters to improve
performance.
Cross-Validation: Repeatedly partition the data into training and validation folds (e.g.,
k-fold cross-validation) to ensure robustness and generalizability.
Overfitting/Underfitting Analysis: Check for signs of overfitting (poor performance on
new data) or underfitting (insufficient learning).
Performance Metrics: Compare models and select the best-performing one for
deployment.
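A minimal sketch of the train/evaluate workflow described above, assuming scikit-learn and a synthetic dataset (the model, dataset, and parameter choices here are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic classification data (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out a test set to check generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3)  # max_depth is a hyperparameter to tune
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))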
Deployment
Integration: Implement the final model into production systems, such as applications, APIs,
or dashboards.
Data Visualization: Present findings using charts, graphs, or dashboards to convey insights
clearly.
Stakeholder Communication: Translate technical results into actionable
recommendations for non-technical stakeholders.
Documentation: Record the process, assumptions, and findings to ensure reproducibility
and knowledge transfer.
Feedback Loop: Gather feedback from stakeholders to refine the model or analysis.
Continuous Improvement: Update models and processes as new data becomes available
or as requirements evolve.
3. Introduction to NumPy
NumPy (short for Numerical Python) is one of the most popular and fundamental libraries
in Python for numerical computing. It provides support for working with large,
multidimensional arrays and matrices, along with a collection of high-level mathematical
functions to operate on these arrays efficiently.
Key Features of NumPy
1. N-Dimensional Arrays: Provides the ndarray object, which allows for efficient storage and
operations on large datasets.
2. Mathematical Functions: Includes a wide range of functions for statistical, linear algebra,
and Fourier analysis.
3. Broadcasting: Allows for vectorized operations, making it unnecessary to write loops for
element-wise operations.
4. Interoperability: Works seamlessly with other Python libraries such as Pandas, Matplotlib,
and Scikit-learn.
5. High Performance: Built on C, ensuring speed and efficiency for numerical computations.
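For instance, broadcasting lets a scalar or a smaller array combine with a larger one without explicit loops:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a * 2)                        # scalar broadcast over every element
print(a + np.array([10, 20, 30]))   # 1-D row broadcast across both rows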
Installing NumPy
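NumPy is not part of the Python standard library; it can be installed from PyPI with pip (or obtained through a distribution such as Anaconda):

pip install numpy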
The core of NumPy is the ndarray (n-dimensional array), which is a grid of values of the same data
type.
import numpy as np

arr = np.array([1, 2, 3])  # create an ndarray from a Python list
print(arr)  # Output: [1 2 3]
Array Creation
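A few common constructors (a brief sketch of standard NumPy calls):

import numpy as np

arr = np.array([1, 2, 3])     # from a Python list
zeros = np.zeros((2, 3))      # 2x3 array of zeros
ones = np.ones(4)             # 1-D array of ones
seq = np.arange(0, 10, 2)     # [0 2 4 6 8]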
Array Operations
Element-Wise Operations:
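Arithmetic operators apply element by element, for example:

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b)  # [5 7 9]
print(a * b)  # [ 4 10 18]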
Matrix Multiplication:
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
print(np.dot(mat1, mat2)) # Matrix product
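Equivalently, Python's @ operator performs matrix multiplication: print(mat1 @ mat2) gives the same result for 2-D arrays.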
Indexing
arr = np.array([10, 20, 30, 40, 50])
print(arr[1])  # 20
Slicing:
print(arr[1:4])  # [20 30 40]
Shape and Size:
arr2 = np.array([[1, 2], [3, 4]])
print(arr2.shape)  # (2, 2)
print(arr2.size)  # 4
Reshape
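reshape() changes an array's dimensions without changing its data, for example:

arr = np.arange(6)            # [0 1 2 3 4 5]
print(arr.reshape(2, 3))
# [[0 1 2]
#  [3 4 5]]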
Aggregation
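Aggregation functions reduce an array to summary values, for example:

arr = np.array([[1, 2], [3, 4]])
print(arr.sum())        # 10
print(arr.mean())       # 2.5
print(arr.max(axis=0))  # [3 4] (column-wise maximum)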
Attributes
NumPy arrays, represented by the ndarray object, come with various attributes that
provide useful metadata about the array. These attributes allow you to better understand
and manipulate arrays in an efficient and organized manner. Below are the key attributes of
a NumPy array:
ndarray.shape
Description: Returns a tuple representing the dimensions of the array (number of rows,
columns, etc.).
Example:
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)  # (2, 3)
ndarray.ndim
Description: Returns the number of dimensions (axes) of the array.
Example:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.ndim)  # 2
ndarray.size
Description: Returns the total number of elements in the array (product of all dimensions).
Example:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.size)  # 6
ndarray.dtype
Description: Returns the data type of the elements in the array (e.g., int64, float64).
Example:
arr = np.array([1.0, 2.0])
print(arr.dtype)  # float64
ndarray.itemsize
Description: Returns the size (in bytes) of each element in the array.
Example:
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.itemsize)  # 8
ndarray.nbytes
Description: Returns the total number of bytes consumed by the array (size × itemsize).
Example:
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.nbytes)  # 24 (3 elements × 8 bytes each)
ndarray.T (Transpose)
Description: Returns the transposed version of the array (only for 2D or higher
dimensions).
Example:
arr = np.array([[1, 2], [3, 4]])
print(arr.T)
# Output:
# [[1 3]
#  [2 4]]
ndarray.flags
Description: Returns information about the memory layout of the array (e.g., whether it is
C-contiguous, Fortran-contiguous, or writeable).
Example:
arr = np.array([1, 2, 3])
print(arr.flags['C_CONTIGUOUS'])  # True
ndarray.strides
Description: Returns the number of bytes to step in each dimension when traversing an
array.
Example:
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
print(arr.strides)  # (24, 8): 24 bytes to the next row, 8 bytes to the next column
ndarray.base
Description: If the array is a view of another array, this attribute points to the original
array.
Example:
arr = np.array([1, 2, 3, 4])
view = arr[1:3]
print(view.base is arr)  # True
ndarray.flat
Description: Returns a 1-D iterator over all elements of the array, regardless of its shape.
Example:
arr = np.array([[1, 2], [3, 4]])
print(list(arr.flat))  # [1, 2, 3, 4]
4. NumPy Array Objects: The core of NumPy is its array object, called the ndarray (N-
dimensional array). This object is a powerful and versatile data structure for handling
numerical data efficiently. It is the backbone of most numerical operations in NumPy.
Creating Arrays
import numpy as np
# From a list
arr = np.array([1, 2, 3, 4, 5])
np.linspace(): Creates an array of evenly spaced values between a start and end point.
lin = np.linspace(0, 1, 5)
print(lin)  # [0.   0.25 0.5  0.75 1.  ]
np.random.rand(): Creates an array of the given shape filled with random values in [0, 1).
rand_arr = np.random.rand(2, 3)
print(rand_arr)
Joining Arrays
a. Using np.concatenate()
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
joined = np.concatenate((a, b))
print(joined)  # Output: [1 2 3 4 5 6]
2. Splitting Arrays
a. Using np.split()
Splits an array into equal parts; raises an error if an exact split is not possible.
arr = np.array([1, 2, 3, 4, 5, 6])
parts = np.split(arr, 3)
print(parts)  # [array([1, 2]), array([3, 4]), array([5, 6])]
b. Using np.array_split()
Splits an array into a given number of parts, allowing unequal sizes when the array cannot
be divided evenly.
arr = np.array([1, 2, 3, 4, 5])
parts = np.array_split(arr, 3)
print(parts)  # [array([1, 2]), array([3, 4]), array([5])]
c. Using np.hsplit() and np.vsplit()
arr = np.array([[1, 2], [3, 4], [5, 6]])
# Horizontal split
h_split = np.hsplit(arr, 2)
print(h_split)
# Output: [array([[1], [3], [5]]), array([[2], [4], [6]])]
# Vertical split
v_split = np.vsplit(arr, 3)
print(v_split)
# Output: [array([[1, 2]]), array([[3, 4]]), array([[5, 6]])]
3. Searching in Arrays
a. Using np.where()
Returns the indices of elements that satisfy a condition.
arr = np.array([1, 2, 3, 4, 5])
print(np.where(arr > 2))  # (array([2, 3, 4]),)
b. Using np.searchsorted()
Searches for a value and returns the index where it should be inserted to maintain order.
arr = np.array([1, 3, 5, 7])
index = np.searchsorted(arr, 4)
print(index)  # Output: 2
c. Using np.isin()
Checks which elements of an array are present in a given set of values.
arr = np.array([1, 2, 3, 4])
print(np.isin(arr, [2, 4]))  # [False  True False  True]
4. Sorting Arrays
a. Using np.sort()
Returns a sorted copy of the array, leaving the original unchanged.
arr = np.array([3, 1, 2])
print(np.sort(arr))  # [1 2 3]
b. Using np.argsort()
Returns the indices that would sort the array.
arr = np.array([3, 2, 5, 1, 4])
indices = np.argsort(arr)
print(indices)  # Output: [3 1 0 4 2]
Iterating
NumPy arrays support iteration just like Python lists, but with additional functionality for
multidimensional arrays.
You can iterate over the elements of a 1D array using a simple for loop:
import numpy as np

arr = np.array([1, 2, 3])
for x in arr:
    print(x)  # prints 1, 2, 3 on separate lines
To iterate over each element in a multidimensional array, use the nditer() function:
arr = np.array([[1, 2], [3, 4]])
for x in np.nditer(arr):
    print(x)  # prints 1, 2, 3, 4 on separate lines
Use the ndenumerate() function to get both the index and the value during iteration:
arr = np.array([[1, 2], [3, 4]])
for index, value in np.ndenumerate(arr):
    print(index, value)
# Output:
# (0, 0) 1
# (0, 1) 2
# (1, 0) 3
# (1, 1) 4
Copying arrays
NumPy arrays can be copied in two main ways: shallow copy and deep copy.
a. Shallow Copy
A shallow copy (view) creates a new array object that refers to the same data as the original
array. Changes in one array will affect the other.
arr = np.array([1, 2, 3])
shallow_copy = arr.view()
shallow_copy[0] = 99
print(arr) # Output: [99 2 3]
print(shallow_copy) # Output: [99 2 3]
b. Deep Copy
A deep copy creates a new array object with a completely independent copy of the data. Changes in
one array do not affect the other.
arr = np.array([1, 2, 3])
deep_copy = arr.copy()
deep_copy[0] = 99
print(arr) # Output: [1 2 3]
print(deep_copy) # Output: [99 2 3]
You can check whether two arrays share memory using the np.may_share_memory() function:
print(np.may_share_memory(arr, arr.view()))  # True
print(np.may_share_memory(arr, arr.copy()))  # False
Identity array
An identity array is a square matrix in which all the diagonal elements are 1 and all other elements
are 0. NumPy provides the np.identity() and np.eye() functions to create identity arrays.
a. Using np.identity()
identity_matrix = np.identity(4)
print(identity_matrix)
# Output:
# [[1. 0. 0. 0.]
# [0. 1. 0. 0.]
# [0. 0. 1. 0.]
# [0. 0. 0. 1.]]
b. Using np.eye()
Allows you to specify the number of rows and columns, and the position of the diagonal.
The np.eye() function in NumPy is used to create a 2D array with ones on the diagonal and zeros
elsewhere. It is similar to the identity matrix but offers more flexibility.
Syntax
numpy.eye(N, M=None, k=0, dtype=float, order='C')
Parameters
N: Number of rows.
M: (Optional) Number of columns. Defaults to N (square matrix).
k: Index of the diagonal:
  k=0 (default): Main diagonal.
  k>0: Above the main diagonal.
  k<0: Below the main diagonal.
dtype: Data type of the returned array. Default is float.
order: Memory layout ('C' for row-major, 'F' for column-major).
Examples
a. Basic Usage
import numpy as np
matrix = np.eye(4)
print(matrix)
# Output:
# [[1. 0. 0. 0.]
# [0. 1. 0. 0.]
# [0. 0. 1. 0.]
# [0. 0. 0. 1.]]
b. Non-Square Matrix
matrix = np.eye(3, 5)
print(matrix)
# Output:
# [[1. 0. 0. 0. 0.]
# [0. 1. 0. 0. 0.]
# [0. 0. 1. 0. 0.]]
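c. Shifted Diagonal
The k parameter described above shifts the diagonal, for example:

matrix = np.eye(3, k=1)
print(matrix)
# Output:
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 0.]]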
5. Pandas Series
The Pandas Series is a one-dimensional labeled array capable of holding any data type (e.g.,
integers, floats, strings, etc.). It is similar to a NumPy array but with labels (index).
a. From a List
import pandas as pd

series = pd.Series([10, 20, 30, 40])
print(series)
# Output:
# 0 10
# 1 20
# 2 30
# 3 40
# dtype: int64
b. From a Dictionary
series = pd.Series({'x': 10, 'y': 20, 'z': 30})
print(series)
# Output:
# x 10
# y 20
# z 30
# dtype: int64
c. Custom Index
series = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
Accessing Elements
a. Indexing
By index:
print(series['x']) # Output: 10
By position:
print(series.iloc[0])  # Output: 10
b. Slicing
print(series['x':'y'])
# Output:
# x 10
# y 20
c. Filtering
print(series[series > 15]) # Output:
# y 20
# z 30
b. Statistical Operations
Mean:
print(series.mean())  # Output: 20.0
Maximum:
print(series.max()) # Output: 30
Sum:
print(series.sum()) # Output: 60
6. Exploring Data Using DataFrames, Index Objects, Reindex, Drop Entry
Pandas DataFrames are 2-dimensional labeled data structures with columns of potentially
different types. They are the primary structure for handling and analyzing data in Python.
Creating a DataFrame
a. From a Dictionary of Lists
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
# 1 Bob 30 San Francisco
# 2 Charlie 35 Los Angeles
2. Index Objects
Every DataFrame has an Index object labeling its rows and a columns Index labeling its columns.
df = pd.DataFrame(data)
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # Index(['Name', 'Age', 'City'], dtype='object')
3. Reindexing
Using reindex()
new_index = [0, 1, 2, 3]
reindexed_df = df.reindex(new_index)
print(reindexed_df)
# Output:
# Name Age City
# 0 Alice 25.0 New York
# 1 Bob 30.0 San Francisco
# 2 Charlie 35.0 Los Angeles
# 3 NaN NaN NaN
4. Dropping Entries
Dropping Rows
df = pd.DataFrame(data)
df_dropped = df.drop(1) # Drops the second row
print(df_dropped)
# Output:
# Name Age City
# 0 Alice 25 New York
# 2 Charlie 35 Los Angeles
Dropping Columns
df_dropped = df.drop('City', axis=1)
print(df_dropped)
# Output:
# Name Age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
In-Place Drop
df.drop(1, inplace=True)  # modifies df directly instead of returning a new DataFrame
a. Viewing Data
df.head(n): Returns the first n rows; df.tail(n): Returns the last n rows.
print(df.head(2))
print(df.tail(2))
df.info(): Summary of the DataFrame, including data types and memory usage.
print(df.info())
b. Descriptive Statistics
print(df.describe())
print(df.size)  # Output: 9 (3 rows × 3 columns)
c. Selecting Rows
print(df.loc[0])   # select a row by label
print(df.iloc[0])  # select a row by position
7. Data Alignment
Data alignment in Pandas ensures that operations between DataFrames or Series align on
their labels (indexes) by default.
Automatic Alignment
When performing arithmetic operations between objects with different indexes, Pandas aligns the
data based on the index.
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
result = s1 + s2
print(result)
# Output:
# a NaN
# b 6.0
# c 8.0
# d NaN
# dtype: float64
You can fill missing values resulting from misaligned indexes using the fill_value parameter:
result = s1.add(s2, fill_value=0)
print(result)
# Output:
# a 1.0
# b 6.0
# c 8.0
# d 6.0
# dtype: float64
8. Sorting and Ranking
1. Sorting
a. Sorting by Index
df = pd.DataFrame({'A': [2, 3, 1]}, index=['a', 'b', 'c'])
sorted_df = df.sort_index()
print(sorted_df)
# Output:
# A
# a 2
# b 3
# c 1
b. Sorting by Values
sorted_df = df.sort_values(by='A')
print(sorted_df)
# Output:
# A
# c 1
# a 2
# b 3
2. Ranking
a. Default Ranking
s = pd.Series([7, 1, 2, 7])
ranked = s.rank()
print(ranked)
# Output:
# 0 3.5
# 1 1.0
# 2 2.0
# 3 3.5
# dtype: float64
b. Handling Ties
You can specify how to handle ties using the method parameter:
ranked_min = s.rank(method='min')
print(ranked_min)
# Output:
# 0 3.0
# 1 1.0
# 2 2.0
# 3 3.0
# dtype: float64
9. Summary Statistics
1. Aggregation Functions
mean(): Average.
sum(): Sum of values.
min() and max(): Minimum and maximum values.
std(): Standard deviation.
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
print(df.mean())
# Output:
# A 2.0
# B 5.0
# C 8.0
# dtype: float64
2. Cumulative Functions
print(df['A'].cumsum())
# Output:
# 0 1
# 1 3
# 2 6
# Name: A, dtype: int64
3. Describe Method
print(df.describe())
# Output:
# A B C
# count 3.000000 3.0 3.0
# mean 2.000000 5.0 8.0
# std 1.000000 1.0 1.0
# min 1.000000 4.0 7.0
# 25% 1.500000 4.5 7.5
# 50% 2.000000 5.0 8.0
# 75% 2.500000 5.5 8.5
# max 3.000000 6.0 9.0
4. Value Counts
s = pd.Series([1, 2, 2, 3, 3, 3])
print(s.value_counts())
# Output:
# 3 3
# 2 2
# 1 1
# dtype: int64
10. Index Hierarchy; Data Acquisition: Gathering Information from Different Sources, Web
APIs, Open Data Sources, Web Scraping
Data acquisition refers to the process of collecting data from various sources to be analyzed.
There are multiple ways to gather data, ranging from using Web APIs, Open Data Sources,
to performing Web Scraping. Each method has its use case depending on the source and
the desired format.
Index Hierarchy
An Index Hierarchy (or MultiIndex) in Pandas allows for multiple levels of indexing in rows or
columns, which can be used to manage complex data and hierarchies. The index is like a key or label
that can represent hierarchical relationships between different rows or columns of data.
Creating a MultiIndex
To create a MultiIndex, you can pass a list of tuples or arrays to the pd.MultiIndex.from_tuples() or
pd.MultiIndex.from_arrays() methods.
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['Letter', 'Number']
)
df = pd.DataFrame({'Data': [10, 20, 30, 40]}, index=index)
print(df)
Output:
Data
Letter Number
A 1 10
2 20
B 1 30
2 40
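The same index can also be built with pd.MultiIndex.from_arrays(), passing one array per level:

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]],
    names=['Letter', 'Number']
)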
Accessing Data from a MultiIndex
print(df.loc['A'])       # all rows with Letter 'A'
print(df.loc[('A', 2)])  # the single row labeled ('A', 2)
2. Data Acquisition
a. Gathering Information from Web APIs
Web APIs (Application Programming Interfaces) provide a way to retrieve data from external
servers. They are often used for acquiring real-time data from various services (such as weather,
finance, social media, etc.).
The requests library is commonly used for interacting with web APIs.
import requests

# Example: current weather from the OpenWeatherMap API
url = "https://api.openweathermap.org/data/2.5/weather?q=London&appid=your_api_key"
response = requests.get(url)
print(response.json())
Note: Replace your_api_key with an actual API key from the OpenWeatherMap API.
b. Open Data Sources
Open Data refers to publicly available datasets that are provided by governments, organizations, or
research institutions. These datasets can be used for analysis, research, or machine learning
purposes.
Government Open Data Portals: Many governments provide open datasets, such as the US
Data.gov and EU Open Data Portal.
Research Datasets: Sites like Kaggle and UCI Machine Learning Repository provide
datasets for machine learning and research.
To acquire open data, you can either download the files manually from the portals or use APIs,
depending on availability.
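Many open datasets are distributed as CSV files, which Pandas can load directly; the URL below is a hypothetical placeholder, not a real dataset:

import pandas as pd

# Hypothetical dataset URL; substitute a real open-data CSV link
url = "https://example.com/open-data/dataset.csv"
df = pd.read_csv(url)
print(df.head())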
c. Web Scraping
Web Scraping is the process of extracting data from websites that don't provide an API. It involves
fetching the HTML content of a webpage and extracting specific data points.
To scrape data from a website, you need the requests and BeautifulSoup libraries. Here's an
example of scraping a simple webpage:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"  # hypothetical page to scrape
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)  # e.g., extract the page title
Important Notes: Always check a website's robots.txt and terms of service before scraping,
and keep request rates low to avoid overloading the server.
Once you have gathered data from different sources (such as Web APIs, Open Data, or Web
Scraping), you can combine and clean the data using Pandas.
Concatenating DataFrames
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
Output:
A B
0 1 4
1 2 5
2 3 6
3 7 10
4 8 11
5 9 12
Merging DataFrames
df_names = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df_ages = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})
merged = pd.merge(df_names, df_ages, on='ID')
print(merged)
Output:
ID Name Age
0 1 Alice 25
1 2 Bob 30