REVISION QUESTIONS/AREAS
1. List and explain three techniques used in exploratory data analysis (EDA). (5 marks)
2. Given a dataset, explain how you would use Python's Pandas library to read, clean, and analyze the data.
Provide example code snippets for each step. (10 marks)
3. Discuss the differences between supervised and unsupervised learning models. Provide examples of each.
(5 marks)
4. Describe how SciPy can be used for optimization tasks. Provide an example of a simple optimization problem
and its solution using SciPy. (5 marks)
5. How do you use Matplotlib to create informative data visualizations? Provide an example of a line plot and a
subplot. (5 marks)
7. Describe the steps to perform data manipulation using Pandas. (10 marks)
8. Using Python, demonstrate how to perform a basic statistical analysis on a dataset using NumPy and SciPy.
Include code snippets. (10 marks)
9. How does the ANOVA technique help in understanding data variance? Provide an example of its application in
data analysis. (5 marks)
10. Describe the basic operations and array attributes in NumPy. Provide examples of creating arrays and
performing operations on them. (5 marks)
11. Explain how to use Pandas for SQL operations. Provide example code to demonstrate reading data from a SQL
database and performing a simple query. (10 marks)
13. List and describe three industries that utilize data science.
15. What are the key techniques for exploring and visualizing data?
17. Describe different data types used for plotting and visualization.
22. Explain the concept of S Square and its significance in correlation analysis.
23. Describe practical applications and interpretation of regression and ANOVA.
25. Explain how to set up a Python environment and the role of Jupyter Notebook.
26. List and describe Python data types, operators, and functions.
28. Explain the different data sets and data structures available in R.
32. List basic operations, mathematical functions, and array attributes in NumPy.
37. Describe how to work with DataFrames, including data operations and indexing.
38. Explain how to read and write data files and perform SQL operations with Pandas.
40. How do you identify problem types and select appropriate learning models?
41. Discuss the process of training, testing, and optimizing machine learning models with Scikit-Learn.
44. Describe how to train NLP models and perform grid search for optimization.
46. Explain how to plot with Matplotlib, including line properties, (x,y) plots, and subplots.
47. Discuss techniques for creating visually appealing and informative data visualizations with Matplotlib.
1. List and explain three techniques used in exploratory data analysis (EDA). (5 marks)
Techniques in EDA:
1. Descriptive Statistics: Summarizing the main features of a dataset quantitatively. This includes measures like
mean, median, mode, standard deviation, and quartiles.
2. Data Visualization: Using graphical representations to see patterns, trends, and outliers. Common
visualizations include histograms, box plots, scatter plots, and heatmaps.
3. Data Cleaning: Identifying and correcting errors or inconsistencies in the data. This includes handling missing
values, correcting data types, and dealing with duplicates.
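A minimal sketch tying the three techniques together (the column name and data are illustrative):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'score': [70, 85, 85, 90, 95, None, 85]})
# 1. Descriptive statistics
print(df['score'].describe())
# 2. Data visualization: distribution of scores
df['score'].hist()
plt.show()
# 3. Data cleaning: drop missing values and duplicates
df = df.dropna().drop_duplicates()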
2. Given a dataset, explain how you would use Python's Pandas library to read, clean,
and analyze the data. Provide example code snippets for each step. (10 marks)
Reading Data:
import pandas as pd
# Load the dataset from a CSV file
df = pd.read_csv('data.csv')
Cleaning Data:
# Removing duplicates
df = df.drop_duplicates()
# Handling missing values
df = df.dropna()
Analyzing Data:
import matplotlib.pyplot as plt
# Descriptive statistics
print(df.describe())
# Distribution of a single column
df['column'].hist()
plt.show()
3. Discuss the differences between supervised and unsupervised learning models. Provide
examples of each. (5 marks)
Supervised Learning:
Definition: Models are trained using labeled data.
Examples: Regression, Classification.
Regression: Predicting house prices.
Classification: Identifying spam emails.
Unsupervised Learning:
Definition: Models are trained on unlabeled data to discover hidden patterns or structure.
Examples: Clustering, Dimensionality Reduction.
Clustering: Grouping customers by purchasing behavior (e.g., K-means).
Dimensionality Reduction: Compressing many features into a few components (e.g., PCA).
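A minimal scikit-learn sketch contrasting the two (the toy data and use of scikit-learn are illustrative assumptions):
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
# Supervised: labels y are provided, and the model learns the X -> y mapping
y = np.array([2, 4, 6, 8, 10])
reg = LinearRegression().fit(X, y)
print(reg.predict([[6]]))
# Unsupervised: no labels; the model finds structure on its own
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)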
4. Describe how SciPy can be used for optimization tasks. Provide an example of a
simple optimization problem and its solution using SciPy. (5 marks)
Optimization with SciPy:
Usage: SciPy's optimize module provides functions for optimization and root finding.
Example:
from scipy.optimize import minimize
# Objective function to minimize
def objective(x):
    return x**2 + 5*x + 4
# Initial guess
x0 = 0
# Minimization
result = minimize(objective, x0)
print('Optimal value:', result.x)
5. How do you use Matplotlib to create informative data visualizations? Provide an
example of a line plot and a subplot. (5 marks)
Line Plot:
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Line plot
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Subplot:
# Data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]
# Subplot
fig, axs = plt.subplots(2)
axs[0].plot(x, y1)
axs[0].set_title('First Subplot')
axs[1].plot(x, y2)
axs[1].set_title('Second Subplot')
plt.tight_layout()
plt.show()
7. Describe the steps to perform data manipulation using Pandas. (10 marks)
Steps:
1. Loading Data:
import pandas as pd
df = pd.read_csv('data.csv')
2. Inspecting Data:
df.head()
df.info()
df.describe()
3. Filtering Data:
filtered_df = df[df['column'] > value]
4. Merging DataFrames:
df2 = pd.read_csv('data2.csv')
merged_df = pd.merge(df, df2, on='key')
5. Applying Functions:
df['new_column'] = df['column'].apply(lambda x: x * 2)
6. Sorting Data:
df = df.sort_values(by='column')
7. Saving Data:
df.to_csv('cleaned_data.csv', index=False)
8. Using Python, demonstrate how to perform a basic statistical analysis on a dataset
using NumPy and SciPy. Include code snippets. (10 marks)
import numpy as np
# Sample data
data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6, 6, 7, 8, 9, 10])
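Analysis (a minimal sketch, using NumPy for summary statistics and scipy.stats for distribution shape):
from scipy import stats
print(f'Mean: {np.mean(data)}')
print(f'Median: {np.median(data)}')
print(f'Standard deviation: {np.std(data):.2f}')
print(f'Skewness: {stats.skew(data):.2f}')
print(f'Full summary: {stats.describe(data)}')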
9. How does the ANOVA technique help in understanding data variance? Provide an
example of its application in data analysis. (5 marks)
ANOVA (Analysis of Variance):
Purpose: Compares means of three or more groups to see if at least one is significantly different.
Application:
Example: Testing if different teaching methods affect student performance.
import pandas as pd
from scipy import stats
# Sample data
data = {
'Method A': [85, 90, 88, 92],
'Method B': [78, 80, 82, 84],
'Method C': [90, 92, 94, 96]
}
df = pd.DataFrame(data)
# ANOVA
f_stat, p_value = stats.f_oneway(df['Method A'], df['Method B'], df['Method C'])
print(f'F-statistic: {f_stat}, P-value: {p_value}')
Interpretation: a p-value below 0.05 indicates that at least one teaching method produces a significantly different mean score.
10. Describe the basic operations and array attributes in NumPy. Provide examples of
creating arrays and performing operations on them. (5 marks)
Array Creation:
import numpy as np
# Creating arrays
array1 = np.array([1, 2, 3])
array2 = np.zeros((2, 3))
array3 = np.ones((3, 3))
array4 = np.arange(0, 10, 2)
array5 = np.linspace(0, 1, 5)
Array Operations:
# Element-wise operations (array shapes must match)
sum_array = array4 + array5
prod_array = array4 * array5
# Matrix operations
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
matrix_sum = matrix1 + matrix2
matrix_product = matrix1 @ matrix2
print(matrix_sum, matrix_product)
Attributes:
# Array attributes
print(f'Shape: {array1.shape}')
print(f'Data type: {array1.dtype}')
print(f'Number of dimensions: {array1.ndim}')
11. Explain how to use Pandas for SQL operations. Provide example code to
demonstrate reading data from a SQL database and performing a simple query. (10
marks)
Example:
import pandas as pd
import sqlite3
# Connect to the database (the file and table names are illustrative)
conn = sqlite3.connect('database.db')
# Load the result of a simple SQL query into a DataFrame
query_result = pd.read_sql_query('SELECT * FROM employees WHERE age > 30', conn)
print(query_result)
conn.close()
12. Define data science and outline its basic concepts.
Data Science:
Definition: An interdisciplinary field focused on extracting insights from data using techniques from statistics,
computer science, and domain knowledge.
Basic Concepts:
Data Collection: Gathering data from various sources.
Data Cleaning: Preparing and cleaning data for analysis.
Exploratory Data Analysis (EDA): Understanding data patterns and characteristics.
Modeling: Applying statistical and machine learning models to make predictions.
Validation: Assessing model performance.
Communication: Presenting insights through reports and visualizations.
13. List and describe three industries that utilize data science.
Industries:
1. Healthcare: Predictive modeling for patient outcomes, personalized treatment plans, and medical image
analysis.
2. Finance: Fraud detection, risk management, algorithmic trading, and customer segmentation.
3. Retail: Inventory management, recommendation systems, and customer behavior analysis.
15. What are the key techniques for exploring and visualizing data?
Techniques:
1. Descriptive Statistics: Summarize data using mean, median, mode, and standard deviation.
2. Data Visualization: Use charts and plots to represent data visually (e.g., histograms, scatter plots, box plots).
3. Correlation Analysis: Identify relationships between variables using correlation coefficients.
4. Dimensionality Reduction: Simplify data using techniques like PCA (Principal Component Analysis).
5. Clustering: Group similar data points together to identify patterns (e.g., K-means clustering).
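A minimal sketch of correlation, PCA, and K-means on illustrative data (scikit-learn is an assumption here):
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
df = pd.DataFrame({'height': [160, 170, 180, 175, 165],
                   'weight': [55, 70, 85, 80, 60]})
# Correlation analysis
print(df.corr())
# Dimensionality reduction to one principal component
components = PCA(n_components=1).fit_transform(df)
# Clustering into two groups
labels = KMeans(n_clusters=2, n_init=10).fit_predict(df)
print(labels)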
16. Explain descriptive statistics and their importance.
Descriptive Statistics:
Definition: Statistics that summarize and describe the main features of a dataset.
Importance:
Data Summarization: Provide a quick overview of data.
Central Tendency: Measures like mean, median, and mode indicate the central point of the data.
Dispersion: Measures like range, variance, and standard deviation show data spread.
Shape: Skewness and kurtosis describe data distribution shape.
17. Describe different data types used for plotting and visualization.
Data Types:
1. Numerical Data: Continuous or discrete values, used in line plots, histograms, and scatter plots.
2. Categorical Data: Distinct categories, used in bar charts and pie charts.
3. Time-Series Data: Data points indexed by time, used in line plots and area plots.
4. Geospatial Data: Data with geographic components, used in maps and heatmaps.
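A short sketch matching each data type to a typical plot (all data illustrative):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Numerical data: histogram
plt.hist(np.random.normal(50, 10, 200))
plt.title('Numerical')
# Categorical data: bar chart
plt.figure()
plt.bar(['A', 'B', 'C'], [10, 15, 7])
plt.title('Categorical')
# Time-series data: line plot
dates = pd.date_range('2024-01-01', periods=30)
plt.figure()
plt.plot(dates, np.random.randn(30).cumsum())
plt.title('Time Series')
plt.show()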
Inferential Statistics:
Definition: Drawing conclusions about a population based on data from a sample.
Population: The entire group of individuals or instances about whom we want to draw conclusions.
Sample: A subset of the population selected for observation and analysis.
Sampling Techniques:
1. Random Sampling: Each member of the population has an equal chance of being selected.
2. Stratified Sampling: Population divided into subgroups (strata), and samples are taken from each.
3. Cluster Sampling: Population divided into clusters, and a whole cluster is sampled.
4. Systematic Sampling: Every nth member of the population is selected.
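A minimal Pandas sketch of three of these techniques (the population and column names are illustrative):
import pandas as pd
import numpy as np
population = pd.DataFrame({'id': range(1000),
                           'group': np.where(np.arange(1000) < 700, 'urban', 'rural')})
# Random sampling: each member has an equal chance
random_sample = population.sample(n=100, random_state=42)
# Stratified sampling: sample 10% within each stratum
stratified = population.groupby('group', group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42))
# Systematic sampling: every 10th member
systematic = population.iloc[::10]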
ANOVA (Analysis of Variance):
Purpose: Compare means of three or more groups to see if at least one is significantly different.
Types: One-way ANOVA, two-way ANOVA.
Example: Testing if different fertilizers affect crop yield.
22. Explain the concept of S Square and its significance in correlation analysis.
S Square (Sum of Squares): The sum of squared deviations from the mean, measuring total variability in the data.
Significance: the total sum of squares splits into an explained part and a residual part, and the ratio of explained to total gives R square, the proportion of variance in one variable accounted for by its relationship with another.
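A short worked example computing the sums of squares and R square for a fitted line (data illustrative):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept
ss_total = np.sum((y - y.mean()) ** 2)      # total variability
ss_residual = np.sum((y - y_pred) ** 2)     # unexplained variability
r_squared = 1 - ss_residual / ss_total      # proportion of variance explained
print(f'SS_total: {ss_total:.2f}, SS_residual: {ss_residual:.2f}, R^2: {r_squared:.2f}')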
23. Describe practical applications and interpretation of regression and ANOVA.
Applications:
Regression: Predicting outcomes such as sales, prices, or exam scores from explanatory variables.
ANOVA: Testing whether several groups or treatments differ in their mean outcomes.
Interpretation:
Regression: Coefficients indicate the direction and strength of relationships between variables.
ANOVA: F-statistic and p-value indicate if group means are significantly different.
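A minimal sketch of interpreting a simple regression with scipy.stats.linregress (data illustrative):
from scipy import stats
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]
result = stats.linregress(hours, scores)
print(f'Slope: {result.slope:.2f}')        # direction and strength of the relationship
print(f'P-value: {result.pvalue:.4f}')     # significance of the slope
print(f'R-squared: {result.rvalue ** 2:.3f}')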
Python Libraries for Data Science:
1. Data Manipulation: Libraries like Pandas for data cleaning and transformation.
2. Statistical Analysis: Libraries like SciPy and Statsmodels for statistical tests and models.
3. Machine Learning: Libraries like Scikit-Learn for building and evaluating models.
4. Data Visualization: Libraries like Matplotlib and Seaborn for creating plots and charts.
5. Big Data: Libraries like Dask and PySpark for handling large datasets.
25. Explain how to set up a Python environment and the role of Jupyter Notebook.
Setting Up Python Environment:
1. Install Python: Download and install from the official Python website.
2. Package Manager: Install pip or conda for managing packages.
3. Virtual Environment: Create a virtual environment to manage dependencies.
# Using venv
python -m venv myenv
source myenv/bin/activate
# Using conda
conda create --name myenv python=3.8
conda activate myenv
Installing Packages:
pip install pandas numpy matplotlib jupyter
26. List and describe Python data types, operators, and functions.
Data Types: int, float, str, bool, list, tuple, dict, and set.
Operators: arithmetic (+, -, *, /, //, %, **), comparison (==, !=, <, >, <=, >=), logical (and, or, not), and assignment (=, +=, -=).
Functions: reusable blocks of code defined with def, plus built-ins such as len(), print(), and range().
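A compact sketch demonstrating each category (all names illustrative):
# Data types
age = 30               # int
price = 19.99          # float
name = 'Alice'         # str
items = [1, 2, 3]      # list
point = (1, 2)         # tuple
scores = {'math': 90}  # dict
# Operators
total = age + 5            # arithmetic
is_adult = age >= 18       # comparison
both = is_adult and True   # logical
# Functions: user-defined with def, plus built-ins like len() and print()
def greet(person):
    return f'Hello, {person}!'
print(len(items), greet(name))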
28. Explain the different data sets and data structures available in R.
Data Sets: R ships with built-in example data sets such as mtcars and iris, loaded with data().
Data Structures: vectors (one-dimensional, single type), lists (mixed types), matrices (two-dimensional, single type), data frames (tabular data with mixed column types), and factors (categorical data).
Data Manipulation:
dplyr Package:
library(dplyr)
df <- df %>%
filter(Age > 25) %>%
mutate(Salary = Age * 1000) %>%
arrange(Name)
Data Visualization:
ggplot2 Package:
library(ggplot2)
ggplot(df, aes(x = Age, y = Salary)) +
geom_point() +
theme_minimal()
Creating Ndarrays:
import numpy as np
# Creating arrays
array1 = np.array([1, 2, 3])
array2 = np.zeros((2, 3))
array3 = np.ones((3, 3))
array4 = np.arange(0, 10, 2)
array5 = np.linspace(0, 1, 5)
Manipulating Ndarrays:
# Reshaping
reshaped = array1.reshape((3, 1))
# Indexing
element = array1[0]
# Slicing
subset = array1[1:3]
# Broadcasting
broadcasted = array1 * 2
# Aggregation
sum_array = array1.sum()
mean_array = array1.mean()
32. List basic operations, mathematical functions, and array attributes in NumPy.
Basic Operations: element-wise arithmetic (+, -, *, /), comparison operations, and broadcasting between arrays of compatible shapes.
Mathematical Functions:
1. Sum: np.sum()
2. Mean: np.mean()
3. Standard Deviation: np.std()
4. Min/Max: np.min(), np.max()
5. Sin/Cos/Tan: np.sin(), np.cos(), np.tan()
Array Attributes:
1. Shape: array.shape
2. Data Type: array.dtype
3. Number of Dimensions: array.ndim
4. Size: array.size
SciPy:
Definition: An open-source Python library used for scientific and technical computing.
Sub-packages:
1. scipy.optimize: Functions for optimization and root finding.
2. scipy.stats: Statistical functions and tests.
3. scipy.integrate: Numerical integration routines.
4. scipy.linalg: Linear algebra routines.
5. scipy.signal: Signal processing tools.
6. scipy.sparse: Sparse matrix operations.
7. scipy.fftpack: Fast Fourier Transform routines.
Common Tasks:
1. Integration: numerical integration with scipy.integrate (e.g., quad()).
2. Optimization:
from scipy.optimize import minimize
def objective(x):
    return x**2 + 5*x + 4
result = minimize(objective, 0)
print(result.x)
3. Statistics: statistical tests and distributions with scipy.stats (e.g., t-tests, ANOVA).
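A minimal sketch of the integration and statistics sub-packages (data illustrative):
from scipy import integrate, stats
# Integration: area under x^2 from 0 to 1 (exact answer is 1/3)
area, error = integrate.quad(lambda x: x**2, 0, 1)
print(f'Integral: {area:.4f}')
# Statistics: two-sample t-test on illustrative data
group_a = [85, 90, 88, 92, 87]
group_b = [78, 80, 82, 84, 79]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f't-statistic: {t_stat:.2f}, p-value: {p_value:.4f}')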
Pandas:
Definition: A powerful open-source data analysis and manipulation library for Python.
Usage:
Data Structures: Provides DataFrames and Series for handling structured data.
Data Cleaning: Handle missing values, duplicates, and inconsistent data.
Data Transformation: Apply functions, group data, and perform aggregations.
Data Analysis: Perform statistical analysis, merge/join datasets, and filter data.
Input/Output: Read and write data from/to various file formats (CSV, Excel, SQL).
37. Describe how to work with DataFrames, including data operations and indexing.
Working with DataFrames:
Creating DataFrames:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
Indexing:
# Selecting a column
ages = df['Age']
# Selecting rows by label and by position
first_row = df.loc[0]
first_two = df.iloc[0:2]
Data Operations:
# Adding a new column
df['Salary'] = [50000, 60000, 70000]
# Applying functions
df['Age in 10 Years'] = df['Age'].apply(lambda x: x + 10)
# Merging DataFrames
data2 = {'Name': ['Alice', 'Bob'], 'Department': ['HR', 'IT']}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='Name')
38. Explain how to read and write data files and perform SQL operations with Pandas.
Reading and Writing Data Files:
Reading:
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')
df = pd.read_json('data.json')
Writing:
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
df.to_json('output.json')
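SQL Operations: a minimal sketch with SQLite (the database, table, and column names are illustrative):
import pandas as pd
import sqlite3
conn = sqlite3.connect('data.db')
# Write a DataFrame to a SQL table
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df.to_sql('people', conn, if_exists='replace', index=False)
# Read back with a simple query
result = pd.read_sql_query('SELECT * FROM people WHERE Age > 26', conn)
print(result)
conn.close()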