Guided Project: FAQ
1. How should one approach the ReCell project?
Before starting the project, please read the problem statement carefully and go
through the criteria and descriptions mentioned in the rubric.
Once you understand the task, download the dataset and import it into
a Google Colab Notebook to get started with the project.
To work on the project, you should start with data preprocessing and EDA using
descriptive statistics and visualizations.
Once the EDA is completed and the data is preprocessed, you can use the data
to build a model, check its performance, and verify whether it satisfies the
necessary assumptions.
It is important to close the analysis with key findings and recommendations to
the business.
2. Since we have missing values in the dataset, what is the best way to handle or
treat those missing values?
The strategy to deal with missing values varies with the problem at
hand, the data provided, and other factors. Some of the common
strategies are listed below; a minimal pandas sketch follows the list.
- Drop the missing values
- Impute the missing values
  - Using central tendency measures (mean, median, mode) of a column
    - With mean: Missing values are imputed with the mean of the column. Preferred for continuous data with no outliers
    - With median: Missing values are imputed with the median of the column. Preferred for continuous data with outliers
    - With mode: Missing values are imputed with the mode of the column. Preferred for categorical data
  - Using central tendency measures (mean, median, mode) of a column grouped by categories of a categorical column: Preferred for cases where the data under similar categories of a categorical column are likely to have similar properties
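A minimal pandas sketch of these strategies, assuming the DataFrame is named df; the column names 'ram' and 'brand_name' are only illustrative.

# drop rows that contain any missing value
df_dropped = df.dropna()

# impute a numeric column with its median (use .mean() for the mean)
df['ram'] = df['ram'].fillna(df['ram'].median())

# impute a categorical column with its mode
df['brand_name'] = df['brand_name'].fillna(df['brand_name'].mode()[0])

# impute a numeric column with the median of its group,
# grouped by the categories of a categorical column
df['ram'] = df.groupby('brand_name')['ram'].transform(lambda x: x.fillna(x.median()))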
3. In what order should one do EDA and Data Preprocessing?
Whenever one extracts or gets some data, the first step generally is
to explore the data (check the distribution, summary statistics,
interactions between variables). More often than not, the data will be
in a state that would need some amount of preprocessing before any
exploration can be performed. As such, the step of data
preprocessing is both preceded and followed by some amount of data
exploration.
The initial exploration of the data helps in identifying the kind of
preprocessing needed for the data. For example, if the data has
missing values, the data distribution will help you decide on the
strategy to use to treat the missing values. Once you have treated
the missing values, you would want to check the distribution before
going ahead with modeling.
The exact steps to be taken for preprocessing, the kinds of analysis
to perform, and the order of their execution will depend on the data
and problem at hand.
4. X=sm.add_constant(X) is not working for my project as a new column is not
created. Why is this happening? How can this be resolved?
add_constant() does not add a constant column to the data if a
constant column already exists in it. Please check if the data has a
constant column before using add_constant().
Since the independent variables have variability initially and ideally none of
them should become constant, the step at which the variable(s) became constant
has to be identified. The outlier treatment step is a good place to start;
a short check is sketched below.
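A quick check for constant columns, sketched under the assumption that the predictors are stored in a DataFrame named X:

# number of unique values per column; any column with a single unique value is constant
n_unique = X.nunique()
print(n_unique[n_unique == 1])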
5. What should one do if the p-values are high (> 0.05) for some dummies of a
categorical variable but not for the others?
The dummy variables with p-value > 0.05 should be dropped one by
one until there are no such variables. After removing each high p-
value variable, the regression should be run again, and the p-values
of all the variables should be checked.
If all the dummy variables of a categorical column have a p-value >
0.05, then all the dummy variables for that column can be dropped at
once.
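One possible way to automate this process is sketched below, assuming X_train and y_train are the training predictors (with the constant column already added) and the target:

import statsmodels.api as sm

# refit after dropping the predictor with the highest p-value until all p-values are <= 0.05
cols = list(X_train.columns)
while True:
    model = sm.OLS(y_train, X_train[cols]).fit()
    pvalues = model.pvalues.drop('const', errors='ignore')
    if pvalues.max() <= 0.05:
        break
    cols.remove(pvalues.idxmax())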
6. What should one do if the VIF is high (> 5) for some dummies of a
categorical variable but not for the others?
The VIF values for dummy variables can be ignored.
If, however, the VIF value is inf or NaN, then one should check if one
of the dummy variables was dropped during one-hot encoding. If the
VIF value is still inf or NaN, a different dummy variable than the one
dropped by using drop_first=True should be dropped and VIF values
should be checked again.
For example, if a categorical variable 'Season' has four levels 'Spring',
'Summer', 'Fall' and 'Winter', and using drop_first=True drops the
dummy variable for 'Fall', then one can keep the dummy variable for
'Fall' and drop the dummy variable for 'Summer', and then check the
VIF values.
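A short sketch of checking the VIF values with statsmodels, assuming the predictors (with the constant already added) are in a DataFrame named X:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of every column in X; high VIFs for dummy variables can usually be ignored
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)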
7. Do we need to treat the outliers?
It is not mandatory to treat the outliers in the data. Based on the
EDA performed, one can determine whether the outlier values are
legitimate, decide if outlier treatment is needed, and identify
which columns should be treated. Some ways to treat
outliers are the following:
1. Cap the values by the IQR method
2. Drop the outliers
It is important to provide a proper explanation for the chosen
approach in the submission.
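A minimal sketch of capping with the IQR method, assuming the DataFrame is named df; the column name 'ram' is only illustrative:

# whiskers based on the interquartile range
Q1 = df['ram'].quantile(0.25)
Q3 = df['ram'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap values outside the whiskers instead of dropping the rows
df['ram'] = df['ram'].clip(lower=lower, upper=upper)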
8. I am getting this error while building the model:
MissingDataError: exog contains inf or nans
How to resolve it?
This error occurs due to the presence of missing value(s) in the
data passed to the OLS model. The presence of missing values can
be checked using the following code (where df is the data being passed):
df.isnull().sum()
If missing values are present, then one can use appropriate methods
to treat all missing values before feeding the data to the model.
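A short check that can be run before fitting, assuming X and y are the objects passed to the OLS model (sketch only):

import numpy as np

# missing values per column in the predictors and in the target
print(X.isnull().sum())
print(y.isnull().sum())

# the error also mentions inf, so check for infinite values in the numeric columns
print(np.isinf(X.select_dtypes(include=np.number)).sum())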
9. Why am I getting the MAPE as NaN?
NaN values for MAPE may occur due to a minor mistake while
defining the dependent and independent variables. When the
dependent variable is defined as a 2D array, as shown below:
# dependent variable
y = df[["target"]]
the function to compute MAPE gets a 2D array (actual targets) and a
1D array (predicted targets) as input. The resulting shape mismatch
produces a NaN value for MAPE.
In order to rectify this error, one should define the target variable as
a 1D array as follows:
# dependent variable
y = df["target"]
10. I am getting a p-value < 0.05 from the Goldfeld-Quandt test, but the scatter
plot of residuals does not show any pattern of heteroscedasticity, which suggests
that it almost satisfies the assumption. What should one do in this case?
For the homoscedasticity assumption, one can rely on the visual
check provided by the residuals vs fitted values scatter plot.
If the p-value of the Goldfeld-Quandt test is less than 0.05
but the scatter plot shows no clear pattern, one can conclude that
the assumption is satisfied with reference to the residuals vs
fitted values plot.
However, it is good practice to try and ensure that the statistical test
results match the visual test results. To do so, one can add more
variables to try and get the Goldfeld-Quandt test to give a p-value >
0.05. One can also experiment with different transformation methods
and/or feature engineering to obtain a p-value > 0.05. However,
these steps are not mandatory for the scope of the project if the
scatter plot does not show any clear pattern.
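A short sketch of running the test with statsmodels, assuming olsres is the fitted OLS results object (hypothetical name):

import statsmodels.stats.api as sms

# Goldfeld-Quandt test on the residuals; the null hypothesis is homoscedasticity,
# so a p-value > 0.05 means the assumption is not rejected
fval, pval, ordering = sms.het_goldfeldquandt(olsres.resid, olsres.model.exog)
print(pval)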
11. I am getting this error while creating the dummy variables:
TypeError: unhashable type: 'Series'
How to resolve it?
The message “TypeError: unhashable type” appears in a Python
program when one tries to use a datatype that is not hashable in a
portion of the code that requires hashable data. Creating dummies
requires the values in the column to be hashable.
One of the ways the data can become unhashable is during missing
value imputation. If one is imputing missing values using a central
tendency value (like the median) of a column and misses adding () at
the end of the function, i.e.,
df['column1'] = df['column1'].fillna(df['column1'].median)
instead of
df['column1'] = df['column1'].fillna(df['column1'].median())
the missing values are replaced with the method object itself instead of
the central tendency value (like the median). The column's dtype then
becomes object, and its values are no longer hashable.
So, it is important to ensure that the function for the central tendency
value is properly called (with parentheses) while imputing the missing
values. A quick check is sketched below.
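A quick check before creating dummies, assuming the DataFrame is named df and 'column1' is the imputed column (as in the snippet above):

# after imputation the dtype should still be numeric (or string/categorical),
# not object entries holding a method
print(df['column1'].dtype)
print(df['column1'].apply(type).value_counts())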
12. I am getting this error while building the model:
ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data).
How to resolve it?
This kind of error generally occurs when one attempts to fit a
regression model in Python before converting the categorical
variables to dummy variables.
All categorical variables must be converted to dummy variables. One
can use the pandas.get_dummies() function to convert the categorical
variables into dummy (numerical) variables, as sketched below.
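A hedged sketch of the conversion, with illustrative column names:

import pandas as pd

# one-hot encode the categorical columns; drop_first=True avoids the dummy variable trap
X = pd.get_dummies(X, columns=['brand_name', 'os', '4g'], drop_first=True)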
13. I am getting a Future Warning in my code. How to resolve it?
FutureWarnings notify users about behavior that will change in a future
package/library update. They can be ignored in the current runtime, as
they have no effect on the code execution.
One can check the relevant library's documentation to identify changes
that need to be made to the code to avoid such warnings.
One can also suppress the warning if needed (though it is not a
preferred approach). To suppress the warning, one can use the code
below at the start of the Python notebook:
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)