Guided Project: FAQ
1. How should one approach the ReCell project?
Before starting the project, please read the problem statement carefully and go
through the criteria and descriptions mentioned in the rubric.
Once you understand the task, download the dataset and import it into
a Google Colab Notebook to get started with the project.
To work on the project, you should start with data preprocessing and EDA using
descriptive statistics and visualizations.
Once the EDA is completed and the data is preprocessed, you can use the data
to build a model, check its performance, and verify whether it satisfies the
necessary assumptions.
It is important to close the analysis with key findings and recommendations to
the business.
2. Since we have missing values in the dataset, what is the best way to handle or
treat those missing values?
The strategy to deal with missing values varies with the problem at
hand, the data provided, and other factors. Some of the common
strategies are listed below; a minimal pandas sketch follows the list.
- Drop the missing values
- Impute the missing values
  - Using central tendency measures (mean, median, mode) of a column
    - With mean: Missing values are imputed with the mean of the column. Preferred for continuous data with no outliers
    - With median: Missing values are imputed with the median of the column. Preferred for continuous data with outliers
    - With mode: Missing values are imputed with the mode of the column. Preferred for categorical data
  - Using central tendency measures (mean, median, mode) of a column grouped by categories of a categorical column: Preferred for cases where the data under similar categories of a categorical column are likely to have similar properties
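A minimal pandas sketch of these strategies, assuming the DataFrame is named df; the column names 'ram' and 'brand_name' are only illustrative.

# drop rows that contain any missing value
df_dropped = df.dropna()

# impute a numeric column with its median (use .mean() for the mean)
df['ram'] = df['ram'].fillna(df['ram'].median())

# impute a categorical column with its mode
df['brand_name'] = df['brand_name'].fillna(df['brand_name'].mode()[0])

# impute a numeric column with the median of its group,
# grouped by the categories of a categorical column
df['ram'] = df.groupby('brand_name')['ram'].transform(lambda x: x.fillna(x.median()))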
3. In what order should one do EDA and Data Preprocessing?
Whenever one extracts or gets some data, the first step generally is
to explore the data (check the distribution, summary statistics,
interactions between variables). More often than not, the data will be
in a state that would need some amount of preprocessing before any
exploration can be performed. As such, the step of data
preprocessing is both preceded and followed by some amount of data
exploration.
The initial exploration of the data helps in identifying the kind of
preprocessing needed for the data. For example, if the data has
missing values, the data distribution will help you decide on the
strategy to use to treat the missing values. Once you have treated
the missing values, you would want to check the distribution before
going ahead with modeling.
The exact steps to be taken for preprocessing, the kinds of analysis
to perform, and the order of their execution will depend on the data
and problem at hand.
4. X=sm.add_constant(X) is not working for my project as a new column is not
created. Why is this happening? How can this be resolved?
add_constant() does not add a constant column to the data if a
constant column already exists in it. Please check if the data has a
constant column before using add_constant().
Since the independent variables have variability initially and ideally none of
them should become constant, the step at which the variable(s) became constant
has to be identified. The outlier treatment step is a good place to start;
a short check is sketched below.
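A quick check for constant columns, sketched under the assumption that the predictors are stored in a DataFrame named X:

# number of unique values per column; any column with a single unique value is constant
n_unique = X.nunique()
print(n_unique[n_unique == 1])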
5. What should one do if the p-values are high (> 0.05) for some dummies of a
categorical variable but not for the others?
The dummy variables with p-value > 0.05 should be dropped one by
one until there are no such variables. After removing each high p-
value variable, the regression should be run again, and the p-values
of all the variables should be checked.
If all the dummy variables of a categorical column have a p-value >
0.05, then all the dummy variables for that column can be dropped at
once.
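One possible way to automate this process is sketched below, assuming X_train and y_train are the training predictors (with the constant column already added) and the target:

import statsmodels.api as sm

# refit after dropping the predictor with the highest p-value until all p-values are <= 0.05
cols = list(X_train.columns)
while True:
    model = sm.OLS(y_train, X_train[cols]).fit()
    pvalues = model.pvalues.drop('const', errors='ignore')
    if pvalues.max() <= 0.05:
        break
    cols.remove(pvalues.idxmax())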
6. What should one do if the VIF is high (> 5) for some dummies of a
categorical variable but not for the others?
The VIF values for dummy variables can be ignored.
If, however, the VIF value is inf or NaN, then one should check if one
of the dummy variables was dropped during one-hot encoding. If the
VIF value is still inf or NaN, a different dummy variable than the one
dropped by using drop_first=True should be dropped and VIF values
should be checked again.
For example, if a categorical variable 'Season' has four levels 'Spring',
'Summer', 'Fall' and 'Winter', and using drop_first=True drops the
dummy variable for 'Fall', then one can keep the dummy variable for
'Fall' and drop the dummy variable for 'Summer', and then check the
VIF values.
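A short sketch of checking the VIF values with statsmodels, assuming the predictors (with the constant already added) are in a DataFrame named X:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of every column in X; high VIFs for dummy variables can usually be ignored
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)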
7. Do we need to treat the outliers?
It is not mandatory to treat the outliers in the data. Based on the
EDA performed, one can determine whether the outlier values are
legitimate, decide if outlier treatment is needed, and identify
which columns should be treated. Some ways to treat
outliers are the following:
1. Cap the values by the IQR method
2. Drop the outliers
It is important to provide a proper explanation for the chosen
approach in the submission.
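A minimal sketch of capping with the IQR method, assuming the DataFrame is named df; the column name 'ram' is only illustrative:

# whiskers based on the interquartile range
Q1 = df['ram'].quantile(0.25)
Q3 = df['ram'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# cap values outside the whiskers instead of dropping the rows
df['ram'] = df['ram'].clip(lower=lower, upper=upper)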
8. I am getting this error while building the model:
MissingDataError: exog contains inf or nans
How to resolve it?
This error occurs due to the presence of missing value(s) in the
data passed to the OLS model. The presence of missing values can
be checked using the following code (where df is the data being passed):
df.isnull().sum()
If missing values are present, then one can use appropriate methods
to treat all missing values before feeding the data to the model.
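A short check that can be run before fitting, assuming X and y are the objects passed to the OLS model (sketch only):

import numpy as np

# missing values per column in the predictors and in the target
print(X.isnull().sum())
print(y.isnull().sum())

# the error also mentions inf, so check for infinite values in the numeric columns
print(np.isinf(X.select_dtypes(include=np.number)).sum())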
9. Why am I getting the MAPE as NaN?
NaN values for MAPE may occur due to a minor mistake while
defining the dependent and independent variables. When the
dependent variable is defined as a 2D array, as shown below:
# dependent variable
y = df[["target"]]
the function to compute MAPE gets a 2D array (actual targets) and a
1D array (predicted targets) as input. The resulting shape mismatch
produces a NaN value for MAPE.
In order to rectify this error, one should define the target variable as
a 1D array as follows:
# dependent variable
y = df["target"]
10. I am getting a p-value < 0.05 from the Goldfeld-Quandt test, but the scatter
plot of residuals does not show any pattern of heteroscedasticity, which suggests
that it almost satisfies the assumption. What should one do in this case?
For the homoscedasticity assumption, one can rely on the visual
check provided by the residuals vs fitted values scatter plot.
If the p-value of the Goldfeld-Quandt test is less than 0.05
but the scatter plot shows no clear pattern, one can conclude that
the assumption is satisfied with reference to the residuals vs
fitted values plot.
However, it is good practice to try and ensure that the statistical test
results match the visual test results. To do so, one can add more
variables to try and get the Goldfeld-Quandt test to give a p-value >
0.05. One can also experiment with different transformation methods
and/or feature engineering to obtain a p-value > 0.05. However,
these steps are not mandatory for the scope of the project if the
scatter plot does not show any clear pattern.
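A short sketch of running the test with statsmodels, assuming olsres is the fitted OLS results object (hypothetical name):

import statsmodels.stats.api as sms

# Goldfeld-Quandt test on the residuals; the null hypothesis is homoscedasticity,
# so a p-value > 0.05 means the assumption is not rejected
fval, pval, ordering = sms.het_goldfeldquandt(olsres.resid, olsres.model.exog)
print(pval)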
11. I am getting this error while creating the dummy variables:
TypeError: unhashable type: 'Series'
How to resolve it?
The message “TypeError: unhashable type” appears in a Python
program when one tries to use a datatype that is not hashable in a
portion of the code that requires hashable data. Creating dummies
requires the values in the column to be hashable.
One of the ways the data can become unhashable is during missing
value imputation. If one is imputing missing values using a central
tendency value (like the median) of a column and misses adding () at
the end of the function, i.e.,
df['column1'] = df['column1'].fillna(df['column1'].median)
instead of
df['column1'] = df['column1'].fillna(df['column1'].median())
the missing values are replaced with the method object itself instead of
the central tendency value (like the median). The column's dtype then
becomes object, and its values are no longer hashable.
So, it is important to ensure that the function for the central tendency
value is properly called (with parentheses) while imputing the missing
values. A quick check is sketched below.
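A quick check before creating dummies, assuming the DataFrame is named df and 'column1' is the imputed column (as in the snippet above):

# after imputation the dtype should still be numeric (or string/categorical),
# not object entries holding a method
print(df['column1'].dtype)
print(df['column1'].apply(type).value_counts())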
12. I am getting this error while building the model:
ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data).
How to resolve it?
This kind of error generally occurs when one attempts to fit a
regression model in Python before converting the categorical
variables to dummy variables.
All categorical variables must be converted to dummy variables. One
can use the pandas.get_dummies() function to convert the categorical
variables into dummy (numerical) variables, as sketched below.
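A hedged sketch of the conversion, with illustrative column names:

import pandas as pd

# one-hot encode the categorical columns; drop_first=True avoids the dummy variable trap
X = pd.get_dummies(X, columns=['brand_name', 'os', '4g'], drop_first=True)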
13. I am getting a Future Warning in my code. How to resolve it?
FutureWarnings notify users about behavior that will change in a future
package/library update. They can be ignored in the current runtime, as
they have no effect on the code execution.
One can check the relevant library's documentation to identify changes
that need to be made to the code to avoid such warnings.
One can also suppress the warning if needed (though it is not a
preferred approach). To suppress the warning, one can use the code
below at the start of the Python notebook:
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning)