Treatment of missing values in AutoDist #125

gverbock · 2021-04-01T13:48:05Z

Issue Description

AutoDist is not able to compute the PSI (SimpleBucketer) when the (train) dataframe contains missing values.

Removing missing values before running AutoDist (df.dropna()) may have a large impact on the results, as it would remove every row containing at least one missing value.

A solution could be to include the dropna() in the loop where the PSI is computed for each series.
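
To illustrate the difference between the two approaches, here is a minimal sketch in plain pandas. The column names and data are made up for illustration, and no AutoDist code is involved.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three of the four rows are missing exactly one value.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, 2.0, 3.0, 4.0],
        "c": [1.0, 2.0, np.nan, 4.0],
    }
)

# Dropping NaNs on the whole dataframe keeps only fully populated rows.
rows_after_global_drop = len(df.dropna())

# Dropping NaNs per series keeps every non-missing value of each feature.
rows_after_per_column_drop = {col: len(df[col].dropna()) for col in df.columns}

print(rows_after_global_drop)
print(rows_after_per_column_drop)
```

With this data the global drop leaves a single row, while each individual column still has three usable values.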

Matgrb · 2021-04-01T14:17:53Z

Good idea to do that.

We need to implement a check in AutoDist. I think it would be best to start with AutoDist and, for each feature, check whether there are any NaNs before applying the statistical tests. If there are, remove the rows with NaNs for that feature before applying the tests. A warning should be printed for each feature where this is done. A unit test is also needed for this.

DistributionStatistic on its own will still fail if there are NaNs in the data, but I think that is okay. AutoDist is supposed to handle everything for the user, whereas DistributionStatistic would raise an error in that case and ask the user to decide what to do.
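
A rough sketch of the per-feature check described above, written with pandas only. The helper `drop_nans_per_feature` and the placeholder statistic `mean_difference` are hypothetical names introduced for illustration; they are not the AutoDist or DistributionStatistic API, and the real implementation may differ.

```python
import warnings

import numpy as np
import pandas as pd


def drop_nans_per_feature(df1, df2, feature):
    """Return both series for `feature` with NaNs removed, warning if any were dropped.

    Hypothetical helper, not part of any library.
    """
    s1, s2 = df1[feature], df2[feature]
    n_missing = int(s1.isna().sum() + s2.isna().sum())
    if n_missing > 0:
        warnings.warn(
            f"Feature '{feature}': removed {n_missing} missing value(s) before testing."
        )
    return s1.dropna(), s2.dropna()


def mean_difference(s1, s2):
    # Placeholder statistic standing in for a real test such as PSI.
    return float(s1.mean() - s2.mean())


train = pd.DataFrame({"x": [1.0, 2.0, np.nan], "y": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [2.0, 4.0, 6.0], "y": [np.nan, 2.0, 4.0]})

results = {}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # NaNs are handled one feature at a time, so a gap in "x" does not cost rows in "y".
    for feature in train.columns:
        s1, s2 = drop_nans_per_feature(train, test, feature)
        results[feature] = mean_difference(s1, s2)

print(results)
print([str(w.message) for w in caught])
```

Recording the warnings as done here is also one way a unit test could assert that a warning is emitted for each affected feature.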

gverbock · 2021-04-12T15:18:10Z

I will pick this one up.

Matgrb · 2021-04-12T18:29:19Z

Thanks! Let me know if you have some questions or you need a review.

Matgrb added the "good first issue" and "enhancement" labels Apr 1, 2021

Matgrb linked a pull request Apr 14, 2021 that will close this issue

Missing values autodist #126

Merged

Matgrb closed this as completed in #126 Apr 15, 2021
