Treatment of missing values in AutoDist #125

Closed
gverbock opened this issue Apr 1, 2021 · 3 comments · Fixed by #126
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@gverbock
Contributor

gverbock commented Apr 1, 2021

Issue Description

AutoDist is not able to compute the PSI (SimpleBucketer) when the (train) dataframe contains missing values.

Removing missing values before running AutoDist (df.dropna()) may have a large impact on the results, as it removes every row that contains at least one missing value in any column.

A solution could be to call dropna() inside the loop where the PSI is computed for each series, so that only the missing values of that series are dropped.
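To make the difference concrete, here is a minimal sketch using only pandas. The toy dataframe and the variable names are made up for illustration; nothing here is the library's own code.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): each column has exactly one missing value,
# but in a different row.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# Dropping NaNs on the whole dataframe removes any row with a missing value
# in ANY column, so only the last row survives.
rows_left_global = len(df.dropna())

# Dropping NaNs per series keeps every non-missing value of that column.
rows_left_per_column = {col: len(df[col].dropna()) for col in df.columns}

print(rows_left_global)      # 1
print(rows_left_per_column)  # {'a': 3, 'b': 3, 'c': 3}
```

With a global dropna() each feature is compared on a quarter of the data; with a per-series dropna() each feature keeps three of its four values.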

@Matgrb
Contributor

Matgrb commented Apr 1, 2021

Good idea to do that.

We need to implement a check in AutoDist. I think it would be best to start with AutoDist: for each feature, check whether there are any NaNs before applying the statistical tests and, if so, remove the rows with NaNs for that feature before applying them. A warning should be printed for each feature where this is done. This also needs a unit test.
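The per-feature check could look roughly like the sketch below. It is an assumption-laden illustration, not the library's implementation: `psi` and `psi_per_feature` are hypothetical helpers, the PSI uses simple equal-width bins as a stand-in for the real bucketers, and it uses `warnings.warn` rather than a plain print.

```python
import warnings

import numpy as np
import pandas as pd


def psi(expected, actual, n_bins=10):
    """Population Stability Index over equal-width bins.

    Hypothetical stand-in for the library's bucketer-based statistic.
    """
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=n_bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    eps = 1e-6  # floor the bin fractions so empty bins do not produce log(0)
    e_frac = np.clip(e_counts / e_counts.sum(), eps, None)
    a_frac = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


def psi_per_feature(df1, df2, n_bins=10):
    """Compute the PSI per column, dropping NaNs for that column only."""
    results = {}
    for col in df1.columns:
        s1, s2 = df1[col], df2[col]
        if s1.isna().any() or s2.isna().any():
            # Warn once per affected feature, then drop only its missing values.
            warnings.warn(
                f"Feature '{col}' contains NaNs; dropping them before computing PSI."
            )
            s1, s2 = s1.dropna(), s2.dropna()
        results[col] = psi(s1.to_numpy(), s2.to_numpy(), n_bins)
    return results
```

A unit test for this behaviour could pass a dataframe with a NaN in a single column and assert that exactly one warning is emitted and that every returned statistic is finite. A column that is entirely NaN is not handled here and would need its own decision.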

DistributionStatistic on its own will still fail if there are NaNs in the data, but I think that is okay. AutoDist is supposed to handle everything for the user, whereas DistributionStatistic would raise an error in that case and leave it to the user to decide what to do.

@Matgrb Matgrb added good first issue Good for newcomers enhancement New feature or request labels Apr 1, 2021
@gverbock
Contributor Author

I will pick this one up.

@Matgrb
Contributor

Matgrb commented Apr 12, 2021

Thanks! Let me know if you have any questions or need a review.

@Matgrb Matgrb linked a pull request Apr 14, 2021 that will close this issue