-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
ENH keep missing features during imputation #16695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH keep missing features during imputation #16695
Conversation
|
test script from sklearn.experimental import enable_iterative_imputer
import sklearn.impute as skimpute
import numpy as np
import scipy as sp
imputers = [
skimpute.SimpleImputer(strategy='mean'),
skimpute.SimpleImputer(strategy='median'),
skimpute.SimpleImputer(strategy='most_frequent'),
skimpute.SimpleImputer(strategy='constant'),
skimpute.KNNImputer(),
skimpute.IterativeImputer(),
]
x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
for imp in imputers:
y = imp.fit_transform(x)
print(y)
if x.shape != y.shape:
print(imp)
for imp in imputers:
y = imp.transform(x)
print(y)
if x.shape != y.shape:
print(imp) |
|
We should aim to maintain backwards compatibility, making this an option for the user. |
|
So this behaviour might be triggered by a parameter? What do you think about |
|
Sounds like a good name for it...
|
|
test script from sklearn.experimental import enable_iterative_imputer
import sklearn.impute as skimpute
import numpy as np
import scipy as sp
print('default')
imputers = [
skimpute.SimpleImputer(strategy='mean'),
skimpute.SimpleImputer(strategy='median'),
skimpute.SimpleImputer(strategy='most_frequent'),
skimpute.KNNImputer(),
skimpute.IterativeImputer(),
]
x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
for imp in imputers:
y1 = imp.fit_transform(x)
y2 = imp.transform(x)
if x.shape == y1.shape or x.shape == y2.shape:
print(imp, x, y1, y2, sep='\n')
print('keep')
imputers = [
skimpute.SimpleImputer(strategy='mean', keep_missing_features=True),
skimpute.SimpleImputer(strategy='median', keep_missing_features=True),
skimpute.SimpleImputer(strategy='most_frequent', keep_missing_features=True),
skimpute.SimpleImputer(strategy='constant', keep_missing_features=True),
skimpute.KNNImputer(keep_missing_features=True),
skimpute.IterativeImputer(keep_missing_features=True),
]
x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
for imp in imputers:
y1 = imp.fit_transform(x)
y2 = imp.transform(x)
if x.shape != y1.shape or x.shape != y2.shape:
print(imp, x, y1, y2, sep='\n') |
jnothman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Please add tests to sklearn/impute/tests/ so that they remain part of our test suite.
|
Thanks @vitorsrg ! As far as I understand, this will keep features with nan unchanged, right? My assumption of imputers is that the output should contain no NaN. In the current form even with How about instead introducing |
I'm happy with that. I think it would be good to have in this PR if @vitorsrg can do it. |
jnothman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vitorsrg
sklearn/impute/tests/test_impute.py
Outdated
| X_trans = imputer.fit_transform(X) | ||
|
|
||
| if strategy == "constant": | ||
| assert X.shape == X_trans.shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a strange inconsistency. I think we should follow keep_missing_features in the constant case too and drop the feature if keep_missing_features=False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleImputer's constant strategy never dropped the features; and I don't think it should, because we can use it after the other imputers to prevent NaNs.
So the problem is that IterativeImputer(strategy='constant', 'keep_missing_features'=False) drops the features, whilst its SimpleImputer doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd personally rather consistency across strategy. However, I'm happy to leave 'constant' as exceptional for now, but it should be documented.
I don't think we should be using it after other imputers to prevent NaNs. I think a guaranteed invariance for imputers should be that after transform there are no more missing values. Hence my support for @rth's fill_missing_features idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a fill_missing_features_value parameter, then?
|
|
||
| imputer = Imputer(**{arg: strategy, 'keep_missing_features': True}) | ||
| assert X.shape == imputer.fit_transform(X).shape | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also check the values: that missing valued column is in the right place, and that all other values are identical to the X_trans produced with keep_missing_features=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment also applies to the common test
sklearn/impute/tests/test_common.py
Outdated
| def test_imputation_keep_missing_features(imputer): | ||
| X = np.array([[1, np.nan, 2], [3, np.nan, np.nan]]) | ||
|
|
||
| imputer = imputer.__class__(keep_missing_features=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more conventional would be: imputer.set_params(keep_missing_features=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from what I checked, set_params is an instance method
I would like to prevent problems with recycling imputer objects that learned their own statistics_ and other inner params
sklearn/impute/tests/test_impute.py
Outdated
| [(SimpleImputer, 'strategy'), (IterativeImputer, 'initial_strategy')]) | ||
| @pytest.mark.parametrize( | ||
| "strategy", | ||
| ['mean', 'median', 'most_frequent', "constant"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should instead include (some of) these different strategies in IMPUTERS in test_common.py??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but i couldn't understand your suggestion
sklearn/impute/_iterative.py
Outdated
| the missing indicator even if there are missing values at | ||
| transform/test time. | ||
| keep_missing_features : boolean, default=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is pretty strange?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's too long, but I couldn't find something better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe keep_empty_features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider missing a more consistent terminology, but it is not a strong opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see "missing" as being the property of a cell, but how we call a column with all "missing" cells/values? I think naming it again "missing" might be confusing. But also no strong preference here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, the attribute indicator_ already addresses this
|
I would suggest that |
|
@vitorsrg will you be able to continue working on this? @mitar, hopefully we'll implement better solutions for passing around feature names etc soon. I have previously proposed transformers could define a matrix indicating the dependencies between input and output features... but I've never really promoted it loudly. Here's one version of the idea: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep003/proposal.html |
|
The problem with passing around feature names is that it does not work when there are duplicate feature names (what Pandas supports) which you can get quickly if you concat existing dataframes. I would prefer column indices in an attribute. |
fe0bd3d to
64e36a7
Compare
|
@rth I think we should chain an "constant" imputer instead of adding a filler param |
|
Whats is the expected behaviour for the missing features when transforming inplace? |
What does it currently do? You currently appear to have test failure? |
|
Hi @vitorsrg, thanks for your work so far. Are you still interested in working on this? If so, do you mind fixing conflicts? Thanks. |
|
Yes, I'll be working on this again by next week |
|
Is this feature still being worked on? |
116adac to
092ea4e
Compare
092ea4e to
9fb1cbb
Compare
…uter._initial_imputation
9fb1cbb to
0fe4913
Compare
|
FYI I've finished what I planned to do, so I'll just wait for the reviews now |
cmarmo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vitorsrg waiting for a core-dev review, just some suggestions in order to comply with the documentation guidelines.
If you could find some time to fix conflicts, this will also make the review easier. Thanks for your work and your patience!
Co-authored-by: Chiara Marmo <[email protected]>
|
@amueller related to imputation, in case you're interested in giving it a review. Removing this from the milestone. |
|
Hi @vitorsrg, We would like to have this PR part of the 1.2 release. Thanks! |
|
Closing in favor of #24770 |
Co-authored-by: Chiara Marmo <[email protected]> Co-authored-by: Julien Jerphanion <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: Vitor SRG <[email protected]> Fixes #16695 Fixes #16426 Fixes #16977
Reference Issues/PRs
Fixes #16426
Fixes #16977
What does this implement/fix? Explain your changes.
keep_missing_featurestoSimpleImputerkeep_missing_featurestoIterativeImputerkeep_missing_featurestoKNNImputerkeep_missing_featuresshape testkeep_missing_featuresto regular tests