@vitorsrg commented Mar 15, 2020

Reference Issues/PRs

Fixes #16426
Fixes #16977

What does this implement/fix? Explain your changes.

  • add keep_missing_features to SimpleImputer
  • add keep_missing_features to IterativeImputer
  • add keep_missing_features to KNNImputer
  • add keep_missing_features shape test
  • add keep_missing_features to regular tests

@vitorsrg
Author

test script

# Importing enable_iterative_imputer makes the experimental IterativeImputer available.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
import sklearn.impute as skimpute
import numpy as np

imputers = [
    skimpute.SimpleImputer(strategy='mean'),
    skimpute.SimpleImputer(strategy='median'),
    skimpute.SimpleImputer(strategy='most_frequent'),
    skimpute.SimpleImputer(strategy='constant'),
    skimpute.KNNImputer(),
    skimpute.IterativeImputer(),
]

# The middle column has no observed values.
x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

# fit_transform: report every imputer whose output shape differs from the input.
for imp in imputers:
    y = imp.fit_transform(x)
    print(y)
    if x.shape != y.shape:
        print(imp)

# transform on the already fitted imputers: same check.
for imp in imputers:
    y = imp.transform(x)
    print(y)
    if x.shape != y.shape:
        print(imp)

@jnothman
Member

We should aim to maintain backwards compatibility, making this an option for the user.

@vitorsrg
Author

So this behaviour might be triggered by a parameter? What do you think about keep_missing_features?

@jnothman
Member

jnothman commented Mar 16, 2020 via email

@vitorsrg
Author

test script

# Importing enable_iterative_imputer makes the experimental IterativeImputer available.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
import sklearn.impute as skimpute
import numpy as np

# The middle column has no observed values.
x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

# Default behaviour: the all-NaN column is dropped, so report any imputer
# whose output unexpectedly keeps the input shape.
# strategy='constant' is left out because it never drops features.
print('default')
imputers = [
    skimpute.SimpleImputer(strategy='mean'),
    skimpute.SimpleImputer(strategy='median'),
    skimpute.SimpleImputer(strategy='most_frequent'),
    skimpute.KNNImputer(),
    skimpute.IterativeImputer(),
]

for imp in imputers:
    y1 = imp.fit_transform(x)
    y2 = imp.transform(x)
    if x.shape == y1.shape or x.shape == y2.shape:
        print(imp, x, y1, y2, sep='\n')

# With keep_missing_features=True (the parameter added in this PR), report any
# imputer whose output shape differs from the input.
print('keep')
imputers = [
    skimpute.SimpleImputer(strategy='mean', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='median', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='most_frequent', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='constant', keep_missing_features=True),
    skimpute.KNNImputer(keep_missing_features=True),
    skimpute.IterativeImputer(keep_missing_features=True),
]

for imp in imputers:
    y1 = imp.fit_transform(x)
    y2 = imp.transform(x)
    if x.shape != y1.shape or x.shape != y2.shape:
        print(imp, x, y1, y2, sep='\n')

Member

@jnothman left a comment

Thank you. Please add tests to sklearn/impute/tests/ so that they remain part of our test suite.

@rth
Member

rth commented Apr 21, 2020

Thanks @vitorsrg! As far as I understand, this will keep features that are entirely NaN unchanged, right? My assumption about imputers is that their output should contain no NaN. In its current form, with keep_missing_features=True one can still get NaN in the output, and would have to chain the imputer with another SimpleImputer(strategy="constant") just to be sure computations don't break downstream in the pipeline.

How about instead introducing fill_missing_features=False that would fill missing features with some values (e.g. likely 0)? Just asking for feedback for now, not asking to make these changes.
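
To make the two options concrete, here is a minimal NumPy sketch of column-mean imputation. It is not the scikit-learn implementation: the function name `impute_mean` and its `keep_empty` and `fill_value` arguments are hypothetical, and only illustrate dropping an all-NaN column versus filling it with a placeholder such as 0.

```python
import numpy as np


def impute_mean(X, keep_empty=False, fill_value=0.0):
    """Hypothetical helper: mean-impute columns, handling all-NaN columns."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    empty = missing.all(axis=0)  # columns with no observed value at all

    # Mean over observed entries only; empty columns fall back to fill_value.
    counts = (~missing).sum(axis=0)
    sums = np.where(missing, 0.0, X).sum(axis=0)
    stats = np.where(empty, fill_value, sums / np.maximum(counts, 1))

    X_out = np.where(missing, stats, X)
    # Default mirrors the existing behaviour: the empty column is dropped.
    return X_out if keep_empty else X_out[:, ~empty]


X = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
X_drop = impute_mean(X)
X_keep = impute_mean(X, keep_empty=True)
print(X_drop)
print(X_keep)
```

With filling, the output keeps the input shape and contains no NaN, so no extra constant imputer would be needed downstream.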

@jnothman
Member

> How about instead introducing fill_missing_features=False that would fill missing features with some values (e.g. likely 0)? Just asking for feedback for now, not asking to make these changes.

I'm happy with that.

I think it would be good to have in this PR if @vitorsrg can do it.

Member

@jnothman left a comment

Thanks @vitorsrg

X_trans = imputer.fit_transform(X)

if strategy == "constant":
    assert X.shape == X_trans.shape
Member

This is a strange inconsistency. I think we should follow keep_missing_features in the constant case too and drop the feature if keep_missing_features=False.

Author

SimpleImputer's constant strategy never dropped the features; and I don't think it should, because we can use it after the other imputers to prevent NaNs.

So the problem is that IterativeImputer(initial_strategy='constant', keep_missing_features=False) drops the features, whilst its underlying SimpleImputer doesn't.

Member

I'd personally rather consistency across strategy. However, I'm happy to leave 'constant' as exceptional for now, but it should be documented.

I don't think we should be using it after other imputers to prevent NaNs. I think a guaranteed invariant for imputers should be that after transform there are no more missing values. Hence my support for @rth's fill_missing_features idea.

Author

Should there be a fill_missing_features_value parameter, then?


imputer = Imputer(**{arg: strategy, 'keep_missing_features': True})
assert X.shape == imputer.fit_transform(X).shape

Member

Please also check the values: that the all-missing column is in the right place, and that all other values are identical to the X_trans produced with keep_missing_features=False.

Member

This comment also applies to the common test

def test_imputation_keep_missing_features(imputer):
    X = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

    imputer = imputer.__class__(keep_missing_features=True)
Member

more conventional would be: imputer.set_params(keep_missing_features=True)

Author

From what I checked, set_params is an instance method.
I would like to avoid problems from reusing imputer objects that have already learned their own statistics_ and other fitted attributes.
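
One way to reconcile both points might be `sklearn.base.clone`, which returns a new unfitted estimator with the same parameters, so `set_params` can then be called without reusing learned state. A minimal sketch, where the `strategy` change is only a stand-in for whichever parameter the test needs to set:

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [3.0, np.nan]])
fitted = SimpleImputer(strategy="mean").fit(X)

# clone() copies constructor parameters only; fitted attributes such as
# statistics_ are not carried over to the new object.
fresh = clone(fitted).set_params(strategy="median")

print(hasattr(fitted, "statistics_"))  # the original keeps its fitted state
print(hasattr(fresh, "statistics_"))   # the clone has not been fitted
```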

[(SimpleImputer, 'strategy'), (IterativeImputer, 'initial_strategy')])
@pytest.mark.parametrize(
    "strategy",
    ['mean', 'median', 'most_frequent', "constant"])
Member

Perhaps we should instead include (some of) these different strategies in IMPUTERS in test_common.py??

Author

I'm sorry, but I couldn't understand your suggestion.

the missing indicator even if there are missing values at
transform/test time.
keep_missing_features : boolean, default=False
Contributor

This name is pretty strange?

Author

I think it's too long, but I couldn't find something better

Contributor

Maybe keep_empty_features?

Author

I consider missing a more consistent terminology, but it is not a strong opinion

Contributor

I see "missing" as being the property of a cell, but what do we call a column in which all cells/values are "missing"? I think calling it "missing" again might be confusing. But also no strong preference here.

Author

Now, the attribute indicator_ already addresses this

@mitar
Contributor

mitar commented Apr 24, 2020

I would suggest adding an is_empty_ attribute that contains the list of input column indices detected as having all NaN values.
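
As a rough illustration of what such an attribute could hold, the indices of all-NaN input columns can be computed with NumPy alone. `is_empty_` is the name proposed above, not an existing scikit-learn attribute, and `empty_column_indices` is a hypothetical helper.

```python
import numpy as np


def empty_column_indices(X):
    """Return indices of columns whose values are all NaN (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return np.flatnonzero(np.isnan(X).all(axis=0))


X = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
is_empty = empty_column_indices(X)
print(is_empty)  # → [1]
```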

@jnothman
Member

@vitorsrg will you be able to continue working on this?

@mitar, hopefully we'll implement better solutions for passing around feature names etc soon. I have previously proposed transformers could define a matrix indicating the dependencies between input and output features... but I've never really promoted it loudly. Here's one version of the idea: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep003/proposal.html

@mitar
Contributor

mitar commented Apr 27, 2020

The problem with passing around feature names is that it does not work when there are duplicate feature names (which pandas supports), and you can end up with those quickly if you concat existing dataframes. I would prefer column indices in an attribute.

@vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from fe0bd3d to 64e36a7 Compare April 30, 2020 10:15
@vitorsrg marked this pull request as draft April 30, 2020 11:28
@vitorsrg
Author

vitorsrg commented May 2, 2020

@rth I think we should chain a "constant" imputer instead of adding a filler param.

@vitorsrg
Author

vitorsrg commented May 2, 2020

What is the expected behaviour for the missing features when transforming inplace?

@vitorsrg marked this pull request as ready for review May 5, 2020 09:26
@vitorsrg requested a review from jnothman May 5, 2020 09:26
@jnothman
Member

> What is the expected behaviour for the missing features when transforming inplace?

What does it currently do?

You currently appear to have a test failure.

@cmarmo
Contributor

cmarmo commented Sep 25, 2020

Hi @vitorsrg, thanks for your work so far. Are you still interested in working on this? If so, do you mind fixing conflicts? Thanks.

@vitorsrg
Author

Yes, I'll be working on this again by next week

Base automatically changed from master to main January 22, 2021 10:52
@charlottesmith0308

Is this feature still being worked on?

@vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 116adac to 092ea4e Compare March 3, 2021 09:13
vitorsrg added a commit to vitorsrg/scikit-learn that referenced this pull request Mar 3, 2021
@vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 092ea4e to 9fb1cbb Compare March 3, 2021 09:34
@vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 9fb1cbb to 0fe4913 Compare March 3, 2021 10:26
@vitorsrg
Author

vitorsrg commented Mar 4, 2021

FYI I've finished what I planned to do, so I'll just wait for the reviews now

@cmarmo
Contributor

cmarmo commented Mar 25, 2021

This pull request also fixes #16977. Perhaps @rth could find some time to have a look?

@cmarmo added this to the 1.0 milestone Mar 25, 2021
Contributor

@cmarmo left a comment

@vitorsrg, while waiting for a core-dev review, here are some suggestions to comply with the documentation guidelines.
If you could find some time to fix the conflicts, that would also make the review easier. Thanks for your work and your patience!

Co-authored-by: Chiara Marmo
jawadjawid added a commit to jawadjawid/scikit-learn that referenced this pull request Mar 27, 2021
jawadjawid added a commit to jawadjawid/scikit-learn that referenced this pull request Mar 27, 2021
@adrinjalali
Member

@amueller related to imputation, in case you're interested in giving it a review. Removing this from the milestone.

@adrinjalali removed this from the 1.0 milestone Aug 20, 2021
@jeremiedbb added this to the 1.2 milestone Sep 24, 2022
@jjerphan
Member

Hi @vitorsrg,

We would like to have this PR part of the 1.2 release.
Do you still want, and have time, to pursue this PR, or would you allow maintainers to finalise it?

Thanks!

@glemaitre changed the title from "Keep missing features during imputation" to "ENH keep missing features during imputation" Oct 27, 2022
@glemaitre self-assigned this Oct 27, 2022
@glemaitre
Member

Closing in favor of #24770

@glemaitre closed this Nov 16, 2022
jjerphan added a commit that referenced this pull request Nov 17, 2022
Co-authored-by: Chiara Marmo
Co-authored-by: Julien Jerphanion
Co-authored-by: Jérémie du Boisberranger
Co-authored-by: Vitor SRG
Fixes #16695
Fixes #16426
Fixes #16977

Labels

Enhancement, module:impute, Superseded (PR has been replaced by a newer PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IterativeImputer automatically removes variables with all missing values
Missing features removal with SimpleImputer

10 participants