ENH keep missing features during imputation #16695

vitorsrg · 2020-03-15T04:26:05Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

add keep_missing_features to SimpleImputer
add keep_missing_features to IterativeImputer
add keep_missing_features to KNNImputer
add keep_missing_features shape test
add keep_missing_features to regular tests
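
For context, here is a minimal plain-Python sketch of the behaviour this PR targets: mean imputation that drops columns with no observed values by default, and keeps them (still NaN) when keep_missing_features=True. The mean_impute helper is hypothetical and is not scikit-learn's implementation; only the parameter name comes from this PR.

```python
import math


def mean_impute(rows, keep_missing_features=False):
    """Column-wise mean imputation on a list of rows (hypothetical helper)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # None marks a column with no observed value at all.
        means.append(sum(observed) / len(observed) if observed else None)

    out = []
    for r in rows:
        new_row = []
        for j, v in enumerate(r):
            if means[j] is None:
                if keep_missing_features:
                    new_row.append(v)  # column kept, values stay NaN
                continue  # otherwise the column is dropped
            new_row.append(means[j] if math.isnan(v) else v)
        out.append(new_row)
    return out


nan = float("nan")
x = [[1.0, nan, 2.0], [3.0, nan, nan]]
dropped = mean_impute(x)                           # 2 columns remain
kept = mean_impute(x, keep_missing_features=True)  # 3 columns remain
```

With the default, the output has one column fewer than the input, which is the shape mismatch the test scripts below look for.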

vitorsrg · 2020-03-15T05:18:21Z

test script

# This import is required to expose the experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
import sklearn.impute as skimpute
import numpy as np

imputers = [
    skimpute.SimpleImputer(strategy='mean'),
    skimpute.SimpleImputer(strategy='median'),
    skimpute.SimpleImputer(strategy='most_frequent'),
    skimpute.SimpleImputer(strategy='constant'),
    skimpute.KNNImputer(),
    skimpute.IterativeImputer(),
]

x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

# Fit each imputer and print those whose output shape differs from the input.
for imp in imputers:
    y = imp.fit_transform(x)
    print(y)
    if x.shape != y.shape:
        print(imp)

# Repeat with transform alone on the already-fitted imputers.
for imp in imputers:
    y = imp.transform(x)
    print(y)
    if x.shape != y.shape:
        print(imp)

jnothman · 2020-03-15T11:12:46Z

We should aim to maintain backwards compatibility, making this an option for the user.

vitorsrg · 2020-03-15T16:05:25Z

So this behaviour might be triggered by a parameter? What do you think about keep_missing_features?

jnothman · 2020-03-16T12:56:45Z

Sounds like a good name for it...

vitorsrg · 2020-03-22T19:44:55Z

test script

# This import is required to expose the experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
import sklearn.impute as skimpute
import numpy as np

print('default')
imputers = [
    skimpute.SimpleImputer(strategy='mean'),
    skimpute.SimpleImputer(strategy='median'),
    skimpute.SimpleImputer(strategy='most_frequent'),
    skimpute.KNNImputer(),
    skimpute.IterativeImputer(),
]

x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

# Default settings: report any imputer that unexpectedly keeps the input shape.
for imp in imputers:
    y1 = imp.fit_transform(x)
    y2 = imp.transform(x)
    if x.shape == y1.shape or x.shape == y2.shape:
        print(imp, x, y1, y2, sep='\n')

print('keep')
imputers = [
    skimpute.SimpleImputer(strategy='mean', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='median', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='most_frequent', keep_missing_features=True),
    skimpute.SimpleImputer(strategy='constant', keep_missing_features=True),
    skimpute.KNNImputer(keep_missing_features=True),
    skimpute.IterativeImputer(keep_missing_features=True),
]

x = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])

# keep_missing_features=True: report any imputer that still changes the shape.
for imp in imputers:
    y1 = imp.fit_transform(x)
    y2 = imp.transform(x)
    if x.shape != y1.shape or x.shape != y2.shape:
        print(imp, x, y1, y2, sep='\n')

jnothman

Thank you. Please add tests to sklearn/impute/tests/ so that they remain part of our test suite.

rth · 2020-04-21T11:37:12Z

Thanks @vitorsrg! As far as I understand, this will keep features with NaN unchanged, right? My assumption about imputers is that the output should contain no NaN. In the current form, even with keep_missing_features=True one might still get NaN, and so one would have to make a pipeline of the imputer with another SimpleImputer(strategy="constant") just to be sure computations don't break downstream in the pipeline.

How about instead introducing fill_missing_features=False that would fill missing features with some values (e.g. likely 0)? Just asking for feedback for now, not asking to make these changes.
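
To make the suggestion concrete, here is a minimal plain-Python sketch of filling all-missing columns with a constant instead of leaving them as NaN. The fill_missing_features function and its fill_value argument are hypothetical names used only for illustration; this is not code from the PR, and it only handles the all-missing columns (other NaNs would still go through the regular imputation strategy).

```python
import math


def fill_missing_features(rows, fill_value=0.0):
    """Replace every value in all-NaN columns with fill_value (hypothetical)."""
    n_cols = len(rows[0])
    empty = [j for j in range(n_cols) if all(math.isnan(r[j]) for r in rows)]
    return [
        [fill_value if j in empty else v for j, v in enumerate(r)]
        for r in rows
    ]


nan = float("nan")
x = [[1.0, nan, 2.0], [3.0, nan, 4.0]]
filled = fill_missing_features(x)  # column 1 becomes all zeros
```

The shape of the input is preserved and no NaN survives in the formerly empty column, so no extra constant imputer is needed downstream.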

jnothman · 2020-04-21T11:51:55Z

> How about instead introducing fill_missing_features=False that would fill missing features with some values (e.g. likely 0)? Just asking for feedback for now, not asking to make these changes.

I'm happy with that.

I think it would be good to have in this PR if @vitorsrg can do it.

jnothman

Thanks @vitorsrg

jnothman · 2020-04-21T11:54:26Z

sklearn/impute/tests/test_impute.py

+    X_trans = imputer.fit_transform(X)
+
+    if strategy == "constant":
+        assert X.shape == X_trans.shape


This is a strange inconsistency. I think we should follow keep_missing_features in the constant case too and drop the feature if keep_missing_features=False.

SimpleImputer's constant strategy never dropped the features; and I don't think it should, because we can use it after the other imputers to prevent NaNs.

So the problem is that IterativeImputer(initial_strategy='constant', keep_missing_features=False) drops the features, whilst its SimpleImputer doesn't.

I'd personally prefer consistency across strategies. However, I'm happy to leave 'constant' as exceptional for now, but it should be documented.

I don't think we should be using it after other imputers to prevent NaNs. I think a guaranteed invariant for imputers should be that after transform there are no more missing values. Hence my support for @rth's fill_missing_features idea.

Should there be a fill_missing_features_value parameter, then?

jnothman · 2020-04-21T11:55:27Z

sklearn/impute/tests/test_impute.py

+
+    imputer = Imputer(**{arg: strategy, 'keep_missing_features': True})
+    assert X.shape == imputer.fit_transform(X).shape
+


Please also check the values: that the all-missing column is in the right place, and that all other values are identical to the X_trans produced with keep_missing_features=False.

This comment also applies to the common test
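
A minimal plain-Python sketch of the value check requested here, assuming the all-missing columns stay NaN when kept. The check_kept_columns helper and its argument names are hypothetical, not part of the PR's test suite.

```python
import math


def check_kept_columns(x_keep, x_drop, empty_cols):
    """Assert x_keep equals x_drop plus all-NaN columns at empty_cols."""
    for row_keep, row_drop in zip(x_keep, x_drop):
        # The kept columns must still be entirely missing, and in place.
        assert all(math.isnan(row_keep[j]) for j in empty_cols)
        # Removing them must give back the keep_missing_features=False output.
        rest = [v for j, v in enumerate(row_keep) if j not in empty_cols]
        assert rest == row_drop


nan = float("nan")
x_drop = [[1.0, 2.0], [3.0, 2.0]]            # keep_missing_features=False
x_keep = [[1.0, nan, 2.0], [3.0, nan, 2.0]]  # keep_missing_features=True
check_kept_columns(x_keep, x_drop, empty_cols=[1])
```

Checking values as well as shapes catches the case where the kept column ends up in the wrong position or the other columns are imputed differently.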

jnothman · 2020-04-21T11:56:05Z

sklearn/impute/tests/test_common.py

+def test_imputation_keep_missing_features(imputer):
+    X = np.array([[1, np.nan, 2], [3, np.nan, np.nan]])
+
+    imputer = imputer.__class__(keep_missing_features=True)


More conventional would be: imputer.set_params(keep_missing_features=True)

From what I checked, set_params is an instance method.
I would like to prevent problems with recycling imputer objects that have already learned their own statistics_ and other inner params.
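
The concern can be illustrated with a toy stand-in class (not scikit-learn code; ToyImputer is hypothetical and only mimics the method names): set_params mutates the existing instance and leaves any learned attributes in place, whereas constructing a new object starts unfitted. In scikit-learn itself, sklearn.base.clone is the usual way to get an unfitted copy with the same parameters.

```python
class ToyImputer:
    """Stand-in estimator; not scikit-learn's API beyond the method names."""

    def __init__(self, keep_missing_features=False):
        self.keep_missing_features = keep_missing_features

    def set_params(self, **params):
        # Update parameters in place and return the same object.
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, rows):
        # Learned state lives on the instance, like statistics_ in the thread.
        self.statistics_ = [sum(col) / len(col) for col in zip(*rows)]
        return self


fitted = ToyImputer().fit([[1.0, 2.0], [3.0, 4.0]])
same_object = fitted.set_params(keep_missing_features=True)
fresh = fitted.__class__(keep_missing_features=True)
```

Here same_object is the fitted instance itself and still carries statistics_, while fresh has no learned state.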

jnothman · 2020-04-21T11:57:06Z

sklearn/impute/tests/test_impute.py

+    [(SimpleImputer, 'strategy'), (IterativeImputer, 'initial_strategy')])
+@pytest.mark.parametrize(
+    "strategy",
+    ['mean', 'median', 'most_frequent', "constant"])


Perhaps we should instead include (some of) these different strategies in IMPUTERS in test_common.py??

I'm sorry, but I couldn't understand your suggestion.

mitar · 2020-04-22T06:29:11Z

sklearn/impute/_iterative.py

        the missing indicator even if there are missing values at
        transform/test time.

+    keep_missing_features : boolean, default=False


This name is pretty strange?

I think it's too long, but I couldn't find something better

Maybe keep_empty_features?

I consider missing a more consistent terminology, but it is not a strong opinion

I see "missing" as being the property of a cell, but what do we call a column with all "missing" cells/values? I think naming it "missing" again might be confusing. But also no strong preference here.

Now, the attribute indicator_ already addresses this

mitar · 2020-04-24T07:49:02Z

I would suggest adding an is_empty_ property containing the list of input column indices that were detected as having all NaN values.
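
A minimal plain-Python sketch of what such a list could contain. The empty_feature_indices helper is hypothetical; only the is_empty_ name comes from the suggestion above, and nothing here reflects an actual scikit-learn attribute.

```python
import math


def empty_feature_indices(rows):
    """Return indices of columns whose values are all NaN (hypothetical)."""
    return [
        j for j, col in enumerate(zip(*rows))
        if all(math.isnan(v) for v in col)
    ]


nan = float("nan")
# Same array as in the test scripts: only column 1 is entirely missing.
is_empty = empty_feature_indices([[1.0, nan, 2.0], [3.0, nan, nan]])
```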

jnothman · 2020-04-27T11:57:07Z

@vitorsrg will you be able to continue working on this?

@mitar, hopefully we'll implement better solutions for passing around feature names etc soon. I have previously proposed transformers could define a matrix indicating the dependencies between input and output features... but I've never really promoted it loudly. Here's one version of the idea: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep003/proposal.html

mitar · 2020-04-27T14:40:38Z

The problem with passing around feature names is that it does not work when there are duplicate feature names (which pandas supports), and you can end up with those quickly if you concat existing dataframes. I would prefer column indices in an attribute.
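
A small plain-Python illustration of the ambiguity, using made-up feature names: a name-based lookup collapses duplicates, while positional indices remain unambiguous.

```python
feature_names = ["age", "height", "age"]  # duplicate name, e.g. after a concat

# A name -> position lookup silently keeps only the last "age" column.
by_name = {name: j for j, name in enumerate(feature_names)}

# Positional indices stay unambiguous even with repeated names.
empty_by_index = [0, 2]
empty_names = [feature_names[j] for j in empty_by_index]
```

Both entries of empty_names are "age", so reporting names alone cannot tell the two columns apart.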

vitorsrg · 2020-05-02T22:02:42Z

@rth I think we should chain a "constant" imputer instead of adding a filler param.

vitorsrg · 2020-05-02T22:03:20Z

What is the expected behaviour for the missing features when transforming inplace?

jnothman · 2020-05-20T12:34:56Z

> What is the expected behaviour for the missing features when transforming inplace?

What does it currently do?

You currently appear to have a test failure?

cmarmo · 2020-09-25T09:49:15Z

Hi @vitorsrg, thanks for your work so far. Are you still interested in working on this? If so, do you mind fixing conflicts? Thanks.

vitorsrg · 2020-10-14T03:32:57Z

Yes, I'll be working on this again by next week

charlottesmith0308 · 2021-02-20T16:12:34Z

Is this feature still being worked on?


vitorsrg · 2021-03-04T05:26:48Z

FYI I've finished what I planned to do, so I'll just wait for the reviews now

cmarmo · 2021-03-25T14:07:58Z

This pull request also fixes #16977. Perhaps @rth could find some time to have a look?

cmarmo

@vitorsrg, while waiting for a core-dev review, here are just some suggestions to comply with the documentation guidelines.
If you could find some time to fix conflicts, this will also make the review easier. Thanks for your work and your patience!

sklearn/impute/_base.py

sklearn/impute/_iterative.py

sklearn/impute/_knn.py

Co-authored-by: Chiara Marmo <[email protected]>

adrinjalali · 2021-08-20T14:02:51Z

@amueller related to imputation, in case you're interested in giving it a review. Removing this from the milestone.

jjerphan · 2022-09-29T16:03:01Z

Hi @vitorsrg,

We would like to include this PR in the 1.2 release.
Do you still want to pursue it and have time to do so, or would you allow maintainers to finalise it?

Thanks!

glemaitre · 2022-11-16T17:22:04Z

Closing in favor of #24770

Co-authored-by: Chiara Marmo <[email protected]> Co-authored-by: Julien Jerphanion <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: Vitor SRG <[email protected]> Fixes #16695 Fixes #16426 Fixes #16977

github-actions bot added the module:impute label Mar 15, 2020

jnothman reviewed Mar 25, 2020


vitorsrg requested a review from jnothman March 27, 2020 06:42

rth mentioned this pull request Apr 21, 2020

IterativeImputer automatically removes variables with all missing values #16977

Closed

jnothman reviewed Apr 21, 2020


mitar reviewed Apr 22, 2020


rth mentioned this pull request Apr 23, 2020

Missing features removal with SimpleImputer #16426

Closed

vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from fe0bd3d to 64e36a7 Compare April 30, 2020 10:15

vitorsrg marked this pull request as draft April 30, 2020 11:28

vitorsrg marked this pull request as ready for review May 5, 2020 09:26

vitorsrg requested a review from jnothman May 5, 2020 09:26

postmalloc mentioned this pull request Jun 18, 2020

ENH Add inverse_transform feature to SimpleImputer #17612

Merged


Base automatically changed from master to main January 22, 2021 10:52

cmarmo added the help wanted label Feb 25, 2021

vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 116adac to 092ea4e Compare March 3, 2021 09:13

vitorsrg added a commit to vitorsrg/scikit-learn that referenced this pull request Mar 3, 2021

Add PR scikit-learn#16695 entry to 1.0 changelog

9fb1cbb

vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 092ea4e to 9fb1cbb Compare March 3, 2021 09:34

vitorsrg added 5 commits March 3, 2021 07:24

Add keep_missing_features to SimpleImputer

a910d81

Set in_fit=False in IterativeImputer.transform's call to IterativeImp…

cb76dc8

…uter._initial_imputation

Add keep_missing_features to IterativeImputer

a57e351

Add keep_missing_features to KNNImputer

30a4676

Add PR scikit-learn#16695 entry to 1.0 changelog

0fe4913

vitorsrg force-pushed the fix/imputer-keep-missing-feats branch from 9fb1cbb to 0fe4913 Compare March 3, 2021 10:26

cmarmo added the Waiting for Reviewer label Mar 4, 2021

cmarmo added this to the 1.0 milestone Mar 25, 2021

cmarmo reviewed Mar 25, 2021


Change 'boolean' to 'bool'

f4553b3

Co-authored-by: Chiara Marmo <[email protected]>

jawadjawid added a commit to jawadjawid/scikit-learn that referenced this pull request Mar 27, 2021

Add fix to scikit-learn#16695

3bdc622

jawadjawid added a commit to jawadjawid/scikit-learn that referenced this pull request Mar 27, 2021

Add units tests to scikit-learn#16695

46541bb

adrinjalali removed this from the 1.0 milestone Aug 20, 2021

jeremiedbb added this to the 1.2 milestone Sep 24, 2022

glemaitre changed the title ~~Keep missing features during imputation~~ ENH keep missing features during imputation Oct 27, 2022

glemaitre self-assigned this Oct 27, 2022

Merge remote-tracking branch 'origin/main' into pr/vitorsrg/16695

7a752ec

glemaitre mentioned this pull request Oct 27, 2022

ENH keep features with all missing values during imputation #24770

Merged

cmarmo added the Superseded label (PR has been replaced by a newer PR) and removed the Waiting for Reviewer label Oct 27, 2022

glemaitre closed this Nov 16, 2022

