Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

To handle non-actionable steps in sklearn #866

Merged
merged 7 commits into from
Nov 18, 2019
Merged

To handle non-actionable steps in sklearn #866

merged 7 commits into from
Nov 18, 2019

Conversation

Neeratyoy
Copy link
Contributor

@Neeratyoy Neeratyoy commented Nov 6, 2019

Reference Issue

Addresses #825 and #480.

What does this PR implement/fix? Explain your changes.

The serialization expects each step to be a sklearn module and eventually an OpenMLFlow or a None. For ColumnTransform operations, a drop or passthrough can be specified to include a step in a pipeline but not execute it. The changes in this PR incorporates this additional string handling of the keywords drop and passthrough in a ColumnTransformer call.

How should this PR be tested?

import openml
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

dataset_id = 61
dataset = openml.datasets.get_dataset(dataset_id)

task = openml.tasks.get_task(59)

X, y, categorical_ind, feature_names = dataset.get_data(target=dataset.default_target_attribute, dataset_format='array')
categorical_ind = np.array(categorical_ind)
cat_idx, = np.where(categorical_ind)
cont_idx, = np.where(~categorical_ind)

clf = make_pipeline(ColumnTransformer([('cat', make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()), cat_idx.tolist()),
                                       ('cont', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), cont_idx.tolist())])          
                , DecisionTreeClassifier(max_depth=1))
if not categorical_ind.any():
    clf[0].set_params(cat='drop')
if not (~categorical_ind).any():
    clf[0].set_params(cont='drop')
hotencoded = clf[:-1].fit_transform(X)

bla = openml.runs.run_model_on_task(model=clf, task=task, return_flow=True)

Any other comments?

This PR handles the specific case of a ColumnTransform. I'm not sure if there are other such keywords which should be handled equivalently. We can then maintain a keyword listing, based on the sklearn version under extension.py and handle them with a similar logic.
If there are more diverse cases that may arise, some examples would help in designing a generalized solution in that case.
Would need to accordingly design relevant unit tests too.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@Neeratyoy Neeratyoy requested a review from mfeurer November 6, 2019 16:24
@codecov-io
Copy link

codecov-io commented Nov 6, 2019

Codecov Report

Merging #866 into develop will increase coverage by 2.23%.
The diff coverage is 83.33%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #866      +/-   ##
===========================================
+ Coverage    88.38%   90.62%   +2.23%     
===========================================
  Files           37       37              
  Lines         4298     6163    +1865     
===========================================
+ Hits          3799     5585    +1786     
- Misses         499      578      +79
Impacted Files Coverage Δ
openml/extensions/sklearn/extension.py 93.43% <83.33%> (+2.11%) ⬆️
openml/evaluations/evaluation.py 63.88% <0%> (-2.78%) ⬇️
openml/extensions/functions.py 95.65% <0%> (-1.13%) ⬇️
openml/utils.py 90.66% <0%> (-0.41%) ⬇️
openml/exceptions.py 93.47% <0%> (-0.08%) ⬇️
openml/extensions/sklearn/__init__.py 100% <0%> (ø) ⬆️
openml/setups/__init__.py 100% <0%> (ø) ⬆️
openml/tasks/__init__.py 100% <0%> (ø) ⬆️
openml/__init__.py 100% <0%> (ø) ⬆️
openml/extensions/__init__.py 100% <0%> (ø) ⬆️
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e489f41...3b03d15. Read the comment docs.

Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your changes all make sense, but require some explanation in the code (and tests).

@Neeratyoy Neeratyoy marked this pull request as ready for review November 12, 2019 20:02
@Neeratyoy Neeratyoy requested a review from mfeurer November 12, 2019 20:02
Copy link
Collaborator

@mfeurer mfeurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add some error handling for earlier scikit-learn versions? Right now the tests fail due to a hard-to-interpret error message.

@Neeratyoy Neeratyoy requested a review from mfeurer November 13, 2019 15:41
@Neeratyoy
Copy link
Contributor Author

@mfeurer one of the old unit tests that I had to change test_serialize_feature_union fails for versions 0.18.2 and 0.19.2.
It appears that the manner in which we are serializing, the behaviour of one of the intermediate stages have changed since these two versions. To be more specific, the parameter metadata extraction.

Should I then try to change the code to see how this can be handled or skip that particular unit test?

@mfeurer
Copy link
Collaborator

mfeurer commented Nov 18, 2019

Should I then try to change the code to see how this can be handled or skip that particular unit test?

You can skip that unit test.

@Neeratyoy Neeratyoy requested a review from mfeurer November 18, 2019 14:23
@mfeurer mfeurer merged commit 2b7e740 into develop Nov 18, 2019
@mfeurer mfeurer deleted the fix_825 branch November 18, 2019 14:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants