To handle non-actionable steps in sklearn #866

Neeratyoy · 2019-11-06T16:23:58Z

Reference Issue

Addresses #825 and #480.

What does this PR implement/fix? Explain your changes.

The serialization expects each step to be a sklearn module and eventually an OpenMLFlow or a None. For ColumnTransform operations, a drop or passthrough can be specified to include a step in a pipeline but not execute it. The changes in this PR incorporates this additional string handling of the keywords drop and passthrough in a ColumnTransformer call.

How should this PR be tested?

import openml
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

dataset_id = 61
dataset = openml.datasets.get_dataset(dataset_id)

task = openml.tasks.get_task(59)

X, y, categorical_ind, feature_names = dataset.get_data(target=dataset.default_target_attribute, dataset_format='array')
categorical_ind = np.array(categorical_ind)
cat_idx, = np.where(categorical_ind)
cont_idx, = np.where(~categorical_ind)

clf = make_pipeline(ColumnTransformer([('cat', make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder()), cat_idx.tolist()),
                                       ('cont', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), cont_idx.tolist())])          
                , DecisionTreeClassifier(max_depth=1))
if not categorical_ind.any():
    clf[0].set_params(cat='drop')
if not (~categorical_ind).any():
    clf[0].set_params(cont='drop')
hotencoded = clf[:-1].fit_transform(X)

bla = openml.runs.run_model_on_task(model=clf, task=task, return_flow=True)

Any other comments?

This PR handles the specific case of a ColumnTransform. I'm not sure if there are other such keywords which should be handled equivalently. We can then maintain a keyword listing, based on the sklearn version under extension.py and handle them with a similar logic.
If there are more diverse cases that may arise, some examples would help in designing a generalized solution in that case.
Would need to accordingly design relevant unit tests too.

codecov-io · 2019-11-06T19:14:20Z

Codecov Report

Merging #866 into develop will increase coverage by 2.23%.
The diff coverage is 83.33%.

@@             Coverage Diff             @@
##           develop     #866      +/-   ##
===========================================
+ Coverage    88.38%   90.62%   +2.23%     
===========================================
  Files           37       37              
  Lines         4298     6163    +1865     
===========================================
+ Hits          3799     5585    +1786     
- Misses         499      578      +79

Impacted Files	Coverage Δ
openml/extensions/sklearn/extension.py	`93.43% <83.33%> (+2.11%)`	⬆️
openml/evaluations/evaluation.py	`63.88% <0%> (-2.78%)`	⬇️
openml/extensions/functions.py	`95.65% <0%> (-1.13%)`	⬇️
openml/utils.py	`90.66% <0%> (-0.41%)`	⬇️
openml/exceptions.py	`93.47% <0%> (-0.08%)`	⬇️
openml/extensions/sklearn/__init__.py	`100% <0%> (ø)`	⬆️
openml/setups/__init__.py	`100% <0%> (ø)`	⬆️
openml/tasks/__init__.py	`100% <0%> (ø)`	⬆️
openml/__init__.py	`100% <0%> (ø)`	⬆️
openml/extensions/__init__.py	`100% <0%> (ø)`	⬆️
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e489f41...3b03d15. Read the comment docs.

mfeurer

I think your changes all make sense, but require some explanation in the code (and tests).

openml/extensions/sklearn/extension.py

mfeurer

Could you please also add some error handling for earlier scikit-learn versions? Right now the tests fail due to a hard-to-interpret error message.

openml/extensions/sklearn/extension.py

Neeratyoy · 2019-11-15T17:03:27Z

@mfeurer one of the old unit tests that I had to change test_serialize_feature_union fails for versions 0.18.2 and 0.19.2.
It appears that the manner in which we are serializing, the behaviour of one of the intermediate stages have changed since these two versions. To be more specific, the parameter metadata extraction.

Should I then try to change the code to see how this can be handled or skip that particular unit test?

tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py

openml/extensions/sklearn/extension.py

mfeurer · 2019-11-18T10:29:56Z

Should I then try to change the code to see how this can be handled or skip that particular unit test?

You can skip that unit test.

Initial changes to handle reproducible example from the issue

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

GPG key ID: B5690EEEBB952194

Verified
Learn about vigilant mode

78b163e

Neeratyoy requested a review from mfeurer November 6, 2019 16:24

mfeurer reviewed Nov 7, 2019

View reviewed changes

Neeratyoy mentioned this pull request Nov 12, 2019

Cope with empty steps in scikit-learn #480

Closed

Neeratyoy added 2 commits November 12, 2019 20:02

Making tentative changes; Need to test deserialization

4875dba

Fixing deserialization when empty steps in sklearn model

0234d19

Neeratyoy marked this pull request as ready for review November 12, 2019 20:02

Neeratyoy requested a review from mfeurer November 12, 2019 20:02

mfeurer reviewed Nov 13, 2019

View reviewed changes

openml/extensions/sklearn/extension.py Outdated Show resolved Hide resolved

Neeratyoy added 2 commits November 13, 2019 14:43

Fixing flake issues, failing test cases

cbf0d6a

Fixing test cases

7768ca3

Neeratyoy requested a review from mfeurer November 13, 2019 15:41

Dropping support for 'None' as sklearn estimator

35d3c34

mfeurer requested changes Nov 18, 2019

View reviewed changes

tests/test_extensions/test_sklearn_extension/test_sklearn_extension.py Outdated Show resolved Hide resolved

openml/extensions/sklearn/extension.py Outdated Show resolved Hide resolved

Adding test case for None estimator

3b03d15

Neeratyoy requested a review from mfeurer November 18, 2019 14:23

mfeurer approved these changes Nov 18, 2019

View reviewed changes

mfeurer merged commit 2b7e740 into develop Nov 18, 2019

mfeurer deleted the fix_825 branch November 18, 2019 14:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

To handle non-actionable steps in sklearn #866

To handle non-actionable steps in sklearn #866

Neeratyoy commented Nov 6, 2019 •

edited

Loading

codecov-io commented Nov 6, 2019 •

edited

Loading

mfeurer left a comment

mfeurer left a comment

Neeratyoy commented Nov 15, 2019

mfeurer commented Nov 18, 2019

To handle non-actionable steps in sklearn #866

To handle non-actionable steps in sklearn #866

Conversation

Neeratyoy commented Nov 6, 2019 • edited Loading

Reference Issue

What does this PR implement/fix? Explain your changes.

How should this PR be tested?

Any other comments?

codecov-io commented Nov 6, 2019 • edited Loading

Codecov Report

mfeurer left a comment

Choose a reason for hiding this comment

mfeurer left a comment

Choose a reason for hiding this comment

Neeratyoy commented Nov 15, 2019

mfeurer commented Nov 18, 2019

Neeratyoy commented Nov 6, 2019 •

edited

Loading

codecov-io commented Nov 6, 2019 •

edited

Loading