Feature-engine is a Python library with multiple transformers to engineer features for use in machine learning models. Feature-engine's transformers follow the scikit-learn convention of fit() and transform() methods: fit() learns the transformation parameters from the data, and transform() then applies them to the data.
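To illustrate the fit() / transform() pattern described above, here is a minimal, dependency-free sketch. `MedianImputerSketch` and its `imputer_dict_` attribute are hypothetical names made up for this illustration; this is not Feature-engine's implementation, and it works on plain lists of dicts rather than on pandas DataFrames.

```python
from statistics import median


class MedianImputerSketch:
    """Toy transformer (hypothetical, not a Feature-engine class).

    It only illustrates the pattern: fit() learns parameters,
    transform() applies them.
    """

    def __init__(self, variables):
        self.variables = variables

    def fit(self, rows):
        # Learn one parameter per variable: the median of its observed values.
        self.imputer_dict_ = {
            var: median(r[var] for r in rows if r[var] is not None)
            for var in self.variables
        }
        return self

    def transform(self, rows):
        # Fill missing values with the learned medians.
        # New dicts are built, so the input rows are left unmodified.
        return [
            {
                key: (self.imputer_dict_[key]
                      if key in self.imputer_dict_ and value is None
                      else value)
                for key, value in row.items()
            }
            for row in rows
        ]


train = [{"age": 20}, {"age": None}, {"age": 40}, {"age": 30}]
test = [{"age": None}, {"age": 55}]

# Parameters are learned from the training data only...
imputer = MedianImputerSketch(variables=["age"]).fit(train)

# ...and then reused to transform any dataset, including unseen data.
train_t = imputer.transform(train)
test_t = imputer.transform(test)
```

Keeping the learned parameters on the fitted object is what allows the same transformation to be applied consistently to training and test data.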
- Feature-engine: A new open-source Python package for feature engineering
- Practical Code Implementations of Feature Engineering for Machine Learning with Python
More resources will be added as they appear online!
- Missing Data Imputation
- Categorical Variable Encoding
- Outlier Capping or Removal
- Discretisation
- Numerical Variable Transformation
- Scikit-learn Wrappers
- Variables Combination
- Variable Selection
- MeanMedianImputer
- RandomSampleImputer
- EndTailImputer
- AddNaNBinaryImputer
- CategoricalVariableImputer
- FrequentCategoryImputer
- ArbitraryNumberImputer
- CountFrequencyCategoricalEncoder
- OrdinalCategoricalEncoder
- MeanCategoricalEncoder
- WoERatioCategoricalEncoder
- OneHotCategoricalEncoder
- RareLabelCategoricalEncoder
- Winsorizer
- ArbitraryOutlierCapper
- OutlierTrimmer
- EqualFrequencyDiscretiser
- EqualWidthDiscretiser
- DecisionTreeDiscretiser
- UserInputDiscretiser
- LogTransformer
- ReciprocalTransformer
- PowerTransformer
- BoxCoxTransformer
- YeoJohnsonTransformer
- SklearnTransformerWrapper
- MathematicalCombinator
- DropFeatures
From PyPI using pip:
pip install feature_engine
From Anaconda:
conda install -c conda-forge feature_engine
Or simply clone it:
git clone https://github.com/solegalli/feature_engine.git
>>> from feature_engine.categorical_encoders import RareLabelCategoricalEncoder
>>> import pandas as pd
>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
Out[1]:
A 10
B 10
C 2
D 1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelCategoricalEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
Out[2]:
A 10
B 10
Rare 3
Name: var_A, dtype: int64
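The grouping in the example above can be sketched in plain Python to show why C and D end up under "Rare". The function `group_rare_labels` is a hypothetical helper written for this illustration, not part of Feature-engine. It assumes that a label is kept when its relative frequency is at least `tol`, and that grouping only happens when the variable has more than `n_categories` distinct labels; the exact comparisons used by the library may differ.

```python
from collections import Counter


def group_rare_labels(values, tol=0.05, n_categories=10, replace_with="Rare"):
    """Hypothetical sketch of rare-label grouping (not the library's code)."""
    counts = Counter(values)

    # Assumption: with too few distinct labels, every label counts as frequent.
    if len(counts) <= n_categories:
        return list(values)

    total = len(values)
    # Assumption: a label is frequent if its share of observations is >= tol.
    frequent = {label for label, n in counts.items() if n / total >= tol}

    # Replace every infrequent label with the placeholder.
    return [v if v in frequent else replace_with for v in values]


# Same data as above: 23 observations in total.
var_a = ["A"] * 10 + ["B"] * 10 + ["C"] * 2 + ["D"] * 1

# C (2/23, about 0.087) and D (1/23, about 0.043) fall below tol=0.10.
encoded = group_rare_labels(var_a, tol=0.10, n_categories=3)
encoded_counts = Counter(encoded)
```

Under these assumptions, `encoded_counts` holds 10 observations each for A and B and 3 for "Rare", matching the output shown above.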
See more usage examples in the Jupyter Notebooks in the example folder of this repository, or in the documentation.
Details about how to contribute can be found on the Contributing Page.
In short:
- Fork the repo
- Clone your fork to your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
- cd into the repo
cd feature_engine
- Install as a developer:
pip install -e .
- Create and activate a virtual environment with any tool of choice
- Install the dependencies as explained in the Contributing Page
- Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
- Develop your feature, tests and documentation
- Make sure the tests pass
- Make a PR
Thank you!!
PRs are welcome! Please make sure the CI tests pass on your branch.
We prefer tox for running the tests. In your environment:
- Run
pip install tox
- cd into the root directory of the repo:
cd feature_engine
- Run
tox
If the tests pass, the code is functional.
You can also run the tests in your environment (without tox). For guidelines on how to do so, check the Contributing Page.
Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.
To build the documentation, make sure you have the dependencies installed. From the root directory:
pip install -r docs/requirements.txt
Now you can build the docs:
sphinx-build -b html docs build
BSD 3-Clause
Many of the engineering and encoding functionalities are inspired by this series of articles from the 2009 KDD Competition.