
GSoC_2020_project_modelselection

Gil edited this page Feb 2, 2020 · 2 revisions

Flexible modelselection

Following up on one of our very first GSoC (2011) projects, this project intends to clean up, unify, extend, and scale up Shogun's modelselection and hyper-parameter tuning framework. It is a cool mixture of modernizing existing code, using multi-threaded (and potentially distributed) concepts, and playing with black-box optimization frameworks.

Mentors

Difficulty & Requirements

Medium to advanced. The difficulty depends on your ambitions, and we are flexible regarding the student's abilities.

You need to know about:

  • Modelselection basics (x-validation, search-algorithms, implementation)
  • Modern C++ (threading with the STL, template metaprogramming, type erasure)
  • Knowledge of other libraries' approaches (sklearn, MLPack)

Ideally you also know about:

Details

X-validation v2

Every learning algorithm (Machine subclass) should work with x-validation ... fast! This is completely independent of any hyper-parameter tuning.

  • All model classes should be systematically tested with x-validation, see issue. This is similar to the trained model tests.
  • Identify models that perform only read-only operations on the features (this will eventually be all models, depending on the progress of the features-detox project).
  • Enable multi-core x-validation using OpenMP or std::thread, via cloning of the underlying learning machine, but with shared features (memory efficiency!). If there is interest, this could be extended to distributed computing using a framework such as CAF.
  • Carefully test the chosen models for race-conditions, memory errors, etc.
  • Add algorithms on a one-by-one basis.
  • Generalise the code of the "trained model serialization" tests into "trained model" tests, where multiple things can be checked for the trained models (serialization and x-validation for now).
  • Make sure model-selection has a progress bar, is stoppable, continue-able, etc. See also the black-box project.
  • If there is enough time, implement cutting-edge algorithms using Bayesian optimisation or even Stein Point Markov Chain Monte Carlo with the new API!
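The clone-per-fold idea from the list above can be sketched in a few lines. This is only an illustration in Python, not Shogun code: `MeanRegressor`, `kfold_indices`, `run_fold` and `cross_validate` are hypothetical names, the toy model stands in for a real Machine subclass, and the features are shared between folds purely by convention (nothing enforces read-only access here). In CPython the threads will not speed up CPU-bound training; the point is the structure, which maps onto std::thread or OpenMP in C++.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class MeanRegressor:
    """Toy stand-in for a learning machine: predicts the training-label mean."""

    def __init__(self):
        self.mean_ = None

    def train(self, X, y, indices):
        # Only reads X and y; all learned state lives on this object.
        self.mean_ = sum(y[i] for i in indices) / len(indices)
        return self

    def apply(self, X, indices):
        return [self.mean_ for _ in indices]


def kfold_indices(n, k):
    """Split range(n) into k contiguous (train, test) index pairs."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    return [
        ([i for j, f in enumerate(folds) if j != t for i in f], folds[t])
        for t in range(k)
    ]


def run_fold(machine, X, y, train_idx, test_idx):
    clone = copy.deepcopy(machine)  # per-fold clone: folds never share model state
    clone.train(X, y, train_idx)    # X and y are shared and treated as read-only
    preds = clone.apply(X, test_idx)
    # Mean squared error on the held-out fold.
    return sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx)) / len(test_idx)


def cross_validate(machine, X, y, k=5, workers=4):
    splits = kfold_indices(len(y), k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_fold, machine, X, y, tr, te) for tr, te in splits]
        return [f.result() for f in futures]


X = tuple((float(i),) for i in range(10))  # tuples signal the read-only intent
y = tuple(1.0 for _ in range(10))
scores = cross_validate(MeanRegressor(), X, y, k=5)
```

Because each fold trains its own clone, a race-condition test can simply compare the parallel scores against a sequential run with `workers=1`.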

We recently started pushing the use of modern C++ techniques for overload dispatching, mixins, concepts and more. Elegant solutions to the cross-validation problem that use them are of course highly welcome :)

A clean API

We want to build a better way to specify free parameters to learn, which overlaps with the user experience project. We would like to shop around other libraries for ideas on specifying this.

Potential API:

params = root()
         .add("C", [1,2,3,4])
         .add("kernel", kernel("GaussianKernel"))
         .add("kernel::log_width", [1,2,3,4]).build()

# from a string
params = parse("""
{
"C": [1,2,3,4],
"epsilon": [0.01],
"kernel": {
   "name": "GaussianKernel"
   }
}
""")

# gridsearch
gs = GridSearch(params, cv=cv).train(X_train, y_train)

# randomized search
rs = RandomizedSearch(params, cv=cv).train(X_train, y_train)

# other parameter optimisation methods, e.g. Bayesian optimisation
...

Check out https://github.com/shogun-toolbox/shogun/pull/4598 for some work that has already been done but not merged.
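To make the builder chain from the potential API above concrete, here is a minimal, self-contained sketch of how such a parameter description could be expanded into candidates for grid and randomized search. Only `root`, `add` and `build` come from the draft API; `ParamBuilder`, `grid` and `sample` are hypothetical helpers, and the kernel is represented by a plain string rather than a real kernel object. It is not the design from the pull request.

```python
import itertools
import random


class ParamBuilder:
    """Hypothetical builder mirroring the root().add(...).build() chain."""

    def __init__(self):
        self._space = {}

    def add(self, name, values):
        # A single value is treated as a one-element list of candidates.
        if isinstance(values, (list, tuple)):
            self._space[name] = list(values)
        else:
            self._space[name] = [values]
        return self

    def build(self):
        return dict(self._space)


def root():
    return ParamBuilder()


def grid(space):
    """Yield every combination of candidate values (exhaustive grid search)."""
    names = sorted(space)
    for combo in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, combo))


def sample(space, n, seed=0):
    """Draw n random combinations (randomized search)."""
    rng = random.Random(seed)
    return [{name: rng.choice(vals) for name, vals in space.items()} for _ in range(n)]


params = (
    root()
    .add("C", [1, 2, 3, 4])
    .add("kernel", "GaussianKernel")
    .add("kernel::log_width", [1, 2, 3, 4])
    .build()
)
candidates = list(grid(params))  # 4 * 1 * 4 combinations
```

Each candidate is a flat dictionary keyed by the same `kernel::log_width` style names, so a search class would only need to apply the values to a cloned machine before cross-validating it.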

Some steps:

  • Review and compare other libraries' approaches
  • Collect the most common use cases (random search, grid-search, gradient search (e.g. in our Gaussian Process framework))
  • Come up with a set of clean API examples / user stories for those cases
  • Draft code that shows how to implement this API. This will include ways to annotate the spaces that parameters live in, as well as whether gradients are available.
  • Implement and test systematically
  • Make sure it works nicely in all target languages.
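One possible shape for the annotations mentioned in the steps above (the space a parameter lives in, and whether a gradient is available) is sketched below. Everything here is an assumption made for illustration: `ParamSpec`, its fields, `choose_strategy` and the bounds are hypothetical and not part of Shogun.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    """Hypothetical annotation for one free parameter."""
    name: str
    lower: float
    upper: float
    log_scale: bool = False      # search uniformly in log-space when True
    has_gradient: bool = False   # whether the model can report a gradient for it

    def sample(self, rng):
        if self.log_scale:
            return math.exp(rng.uniform(math.log(self.lower), math.log(self.upper)))
        return rng.uniform(self.lower, self.upper)


def choose_strategy(specs):
    """Use gradient search only if every parameter exposes a gradient."""
    return "gradient" if all(s.has_gradient for s in specs) else "random"


specs = [
    ParamSpec("C", 1e-3, 1e3, log_scale=True),
    ParamSpec("kernel::log_width", -2.0, 2.0, has_gradient=True),
]
rng = random.Random(0)
draw = {s.name: s.sample(rng) for s in specs}
strategy = choose_strategy(specs)  # "random", because C has no gradient here
```

Keeping the annotation separate from the search algorithm means the same description could feed grid search, random search, or a gradient-based method.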

Black box optimisation

Bayesian optimisation and stochastic optimisation are powerful frameworks for blackbox optimisation. We aim to integrate bindings for both during the project. There are plenty of external libraries that implement the algorithms for us, so this task is mostly about designing interfaces that tell Shogun to cross-validate the algorithm on the next set of parameters and report its performance.
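The interface described here is often written as an "ask/tell" loop: the optimiser proposes parameters, the library evaluates them, and the score is reported back. The sketch below illustrates that loop under loud assumptions: plain random search stands in for a real Bayesian optimisation library, `cross_validated_error` is a placeholder quadratic rather than an actual cross-validation run, and all class and function names are hypothetical.

```python
import random


class RandomSearchOptimizer:
    """Hypothetical stand-in for an external black-box optimiser."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds          # {name: (low, high)}
        self.rng = random.Random(seed)
        self.history = []             # list of (params, score) pairs

    def ask(self):
        # Propose the next parameter set to evaluate.
        return {n: self.rng.uniform(lo, hi) for n, (lo, hi) in self.bounds.items()}

    def tell(self, params, score):
        # Receive the observed performance for a proposed parameter set.
        self.history.append((params, score))

    def best(self):
        return min(self.history, key=lambda item: item[1])


def cross_validated_error(params):
    # Placeholder objective with its minimum at C = 2; a real implementation
    # would train and cross-validate a machine with these parameters.
    return (params["C"] - 2.0) ** 2


def optimise(optimizer, objective, n_iter):
    for _ in range(n_iter):
        params = optimizer.ask()
        optimizer.tell(params, objective(params))
    return optimizer.best()


opt = RandomSearchOptimizer({"C": (0.0, 4.0)})
best_params, best_score = optimise(opt, cross_validated_error, n_iter=50)
```

Because the loop only relies on `ask` and `tell`, swapping in a different optimiser would not change the evaluation side, and the stored `history` is a natural place to hook in progress reporting or resuming an interrupted search.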

Why this is cool

Hyperparameter estimation is a resource-intensive process, and it is required to develop a performant model. Shogun currently lacks this fundamental functionality, and developing a clean and simple model selection API to quickly tune parameters would massively boost Shogun's usability. The project spans a huge range of topics within and outside of Shogun, including framework internals as well as cutting-edge algorithms for optimisation. Be ready to learn a lot.

Useful resources
