Comparing Apples and Apples: Experimentation with and Benchmarking of Hyperparameter Tuning
Using the Syne Tune open source system, we show how to improve code sharing and benchmarking for hyperparameter tuning, not only eliminating confounding factors in empirical comparisons, but also speeding up methodological innovations.
Abstract
Empirical studies in current HPO publications are often apples-and-oranges comparisons, plagued by many confounding factors, which can lower trust in the field and hinder adoption among practitioners, especially in industry. In this tutorial, we show how many of these confounding factors can be eliminated without restricting the freedom of researchers to innovate on methodology. In Syne Tune, HPO methods are implemented against a simple API, abstracting away details of trial execution or signaling. Experiments can be run locally, distributed in the cloud, or in simulation. Benchmarking on tabulated or surrogate benchmarks is simplified and standardized by a method-agnostic simulation backend and a blackbox repository, which provide fully realistic results over many benchmarks, methods, and random repetitions, often orders of magnitude faster than real time. An exploration can be switched from simulation to distributed execution in the cloud with a few lines of code. Syne Tune features clean implementations of many state-of-the-art HPO methods, including multi-fidelity, constrained, and multi-objective modalities, so that studies can start from a wide range of baselines. It also comes with tools for results aggregation and visualization, and incorporates best practices for comparing methods across several benchmarks.
Outline
The main goal of this tutorial is to outline a concrete path towards improving code sharing and benchmarking for hyperparameter tuning and the automation of large model training and transfer learning. Using the Syne Tune open source system as a running example, we demonstrate that most confounding factors behind the apples-and-oranges comparisons ubiquitous in the current literature can be removed with little extra effort. To this end, we clarify concepts and properties along which modern HPO technology can be factorized (e.g., search, scheduling, trial execution, synchronous versus asynchronous decision making, multi-fidelity). In Syne Tune, methods are implemented against APIs which embody this factorization and encourage sharing code for common properties. Not only does this speed up innovation, it also allows for apples-and-apples comparisons, down to paired comparisons with the same random seeds, which are essential to gain trust with practitioners in industry. Other open source systems for standardizing HPO benchmarking include HPOLib, YAHPO, Ax, and the AutoML benchmarks, and other widely used HPO open source libraries include Optuna and Ray Tune. This tutorial uses Syne Tune, a fully functional backend-agnostic system for distributed HPO with dedicated tooling for benchmarking (simulator backend, blackbox repository, parallel experimentation in the cloud).
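The factorization into search, scheduling, and trial execution can be sketched with a few lines of plain Python. All class and method names below are hypothetical stand-ins chosen for illustration, not the actual Syne Tune API; the point is that the tuning loop only talks to three small interfaces, so any one of them (e.g., the backend) can be swapped without touching the others.

```python
import random

class RandomSearcher:
    """Proposes configurations; knows nothing about trial execution."""
    def __init__(self, config_space, seed=0):
        self.config_space = config_space
        self.rng = random.Random(seed)  # fixed seed enables paired comparisons

    def suggest(self):
        return {name: self.rng.uniform(lo, hi)
                for name, (lo, hi) in self.config_space.items()}

class FIFOScheduler:
    """Decides when to start trials; delegates the 'what' to the searcher."""
    def __init__(self, searcher):
        self.searcher = searcher

    def on_worker_free(self):
        return self.searcher.suggest()

class ToyBackend:
    """Executes trials; a synthetic objective stands in for real training."""
    def run(self, config):
        return (config["lr"] - 0.1) ** 2  # pretend validation error

# The loop below is backend-agnostic: replacing ToyBackend with a local,
# cloud, or simulation backend would leave searcher and scheduler untouched.
scheduler = FIFOScheduler(RandomSearcher({"lr": (0.001, 1.0)}, seed=42))
backend = ToyBackend()
results = [backend.run(scheduler.on_worker_free()) for _ in range(20)]
best = min(results)
print(best)
```

Sharing the same seeded searcher across two schedulers is what makes paired comparisons with identical random seeds possible.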
You will learn how to design comparative studies that factorize across methods, benchmarks, and execution backends, where each such decision can be changed with a few lines of code. For comparisons on tabulated or surrogate benchmarks, asynchronously parallel trial execution and decision making is simulated, so that results realistic in terms of wall-clock time are obtained, often orders of magnitude faster than real time. Syne Tune makes it possible to run many experiments in parallel in the cloud, and its tooling for filtering, aggregating, and visualizing comparative results simplifies and speeds up decisions for the next round of experiments.
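The idea behind simulating asynchronously parallel tuning from a table can be sketched as an event-driven loop: trial runtimes are looked up rather than waited out, and simulated wall-clock time jumps from one trial completion to the next. The code below is a minimal illustration with a made-up benchmark table, not the Syne Tune simulator backend.

```python
import heapq
import random

# Made-up tabulated benchmark: config id -> (training time in s, final error).
TABULATED = {i: (random.Random(i).uniform(10, 100),
                 random.Random(i + 1000).uniform(0.1, 0.5))
             for i in range(100)}

def simulate(num_workers=4, num_trials=20):
    """Replay trials on simulated workers; no real waiting happens."""
    events = []  # min-heap of (finish_time, config_id, error)
    sim_time, next_cfg = 0.0, 0
    # Fill all workers at simulated time zero.
    for _ in range(num_workers):
        runtime, error = TABULATED[next_cfg]
        heapq.heappush(events, (sim_time + runtime, next_cfg, error))
        next_cfg += 1
    history = []
    while events:
        finish, cfg, error = heapq.heappop(events)
        sim_time = finish  # jump ahead instead of sleeping
        history.append((sim_time, cfg, error))
        if next_cfg < num_trials:
            # Asynchronous: the freed worker starts a new trial immediately.
            runtime, error = TABULATED[next_cfg]
            heapq.heappush(events, (sim_time + runtime, next_cfg, error))
            next_cfg += 1
    return history

history = simulate()
print(f"simulated {len(history)} trials, "
      f"final sim time {history[-1][0]:.1f}s")
```

Because the loop only advances a clock variable, simulating hours of tuning takes milliseconds, which is what makes many repetitions across benchmarks and seeds affordable.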
Syllabus:
Introduction:
Apples-and-oranges comparisons, and how we can do better
Confounding factors in comparisons of HPO methods
Basic concepts (with examples):
Hyperparameter tuning: Problem and terminology
Interplay between scheduler and trial execution backend
Synchronous and asynchronous decision making
Multi-fidelity HPO
Modes of experimentation:
Local versus distributed tuning
Running experiments in parallel
Simulating experiments from tabulated or surrogate benchmarks
Benchmarking, comparative studies:
Specifying a benchmark
Designing and running a study (including paired comparisons)
Visualization of results
Comparing methods across multiple benchmarks
Conclusions: Would you like to be part of this journey?
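The multi-fidelity HPO covered in the basic-concepts section can be illustrated with a toy synchronous successive halving loop, the core mechanism underlying methods such as Hyperband and ASHA. The synthetic learning curve and all names below are illustrative assumptions, not Syne Tune code.

```python
import random

def learning_curve(config, epochs):
    # Pretend validation error after `epochs` of training: improves with
    # longer training, and depends on how close the learning rate is to 0.1.
    return (config["lr"] - 0.1) ** 2 + 1.0 / (1 + epochs)

def successive_halving(configs, min_epochs=1, eta=3, rounds=3):
    """Evaluate all configs at a low fidelity, keep the top 1/eta,
    and re-evaluate the survivors at eta times the budget."""
    epochs, survivors = min_epochs, configs
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: learning_curve(c, epochs))
        survivors = scored[: max(1, len(scored) // eta)]  # keep top 1/eta
        epochs *= eta                                     # train longer
    return survivors[0]

rng = random.Random(0)
configs = [{"lr": rng.uniform(0.0, 1.0)} for _ in range(27)]
best = successive_halving(configs)
print(best)
```

With 27 configurations and eta=3, the survivor counts shrink 27 → 9 → 3 → 1, so most of the budget is spent on promising configurations; asynchronous variants (ASHA) make the same promotion decisions without waiting for whole rungs to finish.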