Automatic hyperparameter optimization

When tackling a machine learning task, your goal is to optimize a performance measure of your algorithm, like accuracy or error rate. You can improve your results in several ways, including gathering better training data or better model selection. A particularly difficult challenge is posed by the choice of your model's hyperparameters, be it the learning rate, the choice of optimizer or any other non-learnable parameter of an algorithm.

Tuning hyperparameters is hard, because it is computationally costly and a full grid search over a valid range of parameters is very rarely tractable. On top of that, our human intuition for guessing good parameters is very limited and often worse than randomly searching the hyperparameter space. A third aspect that makes this kind of tuning tough is that you often end up with a lot of boilerplate code in your experiments, nested loops that are hard to adapt, or custom evaluation scripts. It is for reasons like these that automatic hyperparameter optimization tools such as Spearmint or Hyperopt have been developed. While they have the added benefit of implementing more sophisticated optimization algorithms than plain random search, these libraries still lack convenience when it comes to your machine learning workflow.

At AETROS, we have been developing a very convenient interface for you to do state-of-the-art hyperparameter optimization for any model of your choosing, making this task easier than ever. Furthermore, AETROS supports automatic and easy-to-set-up scale-out of optimization to multiple machines, with no complex configuration of distributed databases needed. Under the hood, AETROS uses the core of Hyperopt’s optimization suite to bring you advanced algorithms such as TPE. However, instead of Hyperopt's "Parallelizing Evaluations During Search via MongoDB" approach, which requires a lot of up-front preparation, such as setting up a MongoDB instance, a job scheduler and a results evaluation script, AETROS does distributed optimization almost out of the box.

With our solution the only thing you need is:

  1. Your Python model
  2. Our SDK
  3. A training machine (or several)
  4. An AETROS account

We will describe the process in more detail later on, but on a high level, to optimize hyperparameters with our solution, all you need to do is:

  1. Define hyperparameter ranges in AETROS Trainer.
  2. Set up at least one server (this can be your local machine).
  3. Use the previously defined parameter ranges in your Python model.
  4. Send performance metrics to AETROS using our SDK.
  5. Start the optimization experiment and wait for the results to come in.

It is important to note that in the simplest setup the only thing you need is a computer and a script with your Python model. In particular, we are not coupled to any deep learning framework. In fact, we're not even bound to machine learning. You could use AETROS for any other kind of optimization problem, such as optimal packing problems.

Feature summary

  • Automatic search of better hyperparameters
  • Three different optimization algorithms (Random, TPE, Annealing)
  • Automatic training job distribution across multiple servers
  • Very detailed and convenient hyperparameter space definition through interface
  • Runtime constraints like max epochs and max time
  • Watch the process, results and metrics in real-time in AETROS Trainer
  • Completely based on Git
  • All results can be exported as CSV

Glossary

  • Hyperparameter
    Refers to parameters that influence model performance but cannot or should not be learned by the model itself. They are called hyperparameters to distinguish them from the regular, learnable parameters of your model, e.g. weights and biases. Examples: learning rate, choice of optimizer, dropout probability, number of neurons per layer in a neural network.
  • KPI
    Key performance indicator: a metric that indicates how well your model performs, for instance loss or accuracy. KPIs are sent to AETROS through channels.
  • Server
    In AETROS a server is a normal computer or dedicated machine on which you run the aetros server command. This allows you to automatically start jobs without logging in via SSH.
  • Job
    A job refers to one full training run of your model within an optimization experiment. This is usually just your Python script being started, which trains your model.
  • Trial
    A trial is a training run with one specific set of hyperparameters chosen by the respective algorithm. For the TPE algorithm, the first trials are initialized with random hyperparameters during the warm-up phase.
  • Initial trial
    A trial with random hyperparameters, started during the warm-up phase of the TPE algorithm.

How it works

The basic idea is to train your model with a specific set of parameters, i.e. run a trial, and receive a performance metric, i.e. a KPI. Based on that information we know which parameters worked well, so we are able to calculate good guesses for the parameters to use in the next trial. This process is repeated as often as you wish and should lead to better KPI results over time. In our job browser you can monitor exactly which hyperparameters performed well and which didn't. With a constrained number of epochs/iterations or a time limit, you can find out which hyperparameters perform better at an early stage, for example to build a smaller (and thus faster) network or to roughly filter out very bad regions of the hyperparameter space.

The initial step is to ask Hyperopt for random parameters during the first initial trials. Once the initial trials are handled, we fire up Hyperopt's TPE and feed it with the history of all previous trials (random trials and previous TPE trials), so that TPE can calculate the next good guess for us. This process is repeated as often as you want and should lead to better KPIs the more trials you start, until it reaches the best possible KPI for the given hyperparameter space, model and training data.
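
Conceptually, the loop behind this looks roughly like the following Python sketch. It is only an illustration of the idea, not AETROS or Hyperopt code; sample_random, tpe_suggest, run_trial and the variables max_tries, initial_random_tries and space are hypothetical placeholders that correspond to the settings described later.

history = []

for i in range(max_tries):
    if i < initial_random_tries:
        params = sample_random(space)         # warm-up phase: random hyperparameters
    else:
        params = tpe_suggest(space, history)  # TPE uses all previous trials as history
    kpi = run_trial(params)                   # start a job, train the model, read its KPI channel
    history.append((params, kpi))

# by default the KPI is maximized
best_params, best_kpi = max(history, key=lambda entry: entry[1])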

We currently support three optimization algorithms:

  • Random
    Starts trials based on random hyperparameters.
  • TPE
    Hyperopt's Tree-of-Parzen-Estimators (TPE) algorithm, see this paper for an introduction:
    Hyperopt: A Python Library for Optimizing the Hyperparameters of Machine Learning Algorithms. It starts trials with random hyperparameters during the warm-up phase (initial trials). After warming up, it starts trials with hyperparameters calculated by the TPE algorithm.
  • Annealing
    Annealing is a simple but effective variant on random search that takes some advantage of a smooth response surface.

Technical

In your model you receive hyperparameters from AETROS using JobBackend.get_parameter(name) and send us your KPI via JobChannel.send(x, y). To do so, you need to integrate a few lines of code in your Python script using our Python SDK. You can see examples in the step-by-step tutorial below.
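
For orientation, a minimal integration might look like the sketch below. The model name 'username/model', the parameter name 'learning_rate' and the values sent are placeholders; the calls start_job, get_parameter, create_channel, send and done are the ones used throughout this guide.

import aetros.backend

job = aetros.backend.start_job('username/model')

learning_rate = job.get_parameter('learning_rate')  # a hyperparameter defined in AETROS Trainer

# kpi=True marks this channel as the metric the optimization uses
accuracy_channel = job.create_channel('accuracy', kpi=True, main=True)

for epoch in range(1, 11):
    # ... train one epoch of your model using learning_rate ...
    accuracy_channel.send(epoch, 0.9)  # send (x, y) pairs while training

job.done()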

How it looks

(Screenshot: AETROS Trainer when an optimization is done.)


Step by step

1. Create a model

Open AETROS Trainer and create a new custom Python model. Once created, open the "CODE" tab and enter your Git URL and the Python filename of the script that starts a training of your model.
If you want to play around with it locally, you can enter a local Git URL like file:///Users/peter/models/my-model-directory/. Make sure you have initialized a Git repository inside this directory, since our training distribution across multiple servers in AETROS Trainer is completely based on Git. We always check out the latest commit of the defined branch once a new job is started.

cd /Users/peter/models/my-model-directory/
git init
# change your model
git add model.py
git commit -m "new version"

2. Connect server

Create and connect at least one server with AETROS; see the documentation External server infrastructure. You can of course use your computer/notebook for this as well.
Create a server in AETROS Trainer, install the aetros SDK via pip and execute the command shown in our interface.

pip install aetros
aetros server --secure-key=SECURE_KEY

Now you should see your machine in AETROS Trainer as online, allowing you to start the given model on this machine at any time from any place on earth.

You are free to create and connect several servers with AETROS. In the create dialog of an optimization you can select as many servers as you want. AETROS makes sure that all servers are equally busy based on their max-parallel-jobs setting and queue length.

3. Define hyperparameters

You define your hyperparameter defaults in AETROS Trainer. Open your model and click on the "HYPERPARAMETERS" tab; here you can add as many hyperparameters as you want using the + Add hyperparameter button.

You can choose between several types, allowing you to build very complex scenarios.

  1. String Simple string.
  2. Number Simple integers and floats.
  3. Boolean Simple booleans true/false.
  4. Group Allows you to define categories to bundle hyperparameters in groups.
  5. Choice: String A list of strings, allowing you to (also automatically) choose exactly one of them.
  6. Choice: Number A list of numbers, allowing you to (also automatically) choose exactly one of them.
  7. Choice: Group A list of groups, allowing you to (also automatically) choose exactly one of them. Perfect for use cases where you define multiple optimizers with different parameters and test only one of them for one particular job or trial (see the sketch below).
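
As a rough illustration, a "Choice: String" or "Choice: Group" hyperparameter could be consumed in your model as sketched below. The parameter names 'optimizer', 'sgd.lr' and 'adam.lr' are assumptions made for this example (following the dotted naming used elsewhere in this guide), and job/model are set up as in the other snippets.

from keras.optimizers import SGD, Adam

# Hypothetical parameter names; branch on whichever choice AETROS picked for this trial.
optimizer_name = job.get_parameter('optimizer')

if optimizer_name == 'sgd':
    optimizer = SGD(lr=job.get_parameter('sgd.lr'))
else:
    optimizer = Adam(lr=job.get_parameter('adam.lr'))

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])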

4. Use hyperparameters

Now it's time to use the hyperparameters from AETROS in your actual model. Make sure you have already read Python SDK: Getting started, since you should already have integrated our Python SDK in your model.

To read a hyperparameter from the current job, just use job.get_parameter(name). See an example here:

import aetros.backend
import time

# Keras imports for the snippet below (Keras 1 API, as in the linked example)
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense, Activation, Dropout

job = aetros.backend.start_job('username/model')

batch_size = job.get_parameter('batch_size')
nb_filters = job.get_parameter('nb_filters')

# kernel_size and input_shape are defined earlier in the full example
model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid', input_shape=input_shape))
model.add(Flatten())
model.add(Dense(job.get_parameter('first_dense.neurons')))
model.add(Activation('relu'))
model.add(Dropout(job.get_parameter('first_dense.dropout')))
# ...
The full example can be seen at: github.com/marcj/aetros-keras-mnist_cnn/commit/74f9412.

5. Send KPI

To understand how well a group of hyperparameters performed, we need to receive information about the KPI during the training. You do this by using channels. The main difference between a normal channel/metric and a KPI is that you set kpi=True on that channel. Only one channel can be the KPI. Example:

job = aetros.backend.start_job('username/model')

# kpi=True to tell AETROS this is the metric we want to optimize
# main=True to see that metric in the job browser
accuracy_channel = job.create_channel('accuracy', kpi=True, main=True, yaxis={'dtick': 10})

# training
accuracy_channel.send(1, 98.9)
accuracy_channel.send(2, 99.1)
# ...

job.done()
By default, AETROS tries to maximize a metric. You can also tell the optimization to minimize a metric by setting max_optimization=False. Example:
job = aetros.backend.start_job('username/model')

# kpi=True to tell AETROS this is the metric we want to optimize
# main=True to see that metric in the job browser
# max_optimization=False to tell AETROS we want to minimize this metric
loss_channel = job.create_channel('loss', kpi=True, main=True, max_optimization=False)

# training
loss_channel.send(1, 0.5512)
loss_channel.send(2, 0.5133)

Before you start an optimization, make sure AETROS received your channels/metrics correctly by creating a job manually and looking at it in AETROS Trainer.

6. Create optimization

Open your model in AETROS Trainer and click on the + Optimization button at the top right. A Create optimization dialog pops up. Here you can define your hyperparameter space and specify exactly which parameters should be searched automatically and which should stay static.

Hyperparameter space

A hyperparameter space is a range of values for each hyperparameter. You can define, for example, how many neurons are allowed to be used. As in the screenshot below, we define second_dense.neurons as 10 -> 128 in steps of 1, and second_dense.dropout as 0 -> 0.5 in steps of 0.005. We also make the whole second_dense layer optional by letting the algorithm decide automatically whether second_dense.active is True or False.
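
To make the example concrete, such an optional group could be consumed in your model roughly like this; the parameter names match the second_dense example above, while the assumption that second_dense.active arrives as a boolean is ours.

# Sketch: only add the second dense layer if the optimizer activated it for this trial.
if job.get_parameter('second_dense.active'):
    model.add(Dense(job.get_parameter('second_dense.neurons')))
    model.add(Activation('relu'))
    model.add(Dropout(job.get_parameter('second_dense.dropout')))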

For hyperparameters of type "Number" you have the most options to choose from. To get more information about each option, please read Hyperopt's parameter expressions.

If you don't activate AUTO, this hyperparameter will not be touched by our optimization and stays fixed, like batch_size in the screenshot below.

Optimization settings

Besides the actual hyperparameter space, it's also important to define good fundamental settings, like how long one trial is allowed to run (how many epochs or how many minutes) or how many random trials are created before TPE kicks in.

  • Max epochs
    How many epochs one trial is allowed to run. Use job.progress(x, total) in your model to track progress; when x reaches max epochs, the trial automatically ends by receiving a SIGINT signal from our SDK (see the sketch after this list). This is very handy if your actual training run takes very long and you usually see very quickly whether the model performs well or not. Only in very rare cases should you leave epochs unlimited (for example when a whole model training takes only a few minutes).
  • Max time
    Same as above, but constrained by the actual elapsed time. As above, the model receives a SIGINT signal once the limit is reached.
  • Initial random tries (TPE only)
    During the warm-up phase, the optimization starts only trials with random hyperparameters. This is important so that TPE is seeded before it starts. Make sure to create enough random tries; the more hyperparameters you have, the more random tries you should start.
  • Max parallel
    If you choose to let the optimization run on several servers, you should make sure that this setting is equal to or greater than the chosen server count. It ensures that at most this many jobs run at the same time. It's important not to choose this number too high, since hyperparameters for each job are calculated when the job is started, and only previously finished jobs are included in this calculation.
  • Max tries
    How many trials are allowed to be started overall (including random tries and calculated tries).
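
Here is a small sketch of how job.progress could be used inside a typical training loop; total_epochs and train_one_epoch are placeholders, and the accuracy channel is created as shown earlier.

total_epochs = 50  # placeholder value

for epoch in range(1, total_epochs + 1):
    accuracy = train_one_epoch()            # hypothetical helper: trains one epoch, returns the KPI value
    job.progress(epoch, total_epochs)       # lets AETROS enforce the "Max epochs" limit
    accuracy_channel.send(epoch, accuracy)
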
Handle early stops

If you activate Max epochs or Max time, our SDK sends your process a SIGINT signal. Depending on the framework you use, you may need to stop the training process manually to shut down the process. Most of the time you don't need to do anything, but in case your process does not exit, try registering a signal handler and make sure your process terminates correctly.

import signal
import aetros.backend

job = aetros.backend.start_job('username/model')
loss_channel = job.create_channel('loss', kpi=True, main=True, max_optimization=False)

# register signal handler
def stop(sig, frame):
    print("early stop received")
    # do something to stop the training process

signal.signal(signal.SIGINT, stop)

for i in range(0, 1000):
    # ... training step producing `loss` ...
    loss_channel.send(i, loss)

Server

Don't forget to choose on which servers the optimization is allowed to start jobs. Scroll down in the Create optimization dialog to see all servers registered for your account. If you select more than one server, AETROS makes sure that all servers are equally busy during the optimization, based on three pieces of server information: currently running jobs, max parallel jobs and current queue length.

Press "Create" to create your optimization and its first jobs.

7. Watch the progress

Once you have created the optimization, you should see a new optimization with one or more CREATED, QUEUED or TRAINING jobs under the "OPTIMIZATIONS" tab. You can now grab a coffee or go to sleep while it automatically finds the best hyperparameters for you.

(Screenshots: jobs during the warming up (TPE) phase, during training, and when done.)

Hover over a job to see its hyperparameters, progress, status and more information. Click on a job to go to its detail view and see all channels, logs and other stats.

8. Evaluate

If your optimization is performing badly, you can either press "Stop" and create a new one with a better hyperparameter space, or adjust your model, commit it to your Git repository, and press "Restart", which basically deletes all jobs of this optimization, builds completely new hyperparameters and starts new jobs with the new Git version.

When your optimization is done, you can press Show in job browser, which opens the job browser containing only jobs from the chosen optimization. You can now evaluate which hyperparameters performed best by sorting and looking at the stats, or export all job information as CSV and do some Excel magic.
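
If you prefer scripting over a spreadsheet, a quick pandas sketch along these lines can do the same sorting. The file name and the column names are assumptions about the exported CSV, not a documented format.

import pandas as pd

jobs = pd.read_csv('optimization_jobs.csv')  # hypothetical export file name

# Sort by the KPI column, best first (descending for a maximized KPI).
best = jobs.sort_values('accuracy', ascending=False).head(10)
print(best[['accuracy', 'batch_size', 'first_dense.neurons', 'first_dense.dropout']])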