When tackling a machine learning task, your goal is to optimize a performance measure of your algorithm, like accuracy or error rate. You can improve your results in several ways, including gathering better training data or better model selection. A particularly difficult challenge is posed by the choice of hyperparameters of your model, be it the learning rate, the choice of optimizer or any other non-learnable parameter of an algorithm.

Tuning hyperparameters is hard because it is computationally costly, and it is very rarely tractable to do a full grid search over a valid range of parameters. On top of that, our human intuition for guessing good parameters is very limited and often worse than randomly searching the hyperparameter space. A third aspect that makes this kind of tuning tough is that you often end up with a lot of boilerplate code in your experiments, nested loops that are hard to adapt, or custom evaluation scripts. It is for reasons like these that automatic hyperparameter optimization tools like Spearmint or Hyperopt have been developed. While these libraries have the added benefit of implementing more sophisticated optimization algorithms than plain random search, they still lack convenience when it comes to your machine learning workflow.
At AETROS, we have been developing a very convenient interface for you to do state-of-the-art hyperparameter optimization for any model of your choosing, making this task easier than ever. Furthermore, AETROS supports automatic and easy-to-set-up scale-out of optimization to multiple machines, with no complex configuration of distributed databases needed. Under the hood, AETROS uses the core of Hyperopt's optimization suite to bring you advanced algorithms such as TPE. However, instead of following Hyperopt's "Parallelizing Evaluations During Search via MongoDB" approach, which requires a lot of up-front preparation, such as setting up a MongoDB instance, a job scheduler and a results evaluation script, AETROS does distributed optimization almost out of the box.
We will describe the process in more detail later on, but on a high level, to optimize hyperparameters with our solution, the only things you need to do are:

1. Create a model in AETROS Trainer and point it to your git repository.
2. Connect at least one server with AETROS.
3. Define your hyperparameters in aetros.yml and read them in your script via the Job API.
4. Send your KPI to AETROS via channels.
5. Create an optimization, defining the hyperparameter space and fundamental settings such as max epochs and the warming up phase for the TPE algorithm.
The basic idea is to train your model with a specific set of parameters, i.e. run a trial, and receive a performance metric, i.e. a KPI. Based on that information we know which parameters worked well, so we are able to calculate good guesses for which parameters should be used in the next trial. This process is repeated as often as you wish and should lead to better KPI results over time. You can monitor in our job browser exactly which hyperparameters performed well and which didn't. With a constrained range of epochs/iterations or time, you can find out which hyperparameters performed better at an early stage, for example to build a smaller (and thus faster) network or to roughly filter out very bad regions of the hyperparameter space.
The initial step is to ask Hyperopt for random parameters during the first trials. Once these initial trials are handled, we fire up Hyperopt's TPE and feed it with the history of all previous trials (random trials and previous TPE trials), so Hyperopt's TPE calculates the next good guess for us. This process is repeated as often as you want and should lead to better KPIs the more trials you run, until it reaches the best possible KPI for the given hyperparameter space, model and training data.
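If you are curious what this loop looks like in plain Hyperopt, here is a minimal sketch of the same idea, random warm-up trials followed by TPE suggestions. The objective function and search space are made up for illustration; AETROS handles all of this for you:

```python
from functools import partial

from hyperopt import Trials, fmin, hp, tpe

# Hypothetical objective: train a model with the given parameters
# and return the KPI to minimize (e.g. validation error).
def objective(params):
    return (params['lr'] - 0.01) ** 2 + params['dropout'] * 0.1

space = {
    'lr': hp.loguniform('lr', -7, 0),
    'dropout': hp.quniform('dropout', 0, 0.5, 0.005),
}

trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    # n_startup_jobs: number of random trials before TPE kicks in,
    # i.e. the "warming up" phase described above.
    algo=partial(tpe.suggest, n_startup_jobs=10),
    max_evals=50,
    trials=trials,
)
print(best)
```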
We currently support three optimization algorithms:
In your model you receive hyperparameters from AETROS using Job API: Hyperparameters and send us your KPI via Job API: Channels. To do so, you need to integrate a few lines of code into your Python script using our Python SDK. You'll see examples in the step-by-step tutorial below.
Open AETROS Trainer and create a new model.
Once created, open the "CODE" tab and enter your git URL and the Python filename of the script that starts a training of your model.
If you want to play around with it locally, you can enter a local git URL like file:///Users/peter/models/my-model-directory/. Make sure you have initialized a git repository inside this directory, since our training distribution across multiple servers in AETROS Trainer is completely based on git. We always check out the latest commit of the defined branch once a new job is started.
Create and connect at least one server with AETROS; see the Server Cluster documentation.
You can of course use your own computer/notebook for this as well.
Create a server in AETROS Trainer, install the aetros SDK via pip and execute the command shown in our interface.
Now you should see your machine as online in AETROS Trainer, allowing you to start the given model on this machine at any time, from any place on earth.
You are of course free to create and connect several servers with AETROS. In the create dialog of an optimization you can select as many servers as you want. AETROS makes sure that all servers are equally busy based on their max-parallel-jobs setting and queue length.
You define all hyperparameters with their defaults in aetros.yml. See the Configuration: parameters chapter for more information.
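As an illustration only, a parameters section in aetros.yml could look roughly like this; the real schema is documented in the Configuration: parameters chapter, so treat these keys and names as assumptions:

```yaml
# aetros.yml (illustrative sketch; see Configuration: parameters
# for the actual schema)
parameters:
  lr: 0.001        # default learning rate
  dropout: 0.25    # default dropout rate
  optimizer: adam  # default optimizer
```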
Now it's time to use the hyperparameters from AETROS in your actual model. Make sure you have already read Job API: Getting started, since at this point you should already have a configuration file with hyperparameters.
To read a hyperparameter from the current job, just use job.get_parameter(name).
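Here is a minimal example sketch; the parameter names are illustrative, and the way you obtain the job object (shown here as aetros.backend.context()) may differ between SDK versions, see Job API: Getting started:

```python
import aetros.backend

# Obtain the job context as described in Job API: Getting started
# (the exact call may differ between SDK versions).
job = aetros.backend.context()

# Read hyperparameters for the current trial; the names must match
# those defined in aetros.yml (these particular names are illustrative).
lr = job.get_parameter('lr')
dropout = job.get_parameter('dropout')

print('Training with lr=%s, dropout=%s' % (lr, dropout))
# ... build and train your model with these values ...
```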
To understand how well a group of hyperparameters performed, we need information about the KPI during the training. You provide this by using channels.
The main difference between a normal channel/metric and a KPI is that you set kpi=True on that channel. Only one channel can be the KPI.
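For example, continuing the snippet above; create_channel and send are the method names assumed here from Job API: Channels, and the training step is a stand-in:

```python
import random

# Hypothetical stand-in for one epoch of real training that
# returns a validation accuracy.
def train_one_epoch():
    return random.random()

# Create a channel that AETROS treats as the KPI of this trial;
# only one channel may have kpi=True.
kpi_channel = job.create_channel('validation_accuracy', kpi=True)

for epoch in range(10):
    val_acc = train_one_epoch()
    # The optimizer uses the KPI to judge how well this set of
    # hyperparameters performed.
    kpi_channel.send(epoch, val_acc)
```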
Make sure AETROS receives your channels/metrics correctly before you start an optimization, by creating a job manually and looking into AETROS Trainer.
Open your model in AETROS Trainer and click on the + Optimization button at the top right. A Create optimization dialog pops up.
Here you can define your hyperparameter space and specify exactly which parameters should be searched automatically and which should stay static.
A hyperparameter space is a range of values for each hyperparameter. For example, you can define how many neurons are allowed to be used. As in the screenshot below, we define 10 -> 128 in steps of 1 for the number of neurons, and 0 -> 0.5 in steps of 0.005 for another parameter. Also, we make the whole second_dense layer optional by letting the algorithm decide automatically whether it should be used at all.
For hyperparameters of type "Number" you have the most options to choose from. For more information about each expression, please read Hyperopt's parameter expressions.
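Since AETROS uses Hyperopt under the hood, these options correspond roughly to Hyperopt's parameter expressions. For reference, the underlying expressions look like this in plain Hyperopt; the exact mapping AETROS applies is an assumption:

```python
from hyperopt import hp

space = {
    # Uniform float between low and high.
    'dropout': hp.uniform('dropout', 0.0, 0.5),
    # Quantized uniform: uniform rounded to multiples of q,
    # e.g. 10 -> 128 neurons in steps of 1.
    'neurons': hp.quniform('neurons', 10, 128, 1),
    # Log-uniform, useful for learning rates spanning magnitudes.
    'lr': hp.loguniform('lr', -7, 0),
    # Categorical choice between discrete options.
    'optimizer': hp.choice('optimizer', ['adam', 'sgd', 'rmsprop']),
}
```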
If you don't activate AUTO, this hyperparameter will not be touched by our optimization and stays fixed, as in the screenshot below.
Besides the actual hyperparameter space, it's also important to define good fundamental settings, like how long one trial is allowed to run (how many epochs or how many minutes) or how many random trials are created before TPE kicks in.
Use job.progress(x, total) to track the progress; if a trial reaches the configured max epochs, it will automatically be ended by receiving a SIGINT signal from our SDK. This is very handy if your actual training runs take very long and you usually see very quickly whether the model performs well or not. Only in very rare cases should you not limit epochs (for example when a whole model training takes only several minutes).
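For illustration, progress tracking inside a training loop could look like this, with job obtained as in the earlier snippet and an arbitrary epoch count:

```python
total_epochs = 50

for epoch in range(total_epochs):
    # ... one epoch of your actual training here ...
    # Report progress so our SDK can end the trial with SIGINT
    # once the configured max epochs is reached.
    job.progress(epoch + 1, total_epochs)
```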
During the warming up phase, the optimization starts only trials with random hyperparameters. This is important so that TPE is seeded before it starts. Make sure to create enough random trials; you should start more random trials if you have more hyperparameters.
Max parallel limits how many jobs run at the same time. It's important to note that you should not choose this number too high, since the hyperparameters for each job are calculated when it is started, and only previously finished jobs are included in this calculation.
If you activate Max epochs or Max time, our SDK sends your process a SIGINT signal. Depending on the framework you use, you may need to stop the training process manually to shut down the process. Most of the time you don't need to do anything, but in case your process does not exit, try registering a signal handler and make sure your process terminates correctly.
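In that case, a minimal signal handler sketch in plain Python could look like this; how you actually stop your training loop depends on your framework:

```python
import signal

# Flag checked by the training loop so we can finish the current
# step and exit cleanly when AETROS sends SIGINT.
stop_requested = False

def handle_sigint(signum, frame):
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, handle_sigint)

for epoch in range(50):
    if stop_requested:
        print('SIGINT received, stopping training cleanly.')
        break
    # ... one epoch of your actual training here ...
```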
Don't forget to choose on which servers the optimization is allowed to start jobs. Scroll down in the dialog to see all servers registered for your account. If you select more than one server, AETROS makes sure that all servers are equally busy during the optimization, based on three pieces of server information: current running jobs, max parallel jobs and current queue length.
Press "Create" to create your optimization and its first jobs.
Once you have created the optimization, you should see under the "OPTIMIZATIONS" tab a new optimization with one or more jobs.
You can now grab a coffee or go to sleep while AETROS automatically finds the best hyperparameters for you.
Hover over a job to see its hyperparameters, progress, status and more information. Click on a job to go to its detail view and see all channels, logs and other stats.
If your optimization is performing badly, you can either press "Stop" and create a new one with a better hyperparameter space, or adjust your model, commit it to your git repository, and press "Restart", which basically deletes all jobs of this optimization, builds completely new hyperparameters and starts new jobs with the new git version.
Once your optimization is done, you can press Show in job browser, which opens the job browser containing only jobs from the chosen optimization.
You can now evaluate which hyperparameters performed best by sorting and looking at the stats, or export all job information as CSV and do some Excel magic.