Server Cluster

A server in AETROS Trainer is a machine that provides computing resources (CPU, memory, GPU). Several servers together form a cluster. You can connect any machine to your cluster in AETROS Trainer using the aetros server command. When you create a new job, resources from your cluster are reserved and a server is assigned to the job automatically, so the job runs on a machine with enough free resources.

Since all jobs run in a Docker container by default, the resources a job requests are enforced as limits by the Docker Engine.
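To make the idea of Docker-enforced limits concrete, here is a minimal sketch that prints a docker run command line with a CPU and memory cap. The flags --cpus and --memory are standard Docker options, but whether AETROS uses exactly these flags internally is an assumption, and the function name, image, and values are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch only: build (and print, not execute) a `docker run` command line
# with resource limits. --cpus and --memory are standard Docker flags;
# that AETROS passes exactly these flags is an assumption.
docker_limit_cmd() {
    cpus="$1"      # number of CPU cores the container may use, e.g. 2
    memory="$2"    # memory limit with unit suffix, e.g. 4g
    image="$3"     # placeholder image name
    printf 'docker run --rm --cpus=%s --memory=%s %s\n' "$cpus" "$memory" "$image"
}

# Print the command for a container limited to 2 cores and 4 GB of RAM.
docker_limit_cmd 2 4g ubuntu:16.04
```

A container started this way cannot use more than the given CPU share and is stopped by the kernel if it exceeds the memory limit, which is the kind of isolation that lets several jobs share one server.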

Your benefits

  1. Train on remote hardware
  2. Monitor server utilization (CPU, RAM, bandwidth, disk, processes, running jobs)
  3. Start jobs through the interface
  4. Start fully automated hyperparameter optimizations
  5. Run builds (continuous integration)

Architecture

Cluster overview

In the cluster modal, you see all of your servers, their hardware utilization, and their reserved resources. Click on a server to get a more detailed view of it.

Server monitoring

In the server view, you see the utilization of all devices (GPU, memory, CPU) as well as network bandwidth, disk usage, available resources, the jobs currently running on that server, and more.

Step 1: Create server

To create a server, open AETROS Trainer, click "Cluster" in the top left bar, and click "CREATE SERVER". Enter a unique name and click "CREATE"; the newly created server then becomes visible. In the middle of the screen you see a command that you need to run on your actual server to connect it to AETROS.

Step 2: Prepare server

Install requirements

  1. Install Docker
  2. Install nvidia-docker2 (only necessary for GPU support, Ubuntu 16.04 only)
  3. Install Python
  4. Install aetros-cli
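Before connecting the server, it can help to confirm that the tools listed above are actually on the PATH. The following is a minimal sketch; the function name missing_tools is made up for this example, and the binary names (docker, python, aetros) are assumptions that may differ on your system (for instance python3 instead of python).

```shell
#!/bin/sh
# Print the name of every given tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the requirements from the list above. The binary names are
# assumptions; adjust them to your installation.
missing=$(missing_tools docker python aetros)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

For GPU servers you could add the NVIDIA container runtime binary to the list in the same way.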

Step 2.1: Connect the server using SSH keys

By default, everything in AETROS is authenticated using an SSH key pair. You can create one automatically with the aetros authenticate command. This command creates a new pair of SSH keys and associates it with your AETROS account, which authenticates the server against AETROS. Just execute it and follow the output.

aetros authenticate
aetros server marcj/server-name
Connected to aetros.com as username/servername

Step 2.2: Using magic token

Alternatively, you can create a temporary SSH key pair using the --generate-ssh-key=TOKEN argument. The key pair exists only while the command is running and is deleted once the command exits. You see the TOKEN in the server view in AETROS Trainer. It's important to keep the token you pass to --generate-ssh-key private.

aetros server marcj/server-name --generate-ssh-key=SECURE_KEY
Connected to aetros.com as username/servername

Replace SECURE_KEY with the actual server key that you see when you open the server in AETROS Trainer. Your server should now appear as online in AETROS Trainer.
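One way to keep the key out of your shell history is to store it in a file that only your user can read and pass its content to the command. This is a sketch under assumptions: the file location and the helper name read_token are hypothetical, and only the --generate-ssh-key argument shown above is used. Note that the key can still be visible in the process list while the command runs.

```shell
#!/bin/sh
# Hypothetical location for the key; any file readable only by you works.
TOKEN_FILE="${TOKEN_FILE:-$HOME/.aetros-server-token}"

# Print the first line of the given file, or fail if it cannot be read.
read_token() {
    [ -r "$1" ] || { echo "token file $1 not readable" >&2; return 1; }
    head -n 1 "$1"
}

# Usage (commented out so this sketch has no side effects):
# chmod 600 "$TOKEN_FILE"
# aetros server marcj/server-name --generate-ssh-key="$(read_token "$TOKEN_FILE")"
```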


Your server is now connected and you can start creating jobs on it using aetros start or aetros run.

Step 3: Daemonize (optional)

To connect your server to AETROS automatically on boot and to make sure the command restarts automatically when it crashes, you can use supervisord.
Here's a short introduction to installing and configuring supervisord with aetros-cli.

3.1 Installation

sudo apt-get install supervisor

3.2 Add aetros-cli to supervisor

First, find the full path of aetros-cli by running the following command:

which aetros
/usr/local/bin/aetros

So, the command path is /usr/local/bin/aetros. We need to use this full path in our supervisor configuration file.

[program:long_script]
command=/usr/local/bin/aetros server marcj/server-name --generate-ssh-key=SECURE_KEY
autostart=true
autorestart=true
stderr_logfile=/var/log/aetros-server.err.log
stdout_logfile=/var/log/aetros-server.out.log

Replace SECURE_KEY with the actual server key that you see when you open the server in AETROS Trainer. Make sure this key stays private.
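If you prefer to generate the configuration rather than type it, the following sketch prints a program section from the aetros path, server name, and key. The function name and the program name aetros-server are made up for this example, and the target path /etc/supervisor/conf.d/ in the commented usage is an assumption based on the default layout of the Debian and Ubuntu supervisor package.

```shell
#!/bin/sh
# Print a supervisor program section for `aetros server`.
# Arguments: 1 = full path to aetros, 2 = server name, 3 = server key.
aetros_supervisor_conf() {
    cat <<EOF
[program:aetros-server]
command=$1 server $2 --generate-ssh-key=$3
autostart=true
autorestart=true
stderr_logfile=/var/log/aetros-server.err.log
stdout_logfile=/var/log/aetros-server.out.log
EOF
}

# Usage (commented out; the target path is an assumption, see above):
# aetros_supervisor_conf "$(command -v aetros)" marcj/server-name SECURE_KEY \
#     | sudo tee /etc/supervisor/conf.d/aetros-server.conf >/dev/null
# sudo chmod 600 /etc/supervisor/conf.d/aetros-server.conf
```

Restricting the file to mode 600 keeps the key it contains from being readable by other users on the machine.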

3.3 Refresh supervisor

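Tell supervisor to reread its configuration files and then apply the changes. The reread command only loads the new configuration; update starts the newly added program.

supervisorctl reread
supervisorctl update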

aetros server should now start immediately. You can find more information in the article "How To Install and Manage Supervisor on Ubuntu and Debian VPS".