Getting started

AETROS Trainer is at its heart a build system in which you can start and monitor jobs of any kind. A job can normalize data, run an experiment, move and transform data, or simply train a neural network. The machine-learning features we added, such as GPU resource assignment, experiment monitoring, experiment comparison/reproducibility, and more, set AETROS Trainer apart from regular build systems like Jenkins and CI solutions like Travis and CircleCI. You can not only monitor simple jobs such as a shell command, but also use our Python SDK to send machine-learning-specific analytics data, which is displayed and versioned in AETROS Trainer in real time to give you good insight into your machine learning algorithms.

AETROS CLI installation

Users can use AETROS CLI, an open-source command-line application, on their machine to start jobs locally or on the server cluster, list jobs, connect the machine as a computing node in the server cluster, and much more. See the section "CLI COMMANDS" on the left side for all commands. To do that, you need to install AETROS CLI and authenticate your local machine with AETROS Trainer.

Install PIP package

Requirements:
  • Python 2 or 3
  • pip version >= 9
  • Docker (optional, to run local jobs in Docker containers)
# Make sure pip and setuptools are up to date
pip install --upgrade pip setuptools
# Install the aetros package
pip install aetros

Install Git and SSH

Since all jobs are based on Git, you have to install Git version 2 as well as an SSH (OpenSSH) client. Make sure git and ssh are available in your PATH variable. The OpenSSH client often ships with Git, but if ssh is not available after installing Git, install it for your operating system.
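To confirm both tools are reachable before continuing, a quick check like the following can help. This is plain POSIX shell and nothing AETROS-specific:

```shell
# Verify that git (version 2+) and ssh are available on the PATH
for tool in git ssh; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: NOT found - install it before continuing"
  fi
done

# Print the Git version so you can confirm it is 2.x or newer
git --version 2>/dev/null || true
```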

Setup authentication

If you use an on-premises AETROS Trainer installation, you need to configure AETROS CLI to point to it before doing the following. Do this by following the On-Premises: Configure clients documentation.

Whenever you use commands or the SDK of the aetros-cli package, your local machine connects to AETROS Trainer through a secure SSH connection, either to the official cloud server at trainer.aetros.com or to your on-premises installation. You first need to configure SSH keys on your local machine to authenticate it.
We automatically create SSH keys for you and store the public key in your AETROS Trainer account when you use the aetros authenticate CLI command. If you wish to configure SSH manually, please read the Authentication: Manual way.

aetros authenticate
# follow instructions
aetros id
Key installed of account marcj (Marc J. Schmidt) on trainer.aetros.com

For more information see the Authentication chapter.

If you use an on-premises installation, you need to configure "host" in your home configuration, see Configuration chapter.

Create first job

To create your first job, create a new folder where your script files will be stored, or switch to your project folder where your scripts already are.
Then use the aetros init command to create a first model (in AETROS Trainer) and link the current folder with the specified model by placing an aetros.yml configuration file there. Both are done automatically by the aetros init command.

mkdir my-model
cd my-model
aetros init my-model
# or for an organisation
aetros init model-name -o orga-name
# or in a space
aetros init model-name -s space-name

You're now ready to start jobs on your local host that will appear under the "my-model" model in AETROS Trainer. Do this by using the aetros run command.

ls -al
-rw-r--r-- 1 marc staff 197B Feb 22 14:14 aetros.yml
aetros run --local 'echo Hi!'
1 files added (25 bytes)
Job marcj/my-model/3fc9c40104e51494635466a68fef0c7e6b4f9142 created.
Open http://trainer.aetros.com/model/marcj/my-model/job/3fc9c40104e51494635466a68fef0c7e6b4f9142 to monitor it.
Hi!

You can see that one file has been added to the job (because only the aetros.yml from the aetros init command is in there) and that a new job has been created. There is also a URL you can open to monitor the job.
If you open that link, you see AETROS Trainer in your browser and can click "LOGS" at the bottom to see the output of our command.

Execute on a server

If you want to start a job on a server, you first have to connect your existing server to AETROS Trainer. You can do this by executing aetros server on that server; see the Server Cluster chapter for more information.

Once connected, you see all connected servers with their resources (CPU, memory, GPU, etc.) in the cluster view of the AETROS Trainer interface ("Cluster" button, top left).

To start a job on any machine, just remove the --local argument from the aetros run or aetros start command.

aetros run 'echo Hi!'
1 files added (25 bytes)
Job marcj/my-model/3fc9c40104e51494635466a68fef0c7e6b4f9142 created.
Open http://trainer.aetros.com/model/marcj/my-model/job/3fc9c40104e51494635466a68fef0c7e6b4f9142 to monitor it.

A server is automatically chosen based on the required resources defined in the aetros.yml. If you want to limit the server assignment to one particular server, use the --server username/servername argument.
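For illustration, a resource requirement in aetros.yml might look like the sketch below. The key names (resources, cpu, memory, gpu) are assumptions based on this guide's description of resource-based server assignment; check the Configuration chapter for the authoritative schema.

```yaml
model: marcj/my-model

# Hypothetical resource requirements used for server assignment.
# Key names are assumptions - consult the Configuration chapter.
resources:
  cpu: 4        # number of CPU cores
  memory: 8     # GB of RAM
  gpu: 1        # number of GPU cards
```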

All commands have a --help argument so you can see all possible ways to execute them. Example: aetros run --help.
Since AETROS itself does not sell server hardware, you have to rent a cloud server or use your own machine to execute jobs.

Execute in Docker container

As you can see, executing a simple command like echo Hi! on a server is very easy. Everything gets stored, you can watch it in real time, and you can even restart it to reproduce it. So far so good, but what if you want to start a long-running script (maybe Python?) with a defined environment (Python 3 and TensorFlow, maybe)?

Well, to demonstrate this we take the simple route by using aetros run again.

aetros run --local --image=tensorflow/tensorflow \
  "python -c 'import tensorflow as tf; print(tf.__version__)'"

If you open the job in AETROS Trainer, you should see some information about the Docker image, and in the log viewer the TensorFlow version (which we printed with print(tf.__version__)).

If you specified a Docker image (using --image tensorflow/tensorflow or the configuration property image: tensorflow/tensorflow), you need to make sure Docker is installed on the machine the job is executed on. For example, if you use --local, the job is executed on your local machine, so if you have specified an image, you need to install Docker first. If you omit the --local argument, the job is started on a server (a machine where aetros server runs), where Docker needs to be installed as well.
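Before starting a Docker-based job, you can verify that Docker is installed and the daemon is reachable on the machine in question. This is a generic shell check, not an AETROS command:

```shell
# Check that the Docker CLI is installed and the daemon is reachable
if command -v docker >/dev/null 2>&1; then
  if docker info >/dev/null 2>&1; then
    echo "Docker daemon: reachable"
  else
    echo "Docker installed, but the daemon is not running"
  fi
else
  echo "Docker: NOT installed"
fi
```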

Use model configuration file

But you can also see that the command gets longer and longer. To keep things short, you can put any option in the aetros.yml configuration file of the current folder.

# placed by the aetros init command
model: marcj/my-model

# add this manually using a text editor
image: tensorflow/tensorflow
command: "python -c 'import tensorflow as tf; print(tf.__version__)'"

You can now create the same job by just executing aetros run in the folder where the aetros.yml is stored. You can override each option of the aetros.yml configuration using command arguments.

aetros run

The other advantage of having an aetros.yml is that you can start jobs from the AETROS Trainer interface with a couple of clicks, without touching the terminal again. Also, the CI integration needs this configuration file to know how to execute your job.

Use script files

So far, we haven't attached additional files to our job. A file can be a Python script, a shell script or some data (CSV files, small ZIP archives) you need to process in a command.

You attach new files to a job created by aetros run simply by placing them in the current working directory, since the run command attaches all files in the current working directory to the job. If you want to exclude some files from being uploaded and attached to a job, add them to the ignore configuration option in the aetros.yml configuration file.
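As a sketch, an ignore section in aetros.yml could look like this. The pattern syntax shown is an assumption; see the Configuration chapter for the supported syntax:

```yaml
model: marcj/my-model
command: python script.py

# Exclude files from being uploaded and attached to the job.
# Pattern style is an assumption - see the Configuration chapter.
ignore:
  - "*.zip"
  - "data/raw/"
  - ".git"
```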

Use Git repository

The aetros run command is designed to work with the current working directory and to sync every file from it to the starting Git tree of a job. This makes it possible to compare files between jobs without the struggle of first syncing a Git repository by adding Git commits every time you change a single line of code.

Alternatively, you can upload all files to the Git repository of the model, which you find in AETROS Trainer behind the "Clone" button in the model view.
You can, of course, use your own Git repository hosted on a server of your choice (GitHub, Bitbucket, ...) and link its URL in the "CODE" tab of the model view.
You start jobs from the source code specified in the "CODE" tab with the aetros start command.

cd my-model/
git add script.py
git commit -m 'changed xy'
git push origin master
# With the following command, the source from the Git repository is downloaded
# and the command specified in aetros.yml is started.
# Everything is executed on the local machine.
aetros start --local marcj/my-model

If you want to start a new job from a Git repository on a server, you can use the AETROS Trainer interface. Go to the model view and press the "+ JOB" button at the top right. Choose the hyperparameters, configuration and server. Press "Create" to create and start a new job immediately.

Use GPU

Local jobs, host-execution

Your job runs locally when you use the --local argument. If you have additionally NOT specified the image configuration (via the configuration value or the --image argument), your job runs directly on the local host (without Docker involved). In this case, your script has full access to all host hardware and you don't need to specify which GPU resources are required. The gpu configuration options are ignored.

Server jobs, host-execution

Your job runs on a cluster server when you do NOT use the --local argument. If you have additionally NOT specified the image configuration (via the configuration value or the --image argument), your job runs directly on a server (without Docker involved). In this case, your script has full access to all host hardware and you don't need to specify which GPU resources are required. The gpu configuration options are ignored.

Local jobs, Docker container

Your job runs locally when you use the --local argument. If you have additionally specified the image configuration (via the configuration value or the --image argument), your job runs on the local Docker engine (in a separate Docker container).

In this case, your script only has access to the hardware you configured. For local jobs in a Docker container, you need to specify the --gpu-device id argument (multiple allowed). --gpu-device 0 defines that you want to pass the first local GPU through to your Docker container. This works only on Linux hosts; on Windows and macOS, you have to fall back to host-execution (no Docker).

Example: aetros run --local --gpu-device 0 --gpu-device 1 --image tensorflow/tensorflow:gpu 'python gpu-training.py', which assigns the first and second GPU to the container. Use aetros gpu to see the ids necessary for --gpu-device. --gpu-device works only for local jobs using --local.

Server jobs, Docker container

Your job runs on a cluster server when you do NOT use the --local argument. If you have additionally specified the image configuration (via the configuration value or the --image argument), your job runs on the Docker engine (in a separate Docker container) of a cluster server.

In this case, your script only has access to the hardware you configured. For Docker-container jobs on a cluster server, you need to specify the --gpu count argument. --gpu 2 defines that you want to pass two free GPUs of the cluster server's GPU cards through to your Docker container. This works only on Linux hosts; on Windows and macOS, you have to fall back to host-execution (no Docker).

Example: aetros run --gpu 2 --image tensorflow/tensorflow:gpu 'python gpu-training.py', which assigns two GPU cards, whichever cards are currently free.

If you need more GPUs in your job, specify a higher number in --gpu number. However, make sure the server has sufficient GPUs installed. You can see how many GPUs are installed in the server view.

To set up an AETROS cluster server with GPU support, see the chapter GPU usage.