NVIDIA-Docker version 2 released

14. November 2017 - Machine Learning

A few hours ago, NVIDIA merged branch 2.0 into master in its GitHub repository NVIDIA/nvidia-docker, with the note "It is already more stable than 1.0 but we need help testing it". This means v1.0 is obsolete and replaced by the new version.

It seems NVIDIA is betting completely on the new implementation: the old version is marked as obsolete and issues specific to v1.0 are already being closed. This obviously means that NVIDIA-Docker v1 is no longer maintained and users should switch to the new version as soon as possible.

According to their own README, the changes compared to version 1 are:

  • Doesn't require wrapping the Docker CLI and doesn't need a separate daemon
  • GPU isolation is now achieved with environment variable `NVIDIA_VISIBLE_DEVICES`
  • Can enable GPU support for any Docker image. Not just the ones based on our official CUDA images
  • Package repositories are available for Ubuntu and CentOS
  • Uses a new implementation based on libnvidia-container

Notes about the release

The biggest change coming with this release is that you no longer need to base your images on NVIDIA images. Concretely, that means you can run GPU-accelerated programs directly on, for example, the debian:stretch Docker image without installing the drivers inside the container. On major Linux distributions like Debian and Ubuntu, version 2 also makes the nvidia-smi program available inside the container automatically.

No CUDA/cuDNN installed

Although the new NVIDIA Docker runtime passes the GPU devices to your container and installs some driver files automatically, it does not yet provide a CUDA or cuDNN version inside that container. This is understandable, since there are by now a lot of major CUDA versions in production use and NVIDIA Docker apparently does not want to force one particular version on your container. You still have to install those libraries manually if you want to use them. That basically means it is currently probably still easier to just pick nvidia/cuda as the base Docker image if your programs use CUDA.
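
To make the difference concrete, here is a minimal sketch (the image tags are just examples and may need adjusting to the CUDA version you target):

# A plain image gets the driver files injected, but no CUDA toolkit:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm debian:stretch \
    sh -c 'nvidia-smi; ls /usr/local/cuda'   # the ls fails: CUDA is not installed

# NVIDIA's official images already ship the CUDA toolkit (and cuDNN in the cudnn variants):
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda:8.0-cudnn6-devel \
    nvcc --version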

Comparison

With the old NVIDIA-Docker v1, your Docker container always had to be based on NVIDIA's nvidia/cuda image, since the driver and GPU libraries are already installed there. You used to start containers with nvidia-docker:

# "nvidia/cuda" Works NV_GPU='0,1' nvidia-docker run -it nvidia/cuda nvidia-smi # "debian:stretch" doesn't work NV_GPU='0,1' nvidia-docker run -it debian:stretch nvidia-smi

The new way is to use your regular docker command and specify the "nvidia" runtime:

# "nvidia/cuda" works as usual docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES='0,1' --rm nvidia/cuda nvidia-smi # "debian:stretch" works as well now docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES='0,1' --rm debian:stretch nvidia-smi
Note that not all Linux distributions are supported as guests. We tested Ubuntu and Debian, but Alpine, for example, does not seem to be supported yet.

The "nvidia" runtime only passes GPU devices through if you set -e NVIDIA_VISIBLE_DEVICES="all" or a specific device like -e NVIDIA_VISIBLE_DEVICES="1". If you don't set that environment variable, nvidia-docker won't provide any GPU inside the container.

Using NVIDIA Docker v2

In order to use the new version, you should first remove the old one.
See Removing NVIDIA Docker 1.0.

Continue by installing NVIDIA Docker 2 following the step-by-step guide in our documentation: GPU usage.

Official pre-built packages are available only for CentOS and, in the case of Ubuntu, only for version 16.04. See the official GitHub repository: NVIDIA Docker Engine wrapper repository.
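
As a rough sketch, on Ubuntu 16.04 the installation typically boils down to the following, assuming the NVIDIA package repository is already configured as described in the guides above:

# Install the new package from NVIDIA's repository
sudo apt-get update
sudo apt-get install -y nvidia-docker2

# Let the Docker daemon pick up the newly registered "nvidia" runtime
sudo pkill -SIGHUP dockerd

# Verify that the runtime works
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all --rm nvidia/cuda nvidia-smi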

All AETROS CLI commands and the AETROS cluster server are completely based on NVIDIA Docker 2. If you want to start jobs that utilize an NVIDIA GPU in Docker, you can simply start aetros server on your GPU machine and are then ready to launch jobs through AETROS Trainer or command line tools like aetros run.

Identify GPU

To use GPU isolation (making a particular GPU available inside the Docker container via NVIDIA_VISIBLE_DEVICES), NVIDIA continues to identify GPUs by its own enumeration based on the pciBusId, concretely the BDF notation. The new runtime still sorts all your GPUs by that BDF notation and identifies each one by its index. Since nvidia-smi does not show that ID, we added support to our AETROS CLI command aetros gpu to display it.

aetros gpu
CUDA version: 8000
0000:00:04.0 GPU id=0 Tesla P100-PCIE-16GB (memory 15.89GB, free 15.60GB)
0000:00:05.0 GPU id=1 Tesla P100-PCIE-16GB (memory 15.89GB, free 15.60GB)
0000:00:06.0 GPU id=2 Tesla P100-PCIE-16GB (memory 15.89GB, free 15.60GB)
0000:00:07.0 GPU id=3 Tesla P100-PCIE-16GB (memory 15.89GB, free 15.60GB)

The id=0 in this output is the ID to use for the environment variable -e NVIDIA_VISIBLE_DEVICES=0. You can also see that ID in the server view in AETROS Trainer, e.g. "GPU0", "GPU1".
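
For example, to pass only the GPU listed as id=2 above into a container (a minimal sketch, the image is just an example):

# Only the GPU shown as "id=2" by aetros gpu is visible inside this container
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2 --rm nvidia/cuda nvidia-smi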

If you use the AETROS cluster server, you do not need to set NVIDIA_VISIBLE_DEVICES at all, since we manage the GPUs on your host and automatically assign a GPU to each job container. See the documentation AETROS Server Cluster for more information.

Version 2017.5 released

9. November 2017 - Releases

Today we are happy to announce the release of version 2017.5 with Docker support and an overhauled server management system called cluster.

New features

  1. Feature: Cluster
  2. Feature: Docker support
  3. Updated: Improved UI/UX with many changes
  4. Updated: Major AETROS SDK v0.11.0

We updated our documentation and added a lot of new information regarding the new workflow. Make sure to start with the Getting Started guide, which covers most of the new features.

Cluster

In the past, you started jobs by manually picking a server beforehand. We changed that: the job now defines what it needs (for example 4 GB RAM, 1 GPU) and we automatically find the best server in your cluster. For this, we introduced new configuration options in the aetros.yml called resources, where you can define how many CPU cores, how much memory, how many GPUs and how much GPU memory the job should reserve on the server. Using Docker containers, we can limit each of these. If your server has multiple GPUs installed, we make sure only the requested ones are passed through to the container.

Each server connected to AETROS now provides resources that you can limit using the aetros server command. All servers together are called the "Cluster". An overview of your cluster can be found under the "Cluster" button in the top main bar.

Read more ...

Version 2017.4 released

2. September 2017 - Releases

Today we are happy to announce another very big release of AETROS Trainer and our platform, with a completely overhauled storage engine and an improved collaboration user interface.

Changes at a glance

Over the last months we received tons of feedback. One of the biggest requests was that companies and private organisations want to run their own AETROS Trainer in their own server infrastructure. Teams that work together on machine learning models frequently ask for a feature called "organisations", where you have a shared space of models, datasets and jobs - as you probably already know from software like GitHub. We also have power users by now who create tons of models and jobs every week. We found a solution for all of these wishes and are proud to present our work.

Another big topic for us is making all your data more accessible to you. We therefore moved our whole model and job storage engine to Git, using an open file structure instead of our internal MySQL database. This means all your model code and job information (metrics, used code, results, images, logs, etc.) now lives in a Git repository that you can easily clone and modify. This not only scales better for our application, it also keeps your experiment data and results in your own hands, making them accessible offline and usable with your Git client of choice.

Read more ...

Road to version 2017.4

13. June 2017 - Releases

Directly after our latest release 2017.3 we want to inform you about the features coming in the next release.

On-Premises solution

We received tons of feedback for our AETROS Trainer. One of the biggest requests was to provide the software on-premises, which is obviously important for bigger companies that want to keep everything in their own network. We already had this on our list, but have now moved it to top priority and scheduled the on-premises version for July 2017.

Community Datasets

As you know, you can already browse and use community models on our platform. We want to extend that by providing datasets and leaderboards for publicly available datasets as well. Knowing how your algorithm compares to others is always interesting and an indicator of whether you are onto something big or failing hard. Also, for beginners it is super interesting to see which algorithm currently performs best on a certain dataset. We want to make the datasets you can already manage in AETROS Trainer available through our website, with a leaderboard where everyone can submit a tagged job of a model for that dataset to show his/her results publicly.

The next release is scheduled for end of July 2017.

Version 2017.3 released

12. June 2017 - Releases

Today we are happy to announce one of our biggest releases of AETROS Trainer and our platform, with tons of cool new features for all data scientists and machine learning engineers out there!

Changes at a glance

With this update we made tracking and organizing experiments easier than ever. You can now give an experiment a description that captures what makes it unique. This gives a better overview and helps you remember faster what was special about an experiment. If you need to reproduce results on different machines, the environment feature gives you a complete overview of the system variables and library versions you used. Additionally, you can upload your source files to AETROS and attach them to the experiment, so you can track source code changes and directly see differences between versions using our new compare view.

Read more ...

Version 2017.3: Experiment comparison

12. June 2017 - Feature

Have you ever wondered what you changed in your source code, dataset or hyperparameters to get a particular result? Or why one week later your new experiment gets much worse results? Well, you could use Git to version everything and use a Git interface like GitHub to track changes (or use Excel). However, since this is cumbersome and we want to have everything in one place, we built an experiment comparison view that gives you a side-by-side view of all aspects of your experiments: progress, hyperparameters, additional information, metrics and even unified diffs of your source code. With this new feature you can compare multiple experiments side by side and instantly see differences, which will help you, for example, to find the cause of changes in the performance of your models much faster.

Read more ...

Version 2017.3: Experiment notifications

12. June 2017 - Feature

We often invest our time watching the progress of a specific job, and yes, sometimes this is super exciting, but on other days we just want to know when a specific job is done so we can get the results. Sometimes we start a long-running experiment, 20 minutes later the RAM is full and the experiment crashes, but we only notice it the next morning.

Therefore, we implemented a notification function. You can now decide which models should trigger notifications by watching them. We send an email or Slack message when an experiment has finished or failed.

Read more ...

Better job tracking

15. May 2017 - Feature

When working with long-running experiments, it is very important to get a feeling for the overall computation performance so you can estimate an ETA. Also, every time you change your architecture or training data, you may run into performance penalties. Since computation is expensive and you usually don't want to wait weeks for a training job, you need to keep an eye on those stats. We have now improved the overall tracking of a training job's progress even further. You can not only see epochs (using job.progress(current, total)) but also the batch progress (using the new method job.batch(current, total, [size])), which acts as a sub-progress of the regular epoch/iteration tracking. As before, we automatically calculate an ETA for you, and samples/s if you pass size to job.batch.

You can now also get a better overview of the used hyperparameters and of additional information you can freely set using job.set_info(key, value) to enrich the job. We also added a loss tracking graph that indicates whether your model overfits or underfits.

Automatic hyperparameter optimization: Easier than ever

24. February 2017 - Feature

We just published one of our newest and biggest features: model optimizations. This allows you to automatically find hyperparameters based on the KPI of any Python model, no matter which framework you use. The basic idea:

  1. You define hyperparameters (for example learning_rate)
  2. You define their spaces (for example learning_rate from 0.005 to 0.5)
  3. We start the training script of your model several times with hyperparameters within the given space
  4. In your model you send us your KPI (accuracy, for example)
  5. We determine which hyperparameters performed well and which didn't
  6. We calculate further hyperparameters and automatically start as many training runs as you want (on request distributed in parallel across multiple servers)

In our user interface AETROS Trainer you can start as many optimizations as you want, watch their progress, adjust hyperparameter spaces, and compare or export the results.

Features

  • Automatic search of better hyperparameters
  • Three different optimization algorithms (Random, TPE, Annealing)
  • Automatic training job distribution across multiple servers
  • Very detailed and convenient hyperparameter space definition through the interface
  • Runtime constraints like max epochs and max time
  • Watch the process, results and metrics in real-time in AETROS Trainer
  • Completely based on Git
  • All results can be exported as CSV

Full documentation

See our main documentation: Automatic hyperparameter optimization.

Improved hyperparameters

23. February 2017 - Feature

You can now define hyperparameters in an even more detailed way. You can choose between seven types: String, Number, Boolean, Group (dict), Choice: String (selectbox), Choice: Number (selectbox), Choice: Group (select group). And of course, you can overwrite those hyperparameters per job.

Read more ...

Better jobs browser

20. February 2017 - Feature

With our improved jobs browser you can now create your own job categories, export jobs as CSV and see continuous integration builds. If you hover over a particular job, you now additionally see all used hyperparameters and custom information. Thanks to pagination, you can now browse hundreds or thousands of jobs without performance issues.

Read more ...

New feature: External servers / job scheduler

25. January 2017 - Feature

You can now connect external servers to AETROS using aetros-cli, so you can start and monitor training jobs across multiple external servers with just one command. This makes it super easy to distribute your training jobs across multiple servers without using SSH (and starting each job manually).

More documentation about this feature can be found at Server Cluster.

New Website, new version and won NVIDIA contest

25. January 2017 - Company

We are excited to announce that we won the NVIDIA Inception "cool demo" contest! The prize is a brand new NVIDIA Pascal Titan X, which we will definitely use to train a lot of hot new deep learning models. Thank you very much, NVIDIA!

Read more ...