A few hours ago, NVIDIA merged branch 2.0 of its GitHub repository NVIDIA/nvidia-docker into master with the claim "It is already more stable than 1.0 but we need help testing it", which means v1.0 is obsolete and replaced by the new version.
NVIDIA appears to be betting completely on the new implementation: it marks the old version as obsolete and is already closing issues specific to v1.0. NVIDIA-docker v1 is therefore no longer maintained, and users should switch to the new version as soon as possible.
According to their own README, the changes compared to version 1 are:
The biggest change in this release is that you no longer need to build your images on top of NVIDIA images. Concretely, you can run GPU-accelerated programs directly on, for example, the debian:stretch Docker image without installing the drivers inside the container.
On major Linux distributions such as Debian and Ubuntu, version 2 also provides the program
Although the new NVIDIA Docker runtime passes the GPU devices to your container and installs some driver files automatically, it does not yet provide a CUDA or cuDNN version inside that container. This is understandable, since many major CUDA versions are in production use by now and NVIDIA Docker doesn't seem to want to force one particular version into your container. You still have to install those libraries manually if you want to use them. In practice, it is currently probably still easier to pick nvidia/cuda as the base Docker image if you use CUDA in your programs.
With the old NVIDIA-Docker v1, your Docker container always had to be based on NVIDIA's nvidia/cuda image, since the driver and GPU libraries are already installed there.
You used to start containers with
The new way is to use your regular docker program and select the "nvidia" runtime.
You choose which GPUs are exposed with -e NVIDIA_VISIBLE_DEVICES="all" or individual devices like -e NVIDIA_VISIBLE_DEVICES="1". If you don't specify that environment variable, nvidia-docker won't provide a GPU inside that container.
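To make the new invocation concrete, here is a minimal Python sketch that assembles such a docker call as an argument list. The helper name docker_gpu_command, the debian:stretch image and the nvidia-smi command are only illustrative assumptions; the --runtime=nvidia flag and the NVIDIA_VISIBLE_DEVICES variable are the parts described above.

```python
import shlex


def docker_gpu_command(image, command, visible_devices="all"):
    """Build a `docker run` argument list that selects the "nvidia" runtime.

    `visible_devices` is passed through as NVIDIA_VISIBLE_DEVICES, e.g. "all"
    or a single index such as "1". This helper is hypothetical and only
    assembles the arguments; it does not start a container.
    """
    return [
        "docker", "run", "--rm",
        "--runtime=nvidia",
        "-e", f"NVIDIA_VISIBLE_DEVICES={visible_devices}",
        image,
        *command,
    ]


if __name__ == "__main__":
    argv = docker_gpu_command("debian:stretch", ["nvidia-smi"], visible_devices="1")
    # Print the command in a copy-and-paste friendly form.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The list form can be handed to subprocess.run as-is, which avoids shell quoting issues when the device list comes from user input.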
In order to use the new version, you should first remove the old one.
See Removing NVIDIA Docker 1.0.
Continue installing NVIDIA Docker 2 by following the step-by-step guide in our documentation: GPU usage.
All AETROS CLI commands and the AETROS cluster server are based entirely on NVIDIA Docker 2, so if you want to start jobs that use an NVIDIA GPU in Docker, you can simply start the aetros server on your GPU server and are ready to start jobs through AETROS Trainer or the command line tools like
To use GPU isolation (making a particular GPU available inside the Docker container using NVIDIA_VISIBLE_DEVICES), NVIDIA continues to identify GPUs by its own enumeration, which is based on the pciBusId, concretely the BDF notation.
The new runtime still sorts all your GPUs by that BDF notation and identifies each by its index. Since you don't see that ID when using nvidia-smi, we've built support for displaying it into our aetros gpu command.
With id=0 you see the ID to choose for the environment variable -e NVIDIA_VISIBLE_DEVICES=0. You can also see that ID in the server view in AETROS Trainer, e.g. "GPU0", "GPU1":
With AETROS you don't have to deal with NVIDIA_VISIBLE_DEVICES in any way, since we automatically manage the GPUs on your host and assign a GPU to each job container. See the documentation AETROS Server Cluster for more information.
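The index-by-BDF enumeration described above can be sketched in a few lines of Python. This only illustrates the ordering rule (sort by bus ID, then number from zero); the bus IDs in the example are made up, and the helper names parse_bdf and enumerate_gpus are hypothetical, not part of any NVIDIA or AETROS tool.

```python
def parse_bdf(bus_id):
    """Split a PCI bus ID 'domain:bus:device.function' into integers.

    All four fields are hexadecimal, so '0000:82:00.0' becomes (0, 130, 0, 0).
    """
    domain, bus, rest = bus_id.split(":")
    device, function = rest.split(".")
    return tuple(int(part, 16) for part in (domain, bus, device, function))


def enumerate_gpus(bus_ids):
    """Return {index: bus_id}, with indices assigned in ascending BDF order."""
    ordered = sorted(bus_ids, key=parse_bdf)
    return {index: bus_id for index, bus_id in enumerate(ordered)}


if __name__ == "__main__":
    # Hypothetical bus IDs, listed in arbitrary order.
    gpus = enumerate_gpus(["0000:82:00.0", "0000:03:00.0", "0000:04:00.0"])
    for index, bus_id in gpus.items():
        print(f"id={index} bus={bus_id}")
```

Sorting on the parsed integers rather than on the raw strings keeps the order stable even if tools report the domain with a different number of leading zeros.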
Today we are happy to announce the release of version 2017.5 with Docker support and an overhauled server management system called cluster.
In the past, you started jobs by first picking a server manually. We changed that: the job now defines what it needs (for example 4 GB RAM, 1 GPU) and we automatically find the best server in your cluster. For that, we introduced new configuration options under resources, where you can define how many CPU cores, how much memory, how many GPUs and how much GPU memory the job should reserve on the server. Using Docker containers, we can limit each of these. If your server has multiple GPUs installed, we make sure only the requested ones are passed through to the container.
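To illustrate the idea of a job stating its needs and the cluster picking a server, here is a small Python sketch. It is not our actual scheduler: the Resources and Server classes, the field names and the "least leftover" tie-breaking rule are all assumptions made for this example, and GPU memory is left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resources:
    cpu: int       # number of CPU cores
    memory: float  # memory in GB
    gpu: int       # number of GPUs


@dataclass
class Server:
    name: str
    free: Resources  # resources currently not reserved by other jobs


def fits(job: Resources, free: Resources) -> bool:
    """A job fits if every requested resource is available."""
    return free.cpu >= job.cpu and free.memory >= job.memory and free.gpu >= job.gpu


def pick_server(job: Resources, servers: List[Server]) -> Optional[Server]:
    """Pick the fitting server that would have the least left over.

    Assumed heuristic: prefer fewer spare GPUs, then less spare memory,
    then fewer spare cores, so big servers stay free for big jobs.
    """
    candidates = [s for s in servers if fits(job, s.free)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda s: (s.free.gpu - job.gpu, s.free.memory - job.memory, s.free.cpu - job.cpu),
    )


if __name__ == "__main__":
    cluster = [
        Server("cpu-box", Resources(cpu=8, memory=16, gpu=0)),
        Server("big-gpu", Resources(cpu=16, memory=64, gpu=4)),
        Server("small-gpu", Resources(cpu=4, memory=8, gpu=1)),
    ]
    chosen = pick_server(Resources(cpu=2, memory=4, gpu=1), cluster)
    print(chosen.name if chosen else "no server available")
```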
Each server connected to AETROS now provides resources that you can limit using the aetros server command. All servers together are called the "Cluster". You can see an overview of your cluster via the "Cluster" button in the top main bar.
Today we are happy to announce another very big release of AETROS Trainer and our platform, with a completely overhauled storage engine and an improved collaboration user interface.
Over the last months we received tons of feedback. One of the biggest requests was that companies and private organisations want to run their own AETROS Trainer in their own server infrastructure. Teams that work together on machine learning models often ask for a feature called "organisations", where you have a shared space of models, datasets and jobs, like you probably already know from software like GitHub.
We also now have power users creating tons of models and jobs each week. We found a solution for all of your wishes and are proud to present our work.
Another big topic for us is making all your data more accessible to you. We therefore moved our whole model and job storage engine to Git, using an open file structure instead of our internal MySQL database. This means all your model code and job information (metrics, used code, results, images, logs, etc.) now lives in a Git repository that you can easily clone and modify. This is not only better for scaling our application, it also keeps your experiment data and results in your hands at all times, accessible offline and usable with your Git client of choice.
Directly after our latest release, 2017.3, we want to inform you about the features coming in the next release.
We received tons of feedback on AETROS Trainer. One of the biggest requests was to provide the software on-premises, which is obviously important for bigger companies that want to keep everything in their own network. This was already on our list, but we are now moving it to top priority and scheduling the on-premises version for July 2017.
As you know, you can already browse and use community models on our platform. We want to extend that by also providing datasets and leaderboards for publicly available datasets. Knowing how well your algorithm compares to others is always interesting and an indicator of whether you are onto something big or failing hard. For beginners it's also very interesting to see which algorithm currently performs best on a certain dataset. We want to make the datasets you can already manage in AETROS Trainer available through our website, with a leaderboard where everyone can submit a tagged job of a model for that dataset to show their results publicly.
The next release is scheduled for end of July 2017.
Today we are happy to announce one of our biggest releases of AETROS Trainer and our platform, with tons of cool new features for all data scientists and machine learning engineers out there!
With this update we made tracking and organizing experiments easier than ever. You can now give an experiment a description that explains what is unique about it. This gives a better overview and helps you remember faster what was special about an experiment. If you need to reproduce results on different machines, the environment feature gives you a complete overview of the system variables and library versions you used. Additionally, you can upload your source files to AETROS and attach them to the experiment, so you can track source code changes and see the differences between versions directly in our new compare view.
Have you ever wondered what you changed in your source code, dataset or hyperparameters to get a particular result? Or why your new experiment gets much worse results one week later? Well, you could use Git to version everything and use a Git interface like GitHub to track changes (or use Excel). However, since this is cumbersome and we want to have everything in one place, we built an experiment comparison view that gives you a side-by-side view of all aspects of your experiments: progress, hyperparameters, additional information, metrics and even unified diffs of your source code. With this new feature, you can compare multiple experiments side by side and instantly see the differences, which helps you, for example, find the cause of a change in your models' performance much faster.
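For readers who have not worked with unified diffs before, this is roughly what such a source comparison looks like when produced with Python's standard difflib module. It is only an illustration of the diff format, not how the compare view is implemented; the function name source_diff and the file names are made up for the example.

```python
import difflib


def source_diff(old_source, new_source,
                old_name="experiment_1/train.py",
                new_name="experiment_2/train.py"):
    """Return the unified diff between two source strings as a list of lines."""
    return list(difflib.unified_diff(
        old_source.splitlines(),
        new_source.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # the inputs carry no line endings, so add none to the headers
    ))


if __name__ == "__main__":
    before = "lr = 0.01\nepochs = 10\n"
    after = "lr = 0.001\nepochs = 10\n"
    # Removed lines start with "-", added lines with "+".
    print("\n".join(source_diff(before, after)))
```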
We often spend our time watching the progress of a specific job, and yes, sometimes this is super exciting, but on other days we just want to know when a specific job is done so we can get the results. Sometimes we start a long-running experiment, 20 minutes later the RAM is full and the experiment crashes, but we only notice the next morning.
Therefore, we implemented notifications. You can now decide which models should trigger a notification by watching the model. We send an email or Slack message when an experiment has finished or failed.
When working with long-running experiments, it's very important to get a feeling for the overall computation performance, so you can estimate an ETA.
Also, every time you change your architecture or training data you may run into performance penalties. Since computation is expensive and you usually don't want to wait weeks for a training job, you need to keep an eye on those stats. We have improved the progress tracking of a training job even further. You can now see not only epochs (using job.progress(current, total)) but also the batch progress (using the new method job.batch(current, total, [size])), which acts as a sub-progress of the regular epoch/iteration tracking. As before, we automatically calculate an ETA for you, plus samples/s if you pass size.
You can now also get a better overview of the used hyperparameters and of additional information that you can freely set using job.set_info(key, value) to enrich the job. We also added a loss tracking graph that indicates whether your model overfits or underfits.
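To give an idea of how an ETA and a samples/s figure can be derived from batch progress calls, here is a minimal Python sketch. It is a stand-in written for this post and not the AETROS SDK: the ProgressTracker class is hypothetical, only the batch(current, total, size) call shape mirrors the method described above, and the linear extrapolation is an assumption about how such an ETA might be estimated.

```python
import time


class ProgressTracker:
    """Hypothetical stand-in that derives ETA and samples/s from batch calls."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = clock()
        self._samples = 0
        self.eta_seconds = None
        self.samples_per_second = None

    def batch(self, current, total, size=None):
        """Record that `current` of `total` batches are done.

        `size` is the number of samples in the batch just finished; without
        it no samples/s figure can be computed.
        """
        elapsed = self._clock() - self._started
        if current > 0 and elapsed > 0:
            # Linear extrapolation: average time per batch times batches left.
            self.eta_seconds = elapsed / current * (total - current)
        if size is not None:
            self._samples += size
            if elapsed > 0:
                self.samples_per_second = self._samples / elapsed


if __name__ == "__main__":
    tracker = ProgressTracker()
    for step in range(1, 6):
        time.sleep(0.01)  # pretend to train on one batch
        tracker.batch(step, 5, size=32)
    print(tracker.eta_seconds, tracker.samples_per_second)
```

The clock is injectable so the arithmetic can be checked with fixed timestamps instead of real sleeps.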
We just published one of our newest and biggest features: model optimizations. This allows you to automatically find hyperparameters based on the KPI of any Python model, no matter which framework you use. Basic idea:
learning_rate from 0.005 to 0.5)
In our user interface AETROS Trainer you can start as many optimizations as you want, watch their progress, adjust hyperparameter spaces, and compare or export the results.
See our main documentation: Automatic hyperparameter optimization.
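The general principle of searching a hyperparameter space for the best KPI can be sketched with a plain random search in Python. This is a deliberately simple technique chosen for illustration; this post does not describe which search strategy the optimization feature uses, and the names random_search and toy_kpi as well as the toy objective are assumptions.

```python
import random


def random_search(objective, space, trials=20, seed=0):
    """Sample hyperparameters uniformly from `space` and keep the best KPI.

    `space` maps a hyperparameter name to a (low, high) range, and
    `objective` returns a KPI where higher is better.
    """
    rng = random.Random(seed)
    best_params, best_kpi = None, float("-inf")
    for _ in range(trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        kpi = objective(params)
        if kpi > best_kpi:
            best_params, best_kpi = params, kpi
    return best_params, best_kpi


def toy_kpi(params):
    """Toy objective standing in for a real training run; it peaks at 0.1."""
    return -abs(params["learning_rate"] - 0.1)


if __name__ == "__main__":
    space = {"learning_rate": (0.005, 0.5)}
    params, kpi = random_search(toy_kpi, space, trials=50)
    print(params, kpi)
```

In a real setup the objective would train the model with the sampled values and return its KPI, which is why every trial is expensive and the number of trials matters.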
You can now define hyperparameters in an even more detailed way. You can choose between seven types: String, Number, Boolean, Group (dict), Choice: String (selectbox), Choice: Number (selectbox), Choice: Group (select group). And of course, you can overwrite those hyperparameters per job.
With our improved jobs browser you can now create your own job categories, export jobs as CSV and see continuous integration builds. If you hover over a particular job, you now additionally see all used hyperparameters and custom information. Thanks to the new pagination, you can browse hundreds or thousands of jobs without performance issues.
You can now connect external servers to AETROS using aetros-cli, so you can start and monitor training jobs across multiple external servers with just one command.
This makes it super easy to distribute your training jobs across multiple servers without using ssh and starting each job manually.
More documentation about this feature can be found at Server Cluster.
We are excited to announce that we won the NVIDIA Inception "cool demo" contest! The prize is a brand new NVIDIA Pascal Titan X, which we will definitely use to train a lot of hot new deep learning models. Thank you very much, NVIDIA!