A few years ago we thought that the best way to encapsulate software for deployment in the cloud was to create a virtual machine image and store it in a repository such as Amazon's image catalog or the Microsoft VM Depot. Then anybody could grab the image and spin it up in the appropriate data center. Unfortunately, virtualization tools are not all the same, so running a VM built for Amazon on Azure or another cloud was a real problem. I sat through endless debates with my science colleagues that usually went something like this: “This is just another form of evil vendor lock-in!” We spent a great deal of time thinking about ways to address this evilness.
While this was going on, other folks realized that the Linux kernel had some useful features, such as namespace isolation and control groups, that provide ways to bound the resource utilization of processes. By the middle of 2013 a little company called dotCloud had released a tool that provided a better way to deploy encapsulated applications; this became Docker and Docker, Inc. Microsoft has also figured out how to do the same thing with Windows and is said to be releasing a Docker-capable version of Windows Server soon.
What is Docker and why does it change the game? Docker allows applications to be provisioned in “containers” that encapsulate all application dependencies. The application sees a complete, private process space, file system, and network interface that is isolated from applications in other containers on the same host operating system. Once the basic Docker framework and libraries are installed on the host OS, loading and bringing up a container stored in the Docker registry requires a single command and takes only as long as the container binary takes to download. Because container images are so much smaller than VM images, the download is very fast, and starting a new instance of a cached container image takes only a few seconds. Docker isolation provides a great way to factor large applications, and there are simple ways for running containers to communicate. (This is a great way to avoid the “jar hell” that was so common in big Java applications when different app components required libraries or environments that were in conflict.) It took only a few mouse clicks to install Boot2Docker on my little Microsoft Surface and another minute to spin up a container with the IPython notebook. (If you try this at home, note that Boot2Docker conflicts with Hyper-V, so you need to turn Hyper-V off.) To make life even easier, another new company, CoreOS, has built a stripped-down Linux distribution designed for large-scale cloud deployment with Docker built in.
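As a sketch of the “single command” workflow described above: assuming a host with Docker installed, and using `ipython/notebook` and port 8888 purely as illustrative assumptions, the whole fetch-and-start step looks like this.

```shell
# Fetch and start a container in one step; Docker downloads the image
# from the registry only if it is not already cached locally.
# (Image name and port mapping are illustrative assumptions.)
docker run -d -p 8888:8888 ipython/notebook

# List running containers to confirm the notebook is up.
docker ps
```

On a Boot2Docker setup the notebook would then be reachable at the Boot2Docker VM's address on port 8888; a second `docker run` of the same image starts from the local cache in seconds.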
Docker is designed to support a variety of distributed applications. It is now widely used in the “Internet industry” (by folks like Yelp, Spotify, Baidu, Yandex, and eBay) and is supported by the public cloud providers Google, Microsoft, Amazon, and IBM. Loading CoreOS as a VM on Azure was easy, and bringing up Docker instances on it was trivial.
Where’s the Science?
At this point many of you readers may say, “We know all this. Where have you been the last year?” The point of this post is to call out the impressive collection of scientific applications that have been containerized. My exploration began with a nice chat with two bioinformatics researchers who were interested in using a public cloud for their work. I asked them which tools they were using. When they told me, I thought there must be Docker containers for that software. Sure enough, there were. In fact, new examples are appearing every day. Here is a short list that I found in the Docker repository.
- Radio astronomy tools, including containers for LOFAR, pyimager, and MeqTrees.
- Chemical informatics libraries from the chemtoolbox.
- Bioinformatics: the ever-popular Galaxy toolkit is available in various forms, and the University of Hamburg genome toolkit is also available.
- For mathematics and statistics there are, of course, R and Python with NumPy, etc., in various combinations.
- For machine learning there are the complete collection of ML algorithms written in Julia, many versions of Spark, the Vowpal Wabbit tools, and the scikit-learn Python tools.
- For geospatial data there is a container with GeoServer.
- For digital archival and data curation there are containers for DSpace and iRODS.
- The NSF iPlant consortium has developed the Agave science-as-a-service platform, and its various components are now containerized.
Each of these allows you to spin up a running instance of the software in seconds on a Dockerized server such as CoreOS, Ubuntu (after a Docker install), or Windows with Boot2Docker. So all the problems of science in the cloud are solved, correct? Of course not. What if you want to run a cluster of Docker containers that share a large workload, or run a big Spark deployment?
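To make “spin up a running instance in seconds” concrete, here is a minimal sketch for one of the bioinformatics entries in the list; the image name `bgruening/galaxy-stable` and the port mapping are assumptions for illustration, not something the list above specifies.

```shell
# Launch a containerized Galaxy instance on a Dockerized host.
# (Image name and ports are illustrative assumptions; check the image's
# own documentation for the mapping it actually expects.)
docker run -d -p 8080:80 bgruening/galaxy-stable
```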
The challenge with Docker is managing multiple Azure servers, each running multiple Docker containers. There are four major open-source solutions to this problem.
- Mesos is a distributed systems kernel for managing large numbers of servers. Originally developed as part of the Berkeley AMPLab, it is now supported by a startup, Mesosphere.
- Kubernetes is a technology released by Google and endorsed by Microsoft and others. Kubernetes is designed to manage groups of containers: it lets users ask a cluster to run a set of containers, and the system automatically chooses hosts to run those containers on.
- Weave is an overlay network for Docker that allows Docker clusters from different data centers or even different cloud vendors to be tied together. Weave runs on Google and AWS and Weave has been ported to Azure by Patrick Chanezon.
- Swarm is another tool developed by the Docker team to manage a cluster of containers. We know of no ports of Swarm to Azure.
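To make the Kubernetes item above concrete, here is a hedged sketch of asking a cluster to run a set of containers; the name, image, and replica count are placeholder assumptions, and the exact flags depend on the kubectl version installed.

```shell
# Ask the cluster to keep three replicas of a container running;
# Kubernetes chooses which hosts they land on.
# (Name, image, and replica count are illustrative assumptions.)
kubectl run notebook --image=ipython/notebook --replicas=3

# Inspect where the system placed the containers.
kubectl get pods -o wide
```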
These tools can be used together in various combinations. In our next post we will report on our experience using these tools to build a cluster application.
One final note. Docker does not solve the entire vendor lock-in issue. There is still the data. Depending upon where you put big data collections, the laws of computational physics will draw your computing to them. At least Docker makes it easy for that computing to move.
Henrik Nordborg just pointed me to TheUberCloud, where a very large number of containerized science and engineering applications have been deployed on a number of cloud platforms. More details later, when I have a better understanding of what they have done. What I see now looks very impressive.
Thanks for an informative post, Dennis. You mentioned bioinformatics … do you know of a repository of containerized bioinformatics applications? I’m wondering about the available breadth of apps but couldn’t find such a repository after a quick search…
Update: I did find this: https://github.com/BioDocker?page=1
Thanks for this.
The links in the blog include the Galaxy toolkit, which has a large number of standard bioinformatics components, and also the Hamburg genome toolkit. However, if there are specific bioinformatics tools you are looking for, give us a list and we can help look.
Very nice post. That is a very exciting problem; I am currently writing my master’s thesis on that topic: how to run distributed computing frameworks with Docker. Is there any science application that is particularly needed? I would like to develop and show one use case.
Yes, the distributed computing case with Docker is very interesting, and there are many science applications. The easiest one is to use Spark with a machine learning case — for example, a big k-means clustering example. Spark has already been ported to Docker clusters using Mesos. Here are two more science apps. 1. A metagenomics example: use BLAST on a large number of sequences in parallel. This is embarrassingly parallel because it requires no communication except at the end, to gather results. 2. At the opposite end of the spectrum, you could do something like computational fluid dynamics. That example requires constant inter-process communication. A well-known lesson of parallel computing is that the balance of communication and computation is critical to performance.
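The embarrassingly parallel BLAST case above can be sketched with plain Docker: each chunk of input sequences is searched in its own container, with no communication between containers until the end, when results are gathered. The image name `biodckr/blast` and the file layout are assumptions for illustration.

```shell
# Run one BLAST container per chunk of sequences, all in parallel.
# (Image name, database path, and file layout are illustrative assumptions.)
for chunk in seqs/chunk_*.fasta; do
  docker run --rm -v "$PWD":/data biodckr/blast \
    blastn -query "/data/$chunk" -db /data/db/nt -out "/data/$chunk.out" &
done
wait    # the only synchronization point: gather results from seqs/*.out
```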
Did you do any experiments with a use case similar to the fluid dynamics one? I am wondering how much overhead the Weave overlay network adds.
I have not done any CFD cases, but TheUberCloud has done a lot of those. Looking at the costs of clusters of containers is a great project. I suspect the overhead of Weave may be high, but I don’t know yet.