Monthly Archives: February 2015

Docker and Containers of Science.

A few years ago we thought that the best way to encapsulate software for deployment in the cloud was to create a virtual machine image and store it in a repository for Amazon images or the Microsoft VM depot. Then anybody could grab the image and spin it up on the appropriate data center. Unfortunately virtualization tools are not all the same, so running a VM from Amazon on Azure or another cloud was a real problem. I sat through endless debates with my science colleagues that usually went something like this: “This is just another form of evil vendor LOCK-IN!” We spent a great deal of time thinking about ways to address this evilness.

While this was going on, other folks realized that the Linux kernel had some cool features like namespace isolation and control groups that provide ways to bound the resource utilization of processes. By the middle of 2013 a little company called dotCloud released a tool that would provide a better way to deploy encapsulated applications, and this became Docker and Docker, Inc. Microsoft also figured out how to do the same thing with Windows, and they are said to be releasing a Docker-capable version of Windows Server soon.

What is Docker and why does it change the game? Docker allows applications to be provisioned in “containers” that encapsulate all application dependencies. The application sees a complete, private process space, file system and network interface that is isolated from applications in other containers on the same host operating system. Once the basic Docker framework and library are installed on the host OS, loading and bringing up a container stored in the Docker archive requires one command and takes only as long as it takes to download the container binary. Because a container image is so much smaller than a VM image, the download is very fast, and starting a new instance of a cached container image takes only a few seconds. Docker isolation provides a great way to factor large applications, and there are simple ways for running containers to communicate. (This is a great way to avoid the “jar hell” that was so common in big Java applications when different app components require libraries or environments that are in conflict.) It took only a few mouse clicks to install Boot2Docker on my little Microsoft Surface and another minute to spin up a container with the IPython notebook. (If you try this at home, note that Boot2Docker conflicts with Hyper-V, so you need to turn Hyper-V off.) To make life even easier, another new company called CoreOS has a stripped-down distribution of Linux that is designed for large-scale cloud deployment and has Docker built in.
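
To make the "one command" claim concrete, here is a minimal Python sketch that assembles a `docker run` command line. The `-d` (run in the background) and `-p` (publish a port) options are standard Docker CLI flags, but the helper function and the image name `example/ipython-notebook` are placeholders made up for illustration, not a real repository. The sketch only builds and prints the command; it does not start anything.

```python
import shlex


def docker_run_command(image, host_port, container_port, detached=True):
    """Build the argument list for a single `docker run` invocation."""
    cmd = ["docker", "run"]
    if detached:
        cmd.append("-d")  # run the container in the background
    # Publish the container's port on the host as host_port:container_port.
    cmd += ["-p", f"{host_port}:{container_port}"]
    cmd.append(image)
    return cmd


# "example/ipython-notebook" is a hypothetical image name (an assumption).
cmd = docker_run_command("example/ipython-notebook", 8888, 8888)
print(shlex.join(cmd))
```

On a machine where Docker is installed, passing the same list to `subprocess.run(cmd, check=True)` would pull the image if it is not already cached and start the container.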

Docker is designed to support a variety of distributed applications and it is now widely used in the “Internet industry” (including folks like Yelp, Spotify, Baidu, Yandex and eBay) and supported by the public cloud providers Google, Microsoft, Amazon and IBM. Loading CoreOS as a VM on Azure was easy and bringing up Docker instances on it was trivial.

Where’s the Science?

At this point many of you readers may say, “We know all this. Where have you been the last year?” The point of this post is to call out the impressive collection of scientific applications that have been containerized. My exploration of this began with a nice chat I had with two bioinformatics researchers who were interested in using a public cloud for their work. I asked them which tools they were using. They told me, and I thought there must be Docker containers for that software. Sure enough, they were there. In fact, new examples are appearing every day. Here is a short list that I found on the Docker repository.

  • Radio Astronomy tools including containers for LOFAR, pyimager and meqtrees.
  • Chemical Informatics libraries from the chemtoolbox.
  • Bioinformatics. The ever popular Galaxy toolkit is available in various forms. The University of Hamburg genome toolkit is also available.
  • For Mathematics and Statistics there is, of course, R and Python with numpy, etc., in various combinations.
  • For Machine Learning there is the complete collection of ML algorithms written in Julia, many versions of Spark, the Vowpal Wabbit tools and the scikit-learn Python tools.
  • For Geospatial data there is a container with geoserver.
  • For digital archival and data curation there are containers for dspace and irods.
  • The NSF iPlant consortium has developed the Agave science-as-a-service platform and the various components of that are now containerized.

Each of these allows you to spin up a running instance of the software in seconds on a Dockerized server like CoreOS or Ubuntu (after a Docker install) or Windows with Boot2Docker. So all the problems of science in the cloud are solved. Correct? Of course not. What if you want to run a cluster of Docker containers that are sharing your large workloads? Or run a big Spark deployment?

The challenge with Docker is managing multiple Azure servers each with multiple Docker containers.   There are four major open source solutions to this problem.

  • Mesos is a distributed systems kernel for managing large numbers of servers. Originally developed at the Berkeley AMPLab, it is now supported by a startup, Mesosphere.
  • Kubernetes is a technology released by Google and endorsed by Microsoft and others.   Kubernetes is designed to manage groups of containers. It allows users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on.
  • Weave is an overlay network for Docker that allows Docker clusters from different data centers or even different cloud vendors to be tied together.   Weave runs on Google and AWS and Weave has been ported to Azure by Patrick Chanezon.
  • Swarm is another tool developed by the Docker team to manage a cluster of containers. We know of no ports of Swarm to Azure.
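
To illustrate what "automatically chooses hosts" means in the Kubernetes description above, here is a toy placement routine in Python. It is not the algorithm that Kubernetes, Mesos or Swarm actually uses; it is a simple greedy sketch that puts each container on whichever host currently has the most free memory. The host names, container names and memory figures are all made up for the example.

```python
def place_containers(hosts, containers):
    """Greedy placement: each container goes to the host with the most free memory.

    hosts: dict mapping host name -> free memory in MB
    containers: list of (container name, memory needed in MB)
    Returns a dict mapping container name -> host name.
    """
    free = dict(hosts)  # work on a copy so the caller's dict is untouched
    placement = {}
    # Place the largest containers first so they are not squeezed out later.
    for name, need in sorted(containers, key=lambda c: c[1], reverse=True):
        host = max(free, key=lambda h: free[h])
        if free[host] < need:
            raise RuntimeError(f"no host has {need} MB free for {name}")
        free[host] -= need
        placement[name] = host
    return placement


# Hypothetical cluster: two VMs and a small Spark deployment.
hosts = {"vm-a": 4096, "vm-b": 3072}
containers = [
    ("spark-master", 1024),
    ("spark-worker-1", 2048),
    ("spark-worker-2", 2048),
]
placement = place_containers(hosts, containers)
print(placement)
```

The user only says which containers should run; the decision about where they run is left to the scheduler. Real cluster managers weigh many more things (CPU, ports, failures, data locality), but the division of labor is the same.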

These tools can be used together in various combinations.   In our next post we will report on our experience using these tools to build a cluster application.

One final note. Docker does not solve the entire vendor lock-in issue. There is still the data. Depending upon where you put big data collections, the laws of computational physics will draw your computing to it. At least Docker makes it easy for that computing to move.
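
A back-of-the-envelope calculation shows why the data pulls the computing toward it rather than the other way around. The figures below (a 100 TB collection and a sustained 1 Gbit/s link) are illustrative assumptions, not measurements from any particular cloud.

```python
def transfer_days(data_terabytes, link_gbit_per_s):
    """Days needed to move a data set over a link at a sustained rate."""
    bits = data_terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)  # Gbit/s -> bits per second
    return seconds / 86400                    # seconds -> days


# Assumed figures: 100 TB of data, 1 Gbit/s of sustained throughput.
days = transfer_days(100, 1)
print(f"{days:.1f} days")
```

Under those assumptions the data takes more than nine days to move, while a container image can be pulled to wherever the data lives in seconds or minutes.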

UPDATE:

Henrik Nordborg just pointed me to TheUberCloud, where a very large number of containerized science and engineering applications have been deployed on a number of cloud platforms. More details later when I have a better understanding of what they have done. What I see now looks very impressive.