
Science Applications of Generative Neural Networks

Machine learning is a common tool used in all areas of science. Applications range from simple regression models used to explain the behavior of experimental data to novel applications of deep learning. One area that has emerged in the last few years is the use of generative neural networks to produce synthetic samples of data that fit the statistical profile of real data collections. Generative models are among the most interesting deep neural networks and they abound with applications in science. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. But when one looks at the AI literature on generative models, one can come away with the impression that they are, at best, amazing mimics that can conjure up pictures that look like the real world but are, in fact, pure fantasy. So why do we think that they can be of value in science? There are several reasons one would want to use them. One reason is that the alternative way to understand nature may be based on a simulation that is extremely expensive to run. Simulations are based on the mathematical expression of a theory about the world. And theories are often loaded with parameters, some of which may have known values and others we can only guess at. Given these guesses, the simulation is the experiment: does the result look like our real-world observations? On the other hand, generative models have no explicit knowledge of the theory, but they do an excellent job of capturing the statistical distribution of the observed data. Mustafa Mustafa from LBNL states,

“We think that when it comes to practical applications of generative models, such as in the case of emulating scientific data, the criterion to evaluate generative models is to study their ability to reproduce the characteristic statistics which we can measure from the original dataset.” (from Mustafa et al., arXiv:1706.02390v2 [astro-ph.IM], 17 Aug 2018)

Generative models can be used to create “candidates” that we can use to test and fine-tune instruments designed to capture rare events. As we shall see, they have also been used to create ‘feasible’ structures that can inform us about possibilities that were not predicted by simulations. Generative models can also be trained to generate data associated with a class label, and they can be effective in eliminating noise. As we shall see, this can be a powerful tool in predicting outcomes when the input data is somewhat sparse, such as when medical records have missing values.

Flavors of Generative Models

There are two main types of GMs and, within each type, there are dozens of interesting variations. Generative Adversarial Networks (GANs) consist of two networks, a discriminator and a generator (the bottom part of Figure 1 below). Given a training set of data, the discriminator is trained to distinguish between the training set data and fake data produced by the generator. The generator is trained to fool the discriminator. This eventually yields a generator that can create data closely matching the distribution of the samples. The second family are autoencoders. Again, this involves two networks (top in the figure below). One is designed to encode the sample data into a low dimensional space. The other is a decoder that takes the encoded representation and attempts to recreate the original sample. A variational autoencoder (VAE) is one that forces the encoded representations to fit into a distribution that looks like the unit Gaussian. In this way, samples from this compact distribution can be fed to the decoder to generate new samples.

var_and_gan.JPG

Figure 1.  Autoencoder (top) and generative adversarial network (bottom), the two main families of generative models.

Most examples of generative networks that are commonly cited involve the analysis of 2-D images based on two opposing convolutional (or similar) networks.  But this need not be the case (see “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space” by Anh Nguyen et al., arXiv:1612.00005v2 [cs.CV] 12 Apr 2017).

One fascinating science example we will discuss in greater detail later is by Shahar Harel and Kira Radinsky.  Shown below (Figure 2), it is a hybrid of a variational autoencoder with a convolutional encoder and a recurrent neural network decoder for generating candidate chemical compounds.

VAE-with-lstm

Figure 2.  The hybrid variational autoencoder (convolutional encoder, LSTM decoder) from Shahar Harel and Kira Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” (http://kiraradinsky.com/files/acs-accelerating-prototype.pdf).

Physics and Astronomy

Let’s start with some examples from physics and astronomy.

In statistical mechanics, Ising models provide a theoretical tool to study phase transitions in materials. The usual approach to study the behavior of this model at various temperatures is via Monte Carlo simulation. Zhaocheng Liu, Sean P. Rodrigues and Wenshan Cai from Georgia Tech show, in their paper “Simulating the Ising Model with a Deep Convolutional Generative Adversarial Network” (arXiv:1710.04987v1 [cond-mat.dis-nn] 13 Oct 2017), that a GAN can learn to produce Ising configurations. The Ising states they generate from their network faithfully replicate the statistical properties of those generated by simulation but are also entirely new configurations not derived from previous data.
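
For readers unfamiliar with the conventional approach the GAN replaces, here is a minimal sketch (not the authors’ code) of Metropolis Monte Carlo sampling for a 2-D Ising model; the lattice size, temperature and sweep count are arbitrary illustration values.

import numpy as np

def metropolis_ising(L=32, T=2.27, sweeps=200, rng=np.random.default_rng(0)):
    """Generate a 2-D Ising configuration by Metropolis Monte Carlo sampling."""
    spins = rng.choice([-1, 1], size=(L, L))
    for _ in range(sweeps):
        for _ in range(L * L):
            i, j = rng.integers(0, L, size=2)
            # Energy change from flipping spin (i, j), with periodic boundaries.
            neighbors = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j] +
                         spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            dE = 2.0 * spins[i, j] * neighbors
            # Accept the flip with the Metropolis probability.
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                spins[i, j] = -spins[i, j]
    return spins

sample = metropolis_ising()
print("magnetization per spin:", sample.mean())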

Astronomy is a topic that lends itself well to applications of generative models. Jeffrey Regier et al., in “Celeste: Variational inference for a generative model of astronomical images”, describe a detailed multi-level probabilistic model that considers both the different properties of stars and galaxies and the photons recorded at each pixel of the image. The purpose of the model is to infer the properties of the imaged celestial bodies. The approach is based on a variational computation similar to the VAEs described below, but far more complex in terms of the number of different modeled processes. In “Approximate Inference for Constructing Astronomical Catalogs from Images” (arXiv:1803.00113v1 [stat.AP] 28 Feb 2018), Regier and collaborators take on the problem of building catalogs of objects in thousands of images. For each imaged object there are 9 different classes of random variables that must be inferred. The goal is to compute the posterior distribution of these unobserved random variables conditional on a collection of astronomical images. They formulated a variational inference (VI) model and compared it to a Markov chain Monte Carlo (MCMC) method. MCMC proved to be slightly more accurate in several metrics but VI was very close. On the other hand, the variational method was 1000 times faster. It is also interesting to note that the computations were done on Cori, the DOE supercomputer, and the code was written in Julia.

Cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018) Mustafa Mustafa et al. demonstrate how a slightly modified standard GAN can be used to generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations. The results, shown in Figure 3 below, illustrate how the generated images match the validation tests. But, what is more important, the resulting images also pass a variety of statistical tests ranging from tests of the distribution of intensities to power spectrum analysis. They have made the code and data available at http://github.com/MustafaMustafa/cosmoGAN. The discussion section at the end of the paper speculates about the possibility of producing generative models that also incorporate choices for the cosmological parameters used in the simulations.

cosmo

Figure 3.  From Mustafa Mustafa et al., “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018).

Health Care

Medicine and health care are being transformed by digital technology. Imaging is the most obvious place where we see advanced technology.  Our understanding of the function of proteins and RNA has exploded with high-throughput sequence analysis. Generative methods are being used here as well. Riesselman, Ingraham and Marks, in “Deep generative models of genetic variation capture mutation effects”, consider the problem of how mutations to a protein disrupt its function. They developed a version of a variational autoencoder they call DeepSequence that is capable of predicting the likely effect of mutations as they evolve.

Another area of health care that is undergoing rapid change is health records. While clearly less glamorous than RNA and protein analysis, it is a part of medicine that has an impact on every patient. Our medical records are being digitized at a rapid rate and, once in digital form, they can be analyzed by many machine learning tools. Hwang, Choi and Yoon, in “Adversarial Training for Disease Prediction from Electronic Health Records with Missing Data” (arXiv:1711.04126v4 [cs.LG] 22 May 2018), address two important problems. First, medical records are often incomplete. They have missing values because certain test results were not correctly recorded, and the process of translating old paper forms to digital artifacts can introduce additional errors. Traditional methods of dealing with this are to introduce “zero” values or “averages” to fill the gaps prior to analysis, but this is not satisfactory. Autoencoders have been shown to be very good at removing noise from data (see https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543). Hwang and his colleagues applied this idea to medical records. The second thing they have done is to use a GAN to predict the disease from the “corrected” record. The type of GAN they use is an “AC-GAN” (see https://arxiv.org/pdf/1610.09585.pdf), which incorporates a class label with each training item. Feeding a class label along with the random latent variable forces the generator to create an output similar to training elements of that class. A byproduct is a discriminator that can tell if an input has the correct class label. In their case, they are interested in whether a given medical record predicts the occurrence of a tumor or not. Of course, this is far from usable as a sole diagnostic in a clinical setting, but it is a very interesting technology.
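
To make the imputation idea concrete, here is a minimal sketch (not the authors’ model) of a denoising autoencoder that learns to fill in masked record fields. It uses the same TensorFlow 1.x style as the GAN fragment later in this post; the layer sizes and the number of fields are illustrative assumptions.

import tensorflow as tf   # TensorFlow 1.x API

n_features = 64                                            # number of fields in a record (illustrative)
records = tf.placeholder(tf.float32, [None, n_features])   # complete training records
mask = tf.placeholder(tf.float32, [None, n_features])      # 1 = observed, 0 = missing

corrupted = records * mask                                 # simulate missing values by zeroing masked fields
hidden = tf.layers.dense(corrupted, 32, activation=tf.nn.relu)
code = tf.layers.dense(hidden, 16, activation=tf.nn.relu)
reconstructed = tf.layers.dense(tf.layers.dense(code, 32, activation=tf.nn.relu), n_features)

# Train the autoencoder to recover the full record from the corrupted one.
loss = tf.reduce_mean(tf.square(reconstructed - records))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

At inference time the trained network is applied to a record with its missing fields zeroed, and the reconstructed values are used in place of the gaps rather than zeros or column averages.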

Drug Design

One exciting application of these techniques is in the design of drugs. The traditional approach is high throughput screening, in which large collections of chemicals are tested against potential targets to see if any have potential therapeutic effects. Machine learning techniques have been applied to the problem for many years, but recently various deep learning methods have shown surprisingly promising results. One of the inspirations for the recent work has been the recognition that molecular structures have properties similar to natural language (see Cadeddu, A., et al. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angewandte Chemie 2014, 126). More specifically, there are phrases and grammar rules in chemical compounds that have statistical properties not unlike natural language. There is a standard string representation called SMILES that can be used to illustrate these properties. SMILES representations describe atoms and their bonds and valences based on a depth-first tree traversal of a chemical graph. In modern machine learning, language tasks such as machine translation are aided by recurrent neural networks. As we illustrated in our book, an RNN trained with lots of business news text is capable of generating realistic sounding business news headlines from a single starting word, though close inspection reveals that the content is nonsense. However, there is no reason we cannot apply RNNs to SMILES strings to see if they can generate new molecules. Fortunately, there are sanity tests that can be applied to generated SMILES strings to filter out the meaningless and incorrectly structured compounds. This was done by a team at Novartis (Ertl et al. Generation of novel chemical matter using the LSTM neural network, arXiv:1712.07449), who demonstrated that these techniques could generate billions of new drug-like molecules. Anvita Gupta, Alex T. Müller, Berend J. H. Huisman, Jens A. Fuchs, Petra Schneider and Gisbert Schneider applied very similar ideas in “Generative Recurrent Networks for De Novo Drug Design”. They demonstrated that if they started with fragments of a drug of interest they could use the RNN and transfer learning to generate new variations that may be very important. Another similar result is from Artur Kadurin et al. in “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico”.
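
As an illustration of the kind of sanity filter mentioned above (not the Novartis team’s actual pipeline), the open source RDKit package can parse a SMILES string and reject ones that do not correspond to a valid molecule:

from rdkit import Chem

# A few candidate strings of the kind an RNN might emit; the last one is deliberately malformed.
candidates = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "C1CC1((("]

valid = []
for smiles in candidates:
    mol = Chem.MolFromSmiles(smiles)            # returns None if the string is not a valid molecule
    if mol is not None:
        valid.append(Chem.MolToSmiles(mol))     # canonical form, useful for de-duplication

print(valid)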

Shahar Harel and Kira Radinsky take a different approach in “Prototype-Based Compound Discovery using Deep Generative Models”. Their model is motivated by a standard drug discovery process which involves starting with a molecule, called a prototype, that has certain known useful properties, and making modifications to it based on scientific experience and intuition. Harel and Radinsky designed a very interesting variational autoencoder, shown in Figure 2 above. As with several of the other efforts, they start with a SMILES representation of the prototype. The first step generates an embedding space for the SMILES “language”. The characters in the prototype sequence are embedded and fed to a layer of convolutions that allow local structures to emerge as shorter vectors that are concatenated, and a final all-to-all layer is used to generate a sequence of mean and variance vectors for the prototype. This is fed to a “diversity layer” which adds randomness.

The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

Summary

These are only a small sample of the research on the uses of generative neural networks in science.   We must now return to the question posed in the introduction: when are these applications of neural networks advancing science?   We should first ask what the role of ‘computational science’ is.  It was argued in the 1990s that computing and massive computational simulation had become the third paradigm of science because it was the only way to test theories for which it was impossible to design physical experiments.   Simulation of the evolution of the universe is a great example.   These simulations allowed us to test theories because they were based on theoretical models: if the simulated universe did not look much like our own, perhaps the theory is wrong.   By 2007 Data Science was promoted as the fourth paradigm.   By mining the vast amounts of data we generate and collect, we can certainly validate or disprove scientific claims.    But when can a network generating synthetic images qualify as science?  It is not driven by theoretical models.   Generative models can create statistical simulations that are remarkable duplicates of the statistical properties of natural systems.   In doing so they provide a space to explore that can stimulate discovery.   There are three reasons why this can be important.

  • The value of ‘life-like’ samples. In “3D convolutional GAN for fast Simulation”, F. Carminati, G. Khattak and S. Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation.  But a well-tuned GAN is able to generate test cases that fit the right statistical model at the rate needed for deployment.
  • Medical records-based diagnosis. The work on medical records by Hwang described above shows that using a VAE to “remove noise” is statistically superior to leaving gaps blank or filling in averages.   Furthermore, their ability to predict disease is extremely promising as science.
  • Inspiring drug discovery. The work of Harel and Radinsky shows us that a VAE can expand the scope of potential drugs for further study.   This is an advance in engineering if not science.

Can it replace simulation for validating models derived from theory?  Generative neural networks are not yet able to replace simulation.   But perhaps theory can evolve so that it can be tested in new ways.

Part 2. Generative Models Tutorial

Generative models are among the most interesting deep neural networks and they abound with applications in science. There are two main types of GMs and, within each type, several interesting variations. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. The key here is the definition of ‘coherent’. One can say the collection is coherent if, when you are presented with a new example, it is a simple task to decide if it belongs to the collection or not. For example, if the data collection consists entirely of pictures of cats, then a picture of a dog should, with reasonably high probability, be easily recognized as an outlier and not a cat. Of course, there are always rather extreme cats that would fool most casual observers, which is why we must describe our collection of objects in terms of probability distributions. Let us assume our collection c is naturally embedded in  R^m  for some m; for example, images with m pixels or other high dimensional instrument data. A simple way to think about a generative model is as a mathematical device that transforms samples from a multivariate normal distribution  \mathcal{N}^k (0,1)  so that they look like they come from the distribution  p_c(X)  for our collection c. Think of it as a function

 Gen: v \sim \mathcal{N}^k (0,1) \to R^m

Another useful way to say this is to build another machine we can call a discriminator

 Disc: R^m \to [0,1]

such that for  X \in R^m ,  Disc(X)  is the probability that X is in the collection c. To make this more “discriminating”, let us also insist that  Disc(Gen(v)) = 0 .  In other words, the discriminator is designed to discriminate between the real c objects and the generated ones. Of course, if the generator is really doing a good job of imitating  p_c , then a discriminator with this condition would be very hard to build.  In that case we would expect  Disc(Gen(v)) = 1 .

Generative Adversarial Networks

Generative adversarial networks (GANs) were introduced by Goodfellow et al. (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects.  The method works by designing two networks: one for the generator and one for the discriminator. Define  p_z  to be the distribution of latent variables that the generator will map to the collection space. The idea behind the paper is to simultaneously design the discriminator and the generator as a two-player min-max game.

The discriminator is trained to recognize objects from c (pushing  Disc(X)  toward 1 for  X \sim p_c ) while pushing  Disc(Gen(v))  toward zero for  v \sim p_z .   The resulting function

 \max_{Disc} \; E_{X \sim p_c}[\log Disc(X)] + E_{v \sim p_z}[\log(1 - Disc(Gen(v)))]

represents the min-max objective for the discriminator.

On the other hand, the generator wants to push  Disc(Gen(v))  toward 1, thereby maximizing  E_{v \sim p_z}[\log Disc(Gen(v))] .   To do that we minimize

 -E_{v \sim p_z}[\log Disc(Gen(v))] ,

which is exactly the generator loss used in the code fragment below.

There are literally dozens of implementations of GANs in TensorFlow or Keras on-line.   Below is an example from one that works with 40×40 color images.   This fragment shows the step of setting up the training optimization.

import tensorflow as tf   # TensorFlow 1.x API; generator() and discriminator() are defined earlier in the full example

#These two placeholders are used for input into the generator and discriminator, respectively.
z_in = tf.placeholder(shape=[None,128],dtype=tf.float32) #Random vector
real_in = tf.placeholder(shape=[None,40,40,3],dtype=tf.float32) #Real images
Gz = generator(z_in) #Generates images from random z vectors
Dx = discriminator(real_in) #Produces probabilities for real images
Dg = discriminator(Gz,reuse=True) #Produces probabilities for generator images
#These functions together define the optimization objective of the GAN.
d_loss = -tf.reduce_mean(tf.log(Dx) + tf.log(1.-Dg)) #This optimizes the discriminator.
g_loss = -tf.reduce_mean(tf.log(Dg)) #This optimizes the generator.
tvars = tf.trainable_variables()
#The below code is responsible for applying gradient descent to update the GAN.
trainerD = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)
trainerG = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)

#Only update the weights for the discriminator network.
d_grads = trainerD.compute_gradients(d_loss,tvars[9:]) 
#Only update the weights for the generator network.
g_grads = trainerG.compute_gradients(g_loss,tvars[0:9]) 
update_D = trainerD.apply_gradients(d_grads)
update_G = trainerG.apply_gradients(g_grads)

We tested this with a very small collection of images of galaxies found on the web.  There are three types: elliptical, spiral and barred spiral.  Figure 4 below shows some high-resolution samples from the collection.

(Note:  the examples in this section use pictures of galaxies but, in terms of the discussion in the previous part of this article, these are illustrations only.  There are no scientific results; just algorithm demonstrations.)

galaxy_sample

Figure 4.  Sample high-resolution galaxy images

We reduced the images to 40 by 40 and trained the GAN on this very small collection.  Drawing samples at random from the latent z-space we can now generate synthetic images.  The images we used here are only 40 by 40 pixels, so the results are not very spectacular.  As shown below, the generator is clearly able to generate elliptical and spiral forms.  In the next section we work with images that are 1024 by 1024 and get much more impressive results.

gan_40_galaxies.png

Figure 5.   Synthetic Galaxies produced by the GAN from 40×40 images.

Variational Autoencoders

The second general category of generative models is based on variational autoencoders. An autoencoder transforms our collection of object representations into a space of much smaller dimension in such a way that the compressed representation can be used to recreate the original object with reasonably high fidelity. The system has an encoder network that creates the embedding in the smaller space and a decoder which uses that representation to regenerate an image, as shown below in Figure 6.

ae

Figure 6. Generic Autoencoder

In other words, we want  Dec(Enc(X_i))  to approximate  X_i  for each i in an enumeration of our collection of objects.  To train our networks we simply want to minimize the distance between  X_i  and  Dec(Enc(X_i))  for each i.   If we further set up the network inputs and outputs so that they are in the range [0, 1] we can model each output as a Bernoulli distribution, so cross entropy is a better function to minimize.  In this case the cross entropy can be calculated as

 H(X_i, \hat{X}_i) = -\sum_{j} \left[ x_{i,j} \log \hat{x}_{i,j} + (1 - x_{i,j}) \log(1 - \hat{x}_{i,j}) \right] , \quad \text{where } \hat{X}_i = Dec(Enc(X_i))

(see http://www.godeep.ml/cross-entropy-likelihood-relation/  for a derivation)

A variational autoencoder differs from a general one in that we want the encoder to create an embedding that is very close to a normal distribution in the embedding space.  The way we do this is to make the encoder produce a representation consisting of a mean and a standard deviation.  To force it into a reasonably compact space we will force our encoder to be as close to  \mathcal{N}^k (0,1)  as possible. To do that we need a way to measure how far a distribution p is from a Gaussian q. That is given by the Kullback-Leibler divergence, which measures how many extra bits (or ‘nats’) are needed to convert an optimal code for distribution q into an optimal code for distribution p.

 KL(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

If both p and q are Gaussian this is easy to calculate (though not as easy to derive).

In terms of probability distributions we can think of our encoder as  q(z|x) , where x is a training image. We are going to assume  q(z|x)  is normally distributed and parameterized by  \mu(x)  and  \sigma(x) .  Computing  KL(q(z|x) \| \mathcal{N}^k (0,1))  is now easy. We call this the latent loss and it is

 KL(q(z|x) \| \mathcal{N}^k (0,1)) = \frac{1}{2} \sum_{j=1}^{k} \left( \mu_j(x)^2 + \sigma_j(x)^2 - 1 - \log \sigma_j(x)^2 \right)

(see https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians for a derivation).

We now construct our encoder to produce  \mu(x)  and  \sigma(x) .  To sample from this latent space, we simply draw  v \sim \mathcal{N}^k (0,1)  and transform it into the right space by computing  z = \mu(x) + \sigma(x) \odot v .   Our encoder and decoder networks can now be linked as follows.

s30.JPG

The loss function is now the sum of two terms:

 Loss(X_i) = H(X_i, Dec(z_i)) + KL(q(z|X_i) \| \mathcal{N}^k (0,1)) , \quad \text{where } z_i \text{ is sampled from } q(z|X_i)

Note: there is a Bayesian approach to deriving this; see https://jaan.io/what-is-variational-autoencoder-vae-tutorial for an excellent discussion.
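
To make the two loss terms concrete, here is a minimal sketch of a VAE loss written in the same TensorFlow 1.x style as the GAN fragment above. It is not the notebook linked below; the layer sizes and the flattened-image input are illustrative assumptions.

import tensorflow as tf   # TensorFlow 1.x API

m, k = 40 * 40 * 3, 128                        # flattened image size and latent dimension (illustrative)
x = tf.placeholder(tf.float32, [None, m])      # input images scaled to [0, 1]

# Encoder produces the mean and log-variance of q(z|x).
h = tf.layers.dense(x, 512, activation=tf.nn.relu)
mu = tf.layers.dense(h, k)
log_sigma_sq = tf.layers.dense(h, k)

# Reparameterization: z = mu + sigma * v with v ~ N(0, 1).
v = tf.random_normal(tf.shape(mu))
z = mu + tf.exp(0.5 * log_sigma_sq) * v

# Decoder reconstructs the image from z.
h_dec = tf.layers.dense(z, 512, activation=tf.nn.relu)
x_hat = tf.layers.dense(h_dec, m, activation=tf.nn.sigmoid)

# Reconstruction (cross entropy) loss plus the latent (KL) loss.
eps = 1e-8
recon_loss = -tf.reduce_sum(x * tf.log(x_hat + eps) + (1. - x) * tf.log(1. - x_hat + eps), axis=1)
latent_loss = 0.5 * tf.reduce_sum(tf.square(mu) + tf.exp(log_sigma_sq) - 1. - log_sigma_sq, axis=1)
loss = tf.reduce_mean(recon_loss + latent_loss)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)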

One of the interesting properties of VAEs is that they do not require massive data sets to converge.   Using our simple galaxy photo collection we trained a simple VAE.  The results showing the test inputs and the reconstructed images are illustrated below.

var-recon

Figure 7.   Test input and reconstruction from the galaxy image collection.   These images are 1024×1024.

Using encodings of five of the images we created a path through the latent space to make the gif movie that is shown below.  While not every intermediate “galaxy” looks as good as some of the originals, it does present many reasonable “synthetic” galaxies that are on the path between two real ones.

movie9

Figure 8.  An image from the “movie”.

The notebook for this autoencoder is available as html (see https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.html) and as a jupyter notebook (see https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.ipynb )  The compressed tarball of the galaxy images is here: https://s3.us-east-2.amazonaws.com/a-book/galaxies.tar.gz.

acGANs

The generative networks described above are just the basic variety.    One very useful addition is the Auxiliary Classifier GAN.    An acGAN allows you to incorporate knowledge about the class of the objects in your collection into the process.   For example, suppose you have labeled images, such as pictures of dogs labeled “dog” and pictures of cats labeled “cat”.    The original paper on this subject, “Conditional Image Synthesis with Auxiliary Classifier GANs” by Odena, Olah and Shlens, shows how the generator can be modified so that it takes a class label in addition to the random latent variable and generates a new element similar to the training examples of that class. The training is augmented with an additional loss term that models the class of the training examples.
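
A minimal sketch of that extra loss term, in the same TensorFlow 1.x style as the GAN fragment earlier in this post (this is a schematic, not the Odena et al. implementation): the discriminator is assumed to return a real/fake probability plus class logits, and both networks are penalized when those logits do not match the known labels.

import tensorflow as tf   # TensorFlow 1.x API

labels = tf.placeholder(tf.int32, [None])   # class labels for the training (and generated) batch

def acgan_losses(D_real, D_fake, class_logits_real, class_logits_fake, labels):
    # Standard adversarial terms, as in the earlier GAN fragment.
    d_adv = -tf.reduce_mean(tf.log(D_real) + tf.log(1. - D_fake))
    g_adv = -tf.reduce_mean(tf.log(D_fake))
    # Auxiliary classifier terms: real and generated images should both be classified correctly.
    aux_real = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=class_logits_real)
    aux_fake = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=class_logits_fake)
    d_loss = d_adv + aux_real + aux_fake   # discriminator also learns the classes
    g_loss = g_adv + aux_fake              # generator is rewarded for class-consistent samples
    return d_loss, g_loss

Here d_loss and g_loss would replace the two losses in the earlier fragment; everything else about the training loop stays the same.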

There are many more fascinating examples.   We will describe them in more detail in a later post.

Building a “ChatBot” for Scientific Research

We can use Alexa, Cortana and Siri to check the weather, stream our favorite music and look up facts like “who directed Casablanca?” But if I want to find all the research on quantum entanglement in the design of topological quantum computers, these services fall short.   If, in addition, I want these articles cataloged in my personal archive and the citations automatically pulled, I need a much more robust digital assistant.  Raj Reddy talks about a broader and more pervasive future for digital assistants he calls Cognition Amplifiers and Guardian Angels.  In this post we look at chatbots and their limitations and show how to build a simple, voice-driven scientific metasearch tool we call the research assistant.  Finally, we discuss the next phase of the research assistant.

Smart Speakers and Chatbots.

The revolution in “smart speaker” digital assistants like Siri, Amazon Echo and Google Home is showing us the power of voice as input to smart cloud services.   These assistants can take notes, tell us the weather conditions, place on-line orders for us and much more.   I even allow Microsoft’s Cortana to read my email.  If I send the message “I’ll get back to you tomorrow” to a friend, Cortana will remind me the next day that a response is needed.  Amazon allows people to add “skills” (additional capabilities) to their Alexa system.  These smart speakers are designed around an open-ended question-response scenario.  They leverage very powerful speech-to-text technology and semantic analysis systems to extract a query or command.  The query answers are derived from web-data analysis and the commands go to a limited number of native services (or external skills).

A chatbot is a system that engages the user in a dialog.   Chatbots go beyond the question-answering smart speakers and are usually designed to help people interact with the services of a specific company or agency.   Google, IBM, Amazon and Microsoft have all introduced cloud services to help anybody build a chatbot.   These services guide you through the process of building and training a chatbot.   A good example is Google’s Dialogflow.   Using this tool to create a bot, you specify three things:

  • Intents – mappings between what the user says and how you wish the system to respond.
  • Entities – the categories of subjects that your bot understands. For example, if your bot is a front end to your clothing store, one category of entity may be product type: shoes, dress, pants, hats; another entity might be size: large, xl, small, medium; and another might be color.
  • Context – knowledge obtained by the bot during the course of the conversation, for example the name of the user or the user’s favorite shoe color.

The goal of the bot is to extract enough information from the user to take a meaningful action such as fulfilling an order.   The hard part is designing the intents so that a dialog can lead to the desired conclusion.  The resulting system is called an agent and the flow of action is illustrated in Figure 1.  It might start with ‘Good morning, how can I help you?’ and end with a summary of the discussion.  As the programmer you need to supply as many variations on possible user questions and responses as you can, and you must annotate these with markers where your entities can be extracted.    Your examples are used in a training phase that maps your intents and entities together and builds a model that will learn variations on the input and not repeat the same sequence of responses each time, so it seems less robotic to the user.  A schematic of what goes into one intent definition is sketched below.
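
This is only a schematic of the ingredients of a single intent, not Dialogflow’s actual configuration format; the phrases, entity names and response template are invented for illustration.

# Hypothetical sketch of the pieces the developer supplies for one intent.
order_product_intent = {
    "intent": "order.product",
    "training_phrases": [
        "I would like to buy some [red](color) [shoes](product_type) in size [small](size)",
        "do you have [large](size) [hats](product_type)?",
        "I want a [blue](color) [dress](product_type)",
    ],
    "entities": {
        "product_type": ["shoes", "dress", "pants", "hats"],
        "size": ["small", "medium", "large", "xl"],
        "color": ["red", "blue", "green", "black"],
    },
    "response": "OK, looking for {color} {product_type} in size {size}.",
}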

Amazon’s Lex provides a similar service that also integrates with their lambda services.  Microsoft has the Azure Bot Service and IBM has Watson assistant chatbot tools.

dialogflow

Figure 1. Google Dialogflow Agent architecture.

These tools are all designed to help you build a very focused conversation with the user about a narrow universe, such as the on-line part of a business.   But this raises the question: can one build a chatbot that can carry out an open-ended conversation, perhaps one that could pass the Turing test?   The research literature on the subject is growing and deep learning tools like recurrent and convolutional neural networks have been applied to the problem (see https://arxiv.org/pdf/1606.07056.pdf , https://arxiv.org/pdf/1605.05110.pdf and more).   Unfortunately, chatbots designed to engage in open-ended conversation have had only limited success.   Xiaoice is one that interacts with users on the Chinese micro-blogging service Weibo.  The problem is that while it sounds like conversation, it is mindless.  Microsoft’s Tay was an English language version that operated on Twitter until it was taken down after only 16 hours because of the unfortunate language it had learned.  A successor, Zo, seems to be working better, but it does not produce conversations with intellectual content.

There is an excellent pair of articles by Denny Britz about the role of deep learning in conversational assistants.   He makes the point that for open-ended conversations (he calls them open-domain) the challenges are large compared to fixed-domain chatbots because so much more world knowledge is required.

Cognition Amplifiers and the Research Assistant.

In the spring of 2018 Raj Reddy gave the keynote presentation at the IEEE services congress.  His topic was one he has addressed before, and it clearly resonated with the audience.   He described Cognition Amplifiers and Guardian Angels.  He defined a Cognition Amplifier (COG) as a Personal Enduring Autonomic Intelligent Agent that anticipates what you want to do and helps you do it with less effort, and a Guardian Angel (GAT) as a Personal Enduring Autonomic Intelligent Agent that discovers and warns you about unanticipated events that could impact your safety, security and happiness.

Consider now the application of the Cognition Amplifier to scientific research.   If you are working on writing a research paper, you may wish your autonomic research assistant to provide a fast and accurate search of the scientific literature for a specific list of scientific concepts. In fact, as you write the paper, the assistant should be able to pick up the key concepts or issues and provide a real-time bibliography of related papers, and these should be stored and indexed in a private space on the cloud.  Extracting key phrases from technical documents is already a heavily researched field, so applying this technology to the problem is not a great leap.   However, key phrase extraction is not the whole challenge.   Take the sentence “It seems that these days investors put a higher value on growth than they do on profitability”.  The categorical topic is value of growth vs. profitability – investor opinions, which is not simply a key phrase but a concept, and we need the research assistant to look for concepts.    Your research assistant should always understand and track the context of your projects.

Finally, a good research assistant for science should be able to help with the analytical part of science.  For example, it should help locate public data in the cloud related to experiments involving your topics of interest.   The assistant should be able to help formulate, launch and monitor data analysis workflows. Then coordinate and catalog the results.

And, of course, if your autonomous research assistant is also a Guardian Angel, it will also keep you informed of grant reporting deadlines and perhaps pull together a draft quarterly report for your funding agency.

I fully expect that it is possible to build such an agent in the years ahead.   However, the remainder of this article is a simple demo that is a far cry from the full research assistant agent described above.

The Research Assistant Metasearch Tool.

In the following paragraphs we describe a very simple voice-driven agent that can be used to look for research articles about scientific topics.  We also show how such a system can be assembled from various simple devices and cloud services.   The system we describe is not very sophisticated.  In fact, it is not much better than Cortana at finding things given an English input sentence.  However, we feel it does illustrate the architecture of a voice-driven agent that can be built by gluing together easy-to-use cloud services.

Our scenario is a researcher sitting near the device and asking about very specific research topics such as “physical models of immune response” or “programming a topological quantum computer”.    We assume the user wants a spoken response if that response is simple, but we also realize that this is impractical if the system is creating a list of research journal papers.  To address this issue, the system also has a display in a web browser.  (We note that Cortana and the Google assistant do the same if the response is a list.)

Figure 2 illustrates the basic architecture of the system.

architectureofRA

Figure 2.  The architecture of the research assistant.

The components of the system are:

  1. The voice-to-text translator. Here we use a simple voice kit from Google.   This consists of some special hardware and a Raspberry Pi 2 computer all packaged in an elegant cardboard box.  You wake the system up by pressing the button on top and speaking.   The audio is captured and sent to the Google voice service for transcription and it is returned as a text string.
  2. The next step is to parse the text string into the components that will allow us to extract the topics of the query.  This is another cloud service call, this time to Algorithmia.com, where the service is called key phrases.  (We wrote about this in a previous article.)  The service takes English sentences, invokes Google’s Parsey McParseface (another Algorithmia.com AI service) and returns a list composed of three types of phrases: subjects (s), actions (a) and objects (o).  It also flags prepositional phrases with a “/” character.   So, for example, “I am interested in physical models of immune response” returns

          [‘s: I ‘, ‘a: am ‘,  ‘o: interested /in physical models /of immune response.’]

  3. The analysis engine is a 500-line Python Dash-based web server that extracts the topics and a few other items of interest and decides how to search and display the results on the browser. There are three web services used for the search: Wikipedia, Bing and Cornell’s ArXiv service[1].  To see how this works, consider the example sentence “research papers by Michael Eichmair about the gannon-lee singularity are of interest“. The analysis engine detects the topic as the gannon-lee singularity and Michael Eichmair as the author.  The fact that research papers are of interest indicates that we should look in the Cornell ArXiv repository of papers.   (The results for this query are at the end of this section, and a minimal sketch of this extraction step appears after this list.)    (Truth in advertising: our parser and analysis are far from perfect.   For example, “tell me about Deep Learning” vs. “tell me about deep learning” yield two different parses.  The first yields

            [‘a: tell ‘, ‘o: me /about Deep Learning ‘]

           which is fine. But the second gives us

          [‘a: tell ‘, ‘o: me about deep ‘, ‘a: learning ‘]

           which causes the analysis to fail. )

  4. Finally, we use the Amazon Lex services to generate the audio reading of the Wikipedia results. If you have an AWS account, the Python API is easy to use.
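
The following is a minimal, hypothetical sketch of the kind of extraction the analysis engine performs on the parsed phrase list and of the follow-on arXiv query. The parsing heuristics shown here are invented for illustration; they are not the actual 500-line server.

import urllib.parse
import urllib.request

def extract_topic(phrases):
    """Pull a rough topic string out of the key-phrase output, e.g.
    ['s: I ', 'a: am ', 'o: interested /in physical models /of immune response.']"""
    topic_words = []
    for phrase in phrases:
        if phrase.startswith('o:'):
            for chunk in phrase[2:].split('/')[1:]:      # keep the prepositional chunks
                words = chunk.split()
                topic_words.extend(words[1:])            # drop the preposition itself
    return ' '.join(topic_words).rstrip('.')

def search_arxiv(topic, max_results=10):
    """Query the public arXiv API for papers about the topic."""
    query = urllib.parse.urlencode({
        'search_query': 'all:' + topic,
        'max_results': max_results,
    })
    with urllib.request.urlopen('http://export.arxiv.org/api/query?' + query) as response:
        return response.read().decode('utf-8')           # Atom XML listing of matching papers

phrases = ['s: I ', 'a: am ', 'o: interested /in physical models /of immune response.']
topic = extract_topic(phrases)
print(topic)          # -> 'physical models immune response'
feed = search_arxiv(topic)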

Examples

Figure 3 illustrates the interface.  We have started with the statement “I am interested in physical models of immune response.”

fig3-chatbot

Figure 3.   The interface provides a switch to allow the Wikipedia response to be read aloud.   In this case we have typed the English statement of intent into the query box and hit the “Translate and Search” button.

We respond with the phrase “look for it in Wikipedia” and get the result in figure 4.    Following that response, we say “how about research papers” and we get the response in figure 5.

webpage3

Figure 4.   The response to “look for it in wikipedia”.  A short summary from Wikipedia is extracted along with related images found on the subject.  The spoken response is controlled by the audio control  at the top of the response.

webpage4

Figure 5.   The mention of research papers suggests that we should consult the Cornell library arXiv.    Shown above is only the first result of 10 listed on the page.

Returning to the example mentioned above “research papers by Michael Eichmair about the gannon-lee singularity are of interest” we get the following results.   You will notice that the Wikipedia result is a default hit for “big bang singularity” and not directly related to the precise query.  The Bing results and the ArXiv hits are accurate.

webpage5

Figure 6.  Results for the query “research papers by Michael Eichmair about the gannon-lee singularity are of interest”.  (This page was slightly edited to shorten the list of Bing results.)

The system has a limited capability to pull together concepts that are distributed over multiple sentences.  For example the input string “what are anyons?   How do they relate to topological quantum computation?” will build the topic “anyons topological quantum computation”.

If you are interested in trying the system, point your browser here.  I will try to keep it up and running for a few months. There is no voice input because that requires a dedicated Google voice kit on your desk.   You need to decide if you want a playback of the audio for Wikipedia summaries.   If you don’t want it, simply press the “Read Aloud” button.  Then enter a query and press the “Translate and Search” button.   Here are some samples to try:

  1. what did they say about laughter in the 19th century?
  2. are there research papers about laughter by Sucheta Ghosh?
  3. what can you tell me about Quantum Entanglement research at Stanford? (this one fails!)
  4. what can you tell me about research on Quantum Entanglement at Stanford?
  5. what are anyons? How do they relate to topological quantum computation?
  6. Who was Winslow Homer? (this one gives lots of images)
  7. I am interested in gravitational collapse. (respond with web, Wikipedia or arxiv)

As you experiment, you will find MANY errors.  This toy is easily confused.   Please email me examples that break it.  Of course, feedback and suggestions are always welcome.  I can make some of the source code available if there is interest. However, this is still too unreliable for public github.

[1] There are other arguably superior sources we would like to have used.  For example, Google Scholar would be perfect, but they have legal restrictions on invoking that service from an application like ours.  Dblp is also of interest but it is restricted to computer science.

The State of the Cloud for Science -2018

Introduction

This post is based on a talk I prepared for the Scientific Cloud Computing Workshop at HPDC 2018.

Two years ago, Ian Foster and I started writing  “Cloud Computing for Science and Engineering”.   That book covers fundamental cloud tools and computational models, but there are some topics we alluded to but did not explore fully because they were still on the horizon.  In other cases, we were simply clueless about changes that were coming. Data center design and cloud services have evolved in some amazing ways in just two years and many of these changes represent opportunities for using cloud technology in scientific research.

Whose cloud?

Any  discussion of cloud computing in science leads to the question of definition.  What defines a cloud for science?   For example, the European Open Science Cloud (EOSC) is a European-wide virtual environment for data sharing and collaboration.  That project will involve multiple data archives, research labs and HPC centers, commercial service providers and EU agencies and funding.  It is truly an important project.  However, my focus here is on the fundamental technologies that are driving hardware and software innovation, and these tend to come from a combination of academic, open source and commercial providers.   The most ubiquitous commercial clouds are:

  • Amazon Web Services (AWS) – 40% of all commercial cloud resources on the planet,
  • Microsoft Azure – about 50% of AWS but growing,
  • Google Cloud – a solid and growing third place,
  • IBM Bluemix – growing very fast and by some measures now bigger than Google.

There are many more, smaller or more specialized providers: Salesforce, DigitalOcean, Rackspace, 1&1, UpCloud, CityCloud, CloudSigma, CloudWatt, Aruba, CloudFerro, Orange, OVH, T-Systems.

There are a number of smaller cloud systems that have been deployed for scientific research.  They  include Aristotle, Bionimbus, Chameleon, RedCloud, indigo-datacloud, EU-Brazil Cloud,  and the NSF JetStream.  The advantage of these research clouds is that they can be optimized for use by a specific user community in ways not possible in a commercial cloud.  For example, Chameleon is funded by the US NSF to support basic computer systems research at the foundational level which is not possible when the foundation is proprietary.

Are these clouds of any value to Science?

When the first commercial clouds were introduced in 2008, the scientific community took interest and asked if there was value there.  In 2011 the official answer to this question seemed to be “no”.  Two papers (see end note 1) described research experiments designed to address this question.   The conclusion of both papers was that these systems were no match for traditional supercomputers for running MPI-based simulation and modeling.   And, in 2010, they were correct.   Early cloud data centers were racks of off-the-shelf PCs and the networks had terrible bisection bandwidth and long latencies.   They were no match for a proper HPC cluster or supercomputer.

Over the last few years, others have recognized a different set of roles for the cloud in science that go beyond traditional supercomputer simulation.   The biology community was quick to adopt cloud computing, especially when it is necessary to do large scale analysis on thousands of independent data samples.  These applications range from metagenomics to protein folding.   These computations can each fit on a single server, so network bandwidth is not an issue and, using the scale of the cloud, it is easy to launch thousands of them simultaneously.   Hosting and sharing large public scientific data collections is another important application.   Google, AWS, Microsoft and others have large collections and they are also providing new ways to host services to explore this data.

However, there are at least three additional areas where the cloud is a great platform for science.

Analysis of streaming data

Microsoft’s AI for Earth project (Figure 1) looks at the application of streaming data from sources such as satellites to do land cover analysis, sensors on and above farmland to improve agriculture outcomes, and crowd-sourced data to help understand biodiversity.

urban0

Figure 1.  Applications of streaming include land cover analysis, using sensor data for improving agriculture and biodiversity.   From  https://www.microsoft.com/en-us/aiforearth

The internet of things is exploding with sensor data, as are on-line experiments of all types.  This data will be aggregated in edge computing networks that do initial analysis with results fed to cloud systems for further analysis.   Urban informatics is a topic that has emerged as critical to the survival of our cities and the people who live in them.  Sensors of all types are being deployed in cities to help understand traffic flow, microclimates, local pollution and energy waste.  Taken together, this sensor data can paint a portrait of the city that planners can use to guide its future.  Streaming data and edge computing is a topic that will involve the growing capabilities and architecture of the cloud.  We will return to this later in this document.

Interactive big data exploration

Being able to explore and interact with data is a critical component of science.   If the data fits on our laptop we can use tools like Matlab, Excel or Mathematica to conduct small computational experiments and visualize the results.  If the data is too big, it must be stored on something bigger than that laptop.   Traditional supercomputers are up to the task of doing the analysis, but because they are designed around batch computing they are not well suited to interactive use.   The cloud exists to host services that can be used by thousands of simultaneous users.   In addition, there is a new generation of interactive data analysis tools that are cloud-ready for use on very large data collections.  This collection of tools includes Spark and Python Dask.   In both cases these tools can be driven by the open-source Jupyter notebook, which provides a powerful, interactive compute and visualization tool.  The commercial providers have adapted Jupyter and its interactive computational model into their offerings.   Google has Cloud Datalab (Figure 2), Amazon uses Jupyter with its SageMaker machine learning platform and Microsoft provides a special data science virtual machine that runs JupyterHub so that teams of users can collaborate.

google-datalab

Figure 2.  Google’s Cloud Datalab allows SQL-like queries to be combined with Python code and visualization in a Jupyter-based web interface. https://cloud.google.com/datalab/

Being able to interact with data at scale is part of the power of the cloud.   As this capability is combined with advanced cloud hosted machine learning tools and other services, some very promising possibilities arise.
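
As a small illustration of this interactive style, here is a sketch of a Jupyter cell using Dask to aggregate a collection of CSV files too large for a laptop; the bucket path and column names are invented for the example.

import dask.dataframe as dd

# Lazily read a directory of CSV files that is far larger than local memory.
df = dd.read_csv('s3://my-bucket/sensor-data/*.csv')     # hypothetical bucket and schema

# Build a query; nothing is computed until .compute() is called.
station_mean = df.groupby('station_id')['temperature'].mean()

print(station_mean.compute())    # triggers the parallel read and aggregation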

The quest for AI and an intelligent assistant for research

The commercial clouds were originally built to  host web search engines.   Improving those search engines led to a greater engagement of the tech companies with machine learning.   That work led to deep learning which enabled machine language translation,  remarkably strong spoken language recognition and generation and image analysis with object recognition.  Many of these  capabilities rival humans in accuracy and speed.  AI is now the holy grail for the tech industry.

One outcome of this has been the proliferation of voice-driven digital assistants such as Amazon’s Echo, Microsoft’s Cortana, Apple’s Siri and Google Assistant.   When first introduced, these were novelties, but their ability to give us local information, do web searches and keep our calendars has improved considerably.   I believe there is an opportunity for science here.

Ask the question “what would it take to make Alexa or Cortana help with my research?”   The following use cases come to mind.

  1. Provide a fast and accurate search of the scientific literature for a given specific scientific concept or topic and not just a keyword or specific author. Then ask who is working on or has worked on this topic?  Is there public data in the cloud related to experiments involving this topic?  Translate and transcribe related audio and video.
  2. Understand and track the context of your projects.
  3. Help formulate, launch and monitor data analysis workflows. Then coordinate and catalog results.  If state-space search is involved, automatically narrow the search based on promising findings.
  4. Coordinate meetings and results from collaborators.

If I ask Alexa to do any of this now, she would politely say “Sorry.  I can’t help you with that.”  But with the current rate of change in cloud AI tools, ten years seems like a reasonable timeframe.

siri-grandson

Figure 3.  Siri’s science geek grandson.

Technical Revolutions in the Cloud

Two of the three scenarios above are here now or very close.   The third is some ways off.   There have been three major changes in cloud technology in the past five years and some aspects of these changes are true game-changers for the industry.    The first, and most obvious, is the change in scale of cloud deployments. The two leaders, AWS and Azure, are planetary in scale.  This is illustrated in Figure 4 below.

cloud-scale

Figure 4.  A 2016 map of cloud coverage from Atomia.com.  https://www.atomia.com/2016/11/24/comparing-the-geographical-coverage-of-aws-azure-and-google-cloud/ There is some inaccuracy here because AWS and Azure define regional data centers differently, so counting the dots is not a good comparison.   In addition, data centers are now under construction in South Africa and the Middle East.

This map does not include all the data centers run by the smaller cloud providers.

Cloud Services

A major departure from the early days of the cloud, where scientists focused on storage and servers, has been an explosion in pay-by-the-hour cloud hosted services.  In addition to basic IaaS the types of services available now are:

  • App services: Basic web hosting, mobile app backend
  • Streaming data: IoT data streams, web log streams, instruments
  • Security services: user authentication, delegation of authorization, privacy, etc.
  • Analytics: database, BI, app optimization, stream analytics
  • Integrative: networking, management services, automation

In addition, the hottest new services are AI machine learning services for mapping, image classification, voice-to-text and text-to-voice services and text semantic analysis.   Tools to build and train voice activated bots are also now widely available.   We will take a look at two examples.

A planet scale database

The Azure Cosmos DB is a database platform that is globally distributed.   Of course, distributing data across international boundaries is a sensitive topic, so the Cosmos platform allows the database creator to pick the exact locations where copies should reside.   When you create an instance of the database you use a map of Azure data centers and select the locations as shown in Figure 5.

cosmos1

Figure 5.  Cosmos DB.  A database created in central US and replicated in Europe, South India and Brazil.

The database supports four data models: documents, key-value, graph and NoSQL (column-family) tables.  In addition, there are five different consistency models the user can select: eventual, consistent prefix, session, bounded staleness and strong consistency, all with a 99.9% guarantee of less than 15 ms latency.  My own experiments validated many of these claims.

Cloud AI Services

The commercial clouds are in a race to see who can provide the most interesting and useful AI services on their cloud platform.   This work began in the research laboratories in universities and companies over the past 25 years, but the big breakthroughs came when deep learning models trained on massive data collections began to reach levels of human accuracy.  For some time now, the public cloud companies have provided custom virtual machines that make it easy for technically sophisticated customers to use state of the art ML and neural network tools like TensorFlow, CNTK and others.  But the real competition is now to provide services for building smart applications that can be used by developers lacking advanced training in machine learning and AI. We now have speech recognition, language translation, image recognition capabilities that can be easily integrated into web and mobile applications.

We gave this a try with services that use a technique called Transfer Learning to make it possible to re-train a deep neural network to recognize objects from a narrow category using a very small training set.   We chose images of galaxies and used the services of IBM Watson, Azure and Amazon.   Figure 6 illustrates the results from IBM’s tool.  The results were surprisingly good.

galaxies

Figure 6.  IBM’s Watson recognizing previously unseen images of three different galaxies.  The details of this study are here: https://esciencegroup.com/2018/02/16/cloud-services-for-transfer-learning-on-deep-neural-networks/

The Revolution in Cloud Service Design

Making all of these services work, perform reliably and scale to thousands of concurrent users forced a revolution in cloud software design.    In order to support these applications, the tech companies needed a way to design them so that they could be scaled rapidly and updated easily.   They settled on a design pattern based on the idea of breaking the applications into small stateless components with well-defined interfaces.   Statelessness means that a component can be replaced easily if it crashes or needs to be upgraded.   Of course, not everything can be stateless, so state is saved in cloud hosted databases.   Each component is a “microservice” and it can be built from containers or functions.  This design pattern is now referred to as “cloud native” design.   Applications built and managed this way include Netflix, Amazon, Facebook, Twitter, Google Docs, Azure CosmosDB, Azure Event Hub, Cortana and Uber.

microservice

Figure 7.  Conceptual view of microservices as stateless services communicating with each other and saving needed state in distributed databases or tables.

To manage applications that require dozens to hundreds of concurrently running microservices, you need a software foundation or container orchestration system to monitor the services and schedule them on available resources.  Several candidates emerged and are in use. Siri, for example, is composed of thousands of microservices running on the Apache Mesos system.   Recently cloud providers have settled on a de facto standard container orchestrator, built by Google and released as open source, called Kubernetes.   It is now extremely easy for any customer to use Kubernetes on many cloud deployments to launch and manage cloud native applications.

Serverless Functions

The next step in the cloud software evolution was the introduction of “serverless functions”.   The original idea of cloud computing involved launching and managing a virtual machine.  However, suppose you want a cloud-based application whose sole job is to wait for an event to trigger some action.  For example, monitor a file directory and wait for a change such as the addition of a new file; when that happens, send email to a set of users alerting them of the change.   If this is a rare event, you don’t want to pay for an idle virtual machine that is polling some service looking for a change.  Amazon was the first to introduce the concept of a function-as-a-service.   With AWS Lambda, you only need to describe the function in terms of the trigger event and the output actions it takes when the event appears.  As illustrated in Figure 8, there are many possible triggers and corresponding output channels.

lambda

Figure 8.   From Amazon AWS.   The triggers and outputs of a lambda function.

In addition to AWS Lambda, Azure Functions, Google Cloud Functions and IBM OpenWhisk are similar systems.  OpenWhisk is now open source.  Another open source solution is Kubeless, which allows you to deploy a lambda-like system on top of your Kubernetes cluster.   These serverless systems let you scale up and down extremely rapidly and automatically.   You can have hundreds of instances responding to events at once, and the cost is based on charge-by-use models.  AWS has also introduced AWS Fargate, which allows any containerized application to run in serverless mode.

The Edge and the Fog

The frontier of cloud computing is now at the edge of the network.  This has long been the home of content distribution systems where content can be cached and accessed quickly, but now that the Internet-of-Things (IoT) is upon us, it is increasingly important to do some computing at the edge.  For example, if you have a thousand tiny sensors in a sensitive environment or farm and you need to control water from sprinklers or detect severe weather conditions, it is necessary to gather the data, do some analysis and signal an action.   If the sensors are all sending WIFI messages they may be routable to the cloud, but a more common solution is to provide local computing that can do some event preprocessing and response while forwarding summary data to the cloud.  That local computing is called the Edge, or, if it is a distributed system of edge servers, the Fog.

If serverless functions are designed to respond to signals, that suggests that it should be possible to extend them to run in the edge servers rather than the cloud.  AWS was the first to do this with a tool called Greengrass that provides a special runtime system that allows us to push/migrate lambda functions or microservices from the data center to the edge.   More recently Microsoft has introduced Azure IoT Edge, which is built on open container technologies.  Using an instance of the open source Virtual Kubelet deployed on the edge devices, we can make our Kubernetes containers run on the edge.  You can think of a Kubelet as the part of Kubernetes that runs on a single node. This enables Kubernetes clusters to span the cloud and the edge as illustrated in Figure 9.

edge-cloud

Figure 9.  A sketch of migrating containerized functions to the edge.   That way our IoT devices can communicate with locally deployed microservices, and those microservices can communicate with cloud-based services.  The edge function containers can also be updated and replaced remotely like any other microservice.

The Evolution of the Data Center

As mentioned at the beginning of this paper, in the early days (2005) cloud data centers were built from very simple servers, and the networks were designed for outgoing internet traffic rather than the bisection bandwidth needed for parallel computing.   However, by 2008 interest in performance began to grow.  Special InfiniBand sub-networks were being installed at some data centers.  The conventional dual-core servers were being replaced by systems with up to 48 cores and multiple GPU accelerators.  By 2011 most of the commercial clouds and a few research clouds had replaced traditional networks with software defined networking.  To address the demand of some of its customers, in 2017 Microsoft added Cray® XC™ and Cray CS™ supercomputers to a few data centers and then acquired the company Cycle Computing.

From 2016 we have seen progress focused on performance and parallelism.   The driver of this activity has been AI and, more specifically, the deep neural networks (DNNs) driving all the new services.  There are many types of DNNs, but two of the most common are convolutional networks, which look like a linear sequence of special filters, and recurrent networks which, as the name implies, are networks with a feedback component.   And there are two phases to neural network design and use.  The first is the training phase, which often requires massive parallelism and time, but it is usually an off-line activity.  The second phase is called inference, and it refers to the activity of evaluating the trained network on classification candidates.  For both convolutional and recurrent networks, inference boils down to doing a very large number of matrix-vector and matrix-matrix multiplies, where the coefficients of the matrix are the trained model and the vectors represent the classification candidates.      To deliver the performance at scale that was needed by the AI services it was essential to do these operations fast.  While GPUs were good, more speed was needed.
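To make that point concrete, here is a toy numpy sketch of the inference step for a single fully connected layer: the trained weights are a matrix, the candidate is a vector, and evaluation is a matrix-vector multiply followed by a nonlinearity.  All of the sizes and values here are illustrative stand-ins, not a real model.

import numpy as np

# stand-ins for a trained model: a 256x1000 weight matrix and a bias vector
W = np.random.rand(256, 1000).astype(np.float32)
b = np.random.rand(256).astype(np.float32)

def infer(x):
    # one layer of inference: matrix-vector multiply plus bias, then a ReLU
    return np.maximum(W @ x + b, 0.0)

candidate = np.random.rand(1000).astype(np.float32)   # a classification candidate
features = infer(candidate)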

Google’s Tensor Processing Unit

In 2015 Google introduced the Tensor Processing Unit (TPU) and the TPU2 in 2017.

google-tf1

google-tf2

Figure 10.  Above, Google data center.  https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-machine-learning-clusters/  Below architecture of Google Tensor Processing Unit TPU. From “In-Datacenter Performance Analysis of a Tensor Processing Unit”, Norman P. Jouppi et al.​  ISCA 2017  https://ai.google/research/pubs/pub46078

Figure 10 illustrates several racks of TPU-equipped servers and the functional diagram of the TPU.  One of the key components is the 8-bit matrix multiply unit capable of delivering 92 TeraOps/second (TOPS).   (It should be noted that DNNs can be trained and evaluated at precisions well below the IEEE floating-point standard, and floating-point formats with a small mantissa are popular.)  The multiply unit uses a systolic algorithm like those proposed for VLSI chips in the 1980s.
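The reduced-precision point can be illustrated with a toy numpy sketch: quantize trained float weights to 8-bit integers, do the multiply in integer arithmetic, and rescale the result.  This is only a cartoon of the idea, not the TPU’s actual arithmetic.

import numpy as np

W = np.random.randn(256, 1000).astype(np.float32)   # trained float weights (stand-ins)
x = np.random.randn(1000).astype(np.float32)        # a classification candidate

# symmetric 8-bit quantization: map each tensor's float range onto [-127, 127]
w_scale = np.abs(W).max() / 127.0
x_scale = np.abs(x).max() / 127.0
Wq = np.round(W / w_scale).astype(np.int8)
xq = np.round(x / x_scale).astype(np.int8)

# the multiply-accumulate is done in integer arithmetic, then rescaled back to float
y_int = Wq.astype(np.int32) @ xq.astype(np.int32)
y = y_int * (w_scale * x_scale)

# compare with the full-precision result; the relative error is small
print(np.max(np.abs(y - W @ x)) / np.max(np.abs(W @ x)))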

Microsoft’s Brainwave

In 2011 a small team in Microsoft Research led by Doug Burger began looking at the use of FPGAs to accelerate the search result ranking produced by Bing.   Over several iterations they arrived at a remarkable design that puts the FPGA between the network and the NIC, so that the FPGA can be configured into a separate plane of computation that can be managed and used independently from the CPU (see Figure 11).   Used in this way, groups of FPGAs can be configured into a subnetwork to handle tasks such as database queries and the inference stage of deep learning in addition to Bing query optimization.

brainwave0

Figure 11.  The Brainwave architecture.

brainwave

Figure 12. The brainwave software stack for mapping a DNN to one or more FPGAs. From BrainWave_HOTCHIPS2017.pptx, Eric Chung, et. al., https://vneetop.wordpress.com/ 2017/10/28/accelerating-persistent-neural-networks-at-datacenter-scale/

The team also built a software stack that makes this possibility a reality.   What makes much of this possible is that the models for DNNs are all based on flow graphs which describe sequences of tensor operations.  As shown in Figure 12 above, the flow graphs can be compiled to an internal graph representation that can be split and partitioned across one or more FPGAs.  They refer to the result as a hardware microservice.   Just recently Mary Wahl [see endnote 2] wrote a nice blog about the team’s work on using this idea to do deep learning inference on land cover maps.  Each compiled inference hardware microservice is mapped to a single FPGA, but they used 800 of the inference instances in parallel with 80 VMs to process 20 terabytes of aerial imagery into land cover data for the entire United States.   It took only about 10 minutes for a total cost of $42 [see endnote 3]. The code is in the blog and available on GitHub.

Conclusion

Cloud data centers are getting high performance networks (with latencies of only a few microseconds in the case of Azure Brainwave) and immense computing capacity such as the tensor processing capability of Google’s TPU.  At the same time, the designers of the first exascale supercomputers are having to deal with greater failure resilience and complexity.  For the next generation of exascale systems, the nodes will be variations on a theme of multicore processors and GPU-style accelerators.

Observed from a distance, one might conclude the architectures of cloud data centers and the next generation of supercomputers are converging.  However, it is important to keep in mind that the two are designed for different purposes. The cloud is optimized for fast response for services supporting many concurrent globally distributed clients. Supercomputers are optimized for exceptionally fast execution of programs on behalf of a small number of concurrent users.   However, it may be the case that an exascale system will be so large that parts of it can run many smaller parallel jobs at once.  Projects like Singularity provide a solution for running containerized applications on supercomputers in a manner similar to the way microservices are run on a cloud.

Possible Futures

The continuum: edge+cloud+supercomputer

There are interesting studies showing how supercomputers are very good at training very large deep neural networks.  Specifically, NERSC scientists have shown the importance of this capability in many science applications [4]. However, if you need to perform inference on models that are fed streams from the edge, you need the type of edge+cloud strategy described here.   It is not hard to imagine scenarios where vast numbers of instrument streams are handled by the edge and fed to inference models on the cloud, while those models are being continuously improved on a back-end supercomputer.

A data garden

In the near future, the most important contribution clouds can make to science is to provide access to important public data collections.  There is already a reasonable start.   AWS has an open data registry that has 57 data sets covering topics ranging from astronomy to genomics.   Microsoft Research has a Data Science for Research portal with a curated collection of datasets relating to human-computer interaction, data mining, geospatial, natural language processing and more.  Google Cloud has a large collection of public genomics datasets.  The US NIH has launched three new cloud data and analytics projects: the Cancer Genomics Cloud led by the Institute for Systems Biology on Google’s cloud, FireCloud from the Broad Institute, also using Google’s cloud, and the Cancer Genomics Cloud (CGC) powered by Seven Bridges.   These NIH facilities also provide analytics frameworks designed to help researchers access and effectively use the resources.

I am often asked about research challenges in cloud computing that students may wish to undertake.   There are many.  The fact that the IEEE cloud computing conference being held in San Francisco in July received nearly 300 submissions shows that the field is extremely active.   I find the following topics very interesting.

  1. Find new ways to extract knowledge from the growing cloud data garden.   This is a big challenge because the data is so heterogeneous and discovering the right tool to use to explore it requires expert knowledge.  Can we capture that community knowledge so that non-experts can find their way?  What are the right tools to facilitate collaborative data exploration?
  2. There are enormous opportunities for systems research in the edge-to-cloud-to-supercomputer path.  How does one create a system to manage and optimize workflows of activities that span this continuum?  Is there a good programming model for describing computations involving the edge and the cloud?  Can a program be automatically decomposed into the parts that are best run on the edge and the parts on cloud?  Can such a decomposition be dynamically adjusted to account for load, bandwidth constraints, etc.?
  3. Concerning the intelligent assistant for research, there are a number of reasonable projects short of building the entire thing.  Some may be low hanging fruit, and some may be very hard.  For example, ArXiv, Wikipedia, Google search and Bing are great for discovery, but in different ways.   Handling complex queries like “what is the role of quantum entanglement in the design of a quantum computer?” should lead to a summary of the answer with links.   There is a lot of research on summarization and there are a lot of sources of data.  Another type of query is “How can I access data on genetic indicators related to ALS?”  Google will go in the right direction, but it takes more digging to find the data.

These are rather broad topics, but progress on even the smallest part may be fun.

[1] L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan, I. Sakrejda, N. Desai, T. Declerck, and A. Liu. Magellan: Experiences from a science cloud. In 2nd International Workshop on Scientific Cloud Computing, pages 49–58, ACM, 2011.

P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and R. Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications. In 3rd Workshop on Scientific Cloud Computing, pages 41–50. ACM, 2012

[2] https://blogs.technet.microsoft.com/machinelearning/2018/05/29/how-to-use-fpgas-for-deep-learning-inference-to-perform-land-cover-mapping-on-terabytes-of-aerial-images/ a blog by Mary Wahl, Microsoft

[3] https://blogs.microsoft.com/green/2018/05/23/achievement-unlocked-nearly-200-million-images-into-a-national-land-cover-map-in-about-10-minutes/  from Lucas Joppa – Chief Environmental Scientist, Microsoft

[4] https://www.hpcwire.com/2018/03/19/deep-learning-at-15-pflops-enables-training-for-extreme-weather-identification-at-scale/

Parallel Programming in the Cloud with Python Dask

I am always looking for better ways to write parallel programs.  In chapter 7 of our book “Cloud Computing for Science and Engineering” we looked at various scalable parallel programming models that are used in the cloud.   We broke these down into five models: (1) HPC-style “Single Program Multiple Data” (SPMD), in which a single program communicates data with copies of itself running in parallel across a cluster of machines, (2) many task parallelism, which uses many nearly identical workers processing independent data sets, (3) map-reduce and bulk synchronous parallelism, in which computation is applied in parallel to parts of a data set and intermediate results of a final solution are shared at well-defined synchronization points,  (4) graph dataflow, which transforms a task workflow graph into sets of parallel operators communicating according to the workflow data dependencies, and (5) agents and microservices, in which a set of small stateless services process incoming data messages and generate messages for other microservices to consume.  While some applications that run in the cloud are very similar to the batch style of HPC workloads, parallel computing in the cloud is often driven by different classes of application requirements.  More specifically, many cloud applications require massive parallelism to respond to external events in real time.  This includes thousands of users that are using apps backed by cloud compute and data.   It also includes applications that are analyzing streams of data from remote sensors and other instruments.   Rather than running in batch mode with a start and an end, these applications tend to run continuously.

A second class of workload is interactive data analysis.   In these cases, a user is exploring a large collection of cloud-resident data.   The parallelism is required because of the size of the data: it is too big to download, and even if you could, the analysis would be too slow for interactive use.

We have powerful programming tools that can be used for each of the parallel computing models described above, but we don’t have a single programming tool that supports them all.   In our book we have used Python to illustrate many of the features and services available in the commercial clouds.  We have taken this approach because Python and Jupyter are so widely used in the science and data analytics community.  In 2014 the folks at Continuum (now just called Anaconda, Inc.) and several others released a Python tool called Dask which supports a form of parallelism similar to at least three of the five models described above.  The design objective for Dask is to support parallel data analytics and exploration on data that is too big to keep in memory.   Dask was not on our radar when we wrote the drafts for our book, but it is certainly worth discussing now.

Dask in Action

This is not intended as a full Dask tutorial.   The best tutorial material is the on-line YouTube videos of talks by Matthew Rocklin from Anaconda.   The official tutorials from Anaconda are also available.  In the examples we will discuss here we used three different Dask deployments.  The most trivial (and the most reliable) deployment was a laptop installation.  This worked on a Windows 10 PC and a Mac without problem.  As Dask is installed with the most recent release of Anaconda, simply update your Anaconda deployment, bring up a Jupyter notebook and “import dask”.    We also used the same deployment on a large Ubuntu Linux VM on a 48-core server on AWS.  Finally, we deployed Dask on Kubernetes clusters on Azure and AWS.

Our goal here is to use Dask to illustrate several of the cloud programming models described above.    We begin with many task parallelism, then explore bulk synchronous parallelism and a version of graph parallelism, and finally computing on streams.  We say a few words about SPMD computing at the end, but the role Dask plays there is very limited.

Many Task Parallelism and Distributed Parallel Data Structures

Data parallel computing is an old and important concept in parallel computing.  It describes a programming style where a single operation is applied to collections of data as a single parallel step. A number of important computer architectures supported data parallelism by providing machine instructions that can be applied to entire vectors or arrays of data in parallel.  Called Single Instruction, Multiple Data (SIMD) computers, these machines were among the first supercomputers and included the Illiac IV and the early Cray vector machines.  And the idea lives on as the core functionality of modern GPUs.   In the case of clusters of computers without a single instruction stream, we traditionally get data parallelism by distributing data structures over the memories of the nodes in the cluster and then coordinating the application of the operation in a thread on each node in parallel.   This is an old idea and it is central to Hadoop, Spark and many other parallel data analysis tools.   Python already has a good numerical array library called numpy, but it only supports sequential operations for arrays that fit in the memory of a single node.

Dask Concepts

Dask computations are carried out in two phases.   In the first phase the computation is rendered into a graph where the nodes are actual computations and the arcs represent data movements.   In the second phase the graph is scheduled to run on a set of resources.  This is illustrated below.  We will return to the details in this picture later.

dask-workflow

Figure 1.  Basic Dask operations: compile graph and then schedule on cluster

There are three different sets of “resources” that can be used.   One is a set of threads on the host machine.   Another is a set of processes, and the third is a cluster of machines.   In the case of threads and local processes the scheduling is done by the “single machine scheduler”.   In the case of a cluster it is called the distributed scheduler.  Each scheduler consumes a task graph and executes it on the corresponding host or cluster.   In our experiments we used a 48-core VM on AWS for the single machine scheduler. In the cluster case the preferred host is a set of containers managed by Kubernetes.   We deployed two Kubernetes clusters:  a three node cluster on Azure and a six node cluster on AWS.

Dask Arrays, Frames and Bags

Python programmers are used to numpy arrays, so Dask distributes arrays while maintaining as much of the numpy semantics as possible.  To illustrate this idea, consider the following numpy computation that creates a random 4 by 4 array, zeros out all elements less than 0.5 and computes the sum of the array with its transpose.

import numpy as np

x = np.random.random((4,4))
x[x<0.5] = 0
y = x+x.T

We can use Dask to make a distributed version of the same matrix and perform the same computations in parallel.

import dask.array as da

x = da.random.random(size=(4,4), chunks=(4,1))
x[x<0.5] = 0
y = x+x.T

The important new detail here is that we give explicit instructions on how we want the array to be distributed by specifying the shape of the chunks on each node.   In this case we have said we want each “chunk” to be a 4×1 slice of the 4×4 array.   We could have partitioned it into square blocks of size 2×2.   Dask takes care of managing each chunk and the needed communication between the processes that handle each chunk.   The individual chunks are managed on each thread/process/worker as numpy arrays.

As stated above, there are two parts to a dask computation.   The first phase is the construction of a graph representing the computation involving each chunk. We can actually take a look at the graph.   For example, in the computation above we can use the “visualize()” method as follows.

y = x+x.T
y.visualize()

big-transpose

Figure 2.   Sample Dask Graph for x+x.T

The nodes represent data or operations and the lines are data movements from one node to another.  As can be seen, this is a rather communication-intensive graph.   This is because the transpose operation requires that elements in the rows (which are distributed) be moved to columns on the appropriate node to do the addition.  The way we chunk the array can have a huge impact on the complexity of the distributed computation.  For example, 2×2 chunking makes this one very easy.   There are 4 chunks and doing the transpose involves only a simple swap of the “off diagonal” chunks.   In this case the graph is much simpler (and easier to read!), as the sketch and Figure 3 below show.
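A sketch of the same computation with 2×2 chunks, using the same Dask array calls as above:

import dask.array as da

x2 = da.random.random(size=(4, 4), chunks=(2, 2))   # four 2x2 blocks
x2[x2 < 0.5] = 0
y2 = x2 + x2.T
y2.visualize()    # the much simpler task graph shown in Figure 3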

small-transpose

Figure 3.  Task graph for x+x.T with 2×2 chunking of data

The second step for Dask is to send the graph to the scheduler to schedule the subtasks and execute them on the available resources. That step is accomplished with a call to the compute method.

y.compute()

Dask arrays support almost all the standard numpy array operations except those that involve complex communications such as sorting.

In addition to numpy-style arrays, Dask also has a feature called Dask dataframes, which are distributed versions of Pandas dataframes.   In this case each Dask dataframe is partitioned by blocks of rows, where each block is an actual Pandas dataframe.  In other words, Dask dataframe operators are wrappers around the corresponding Pandas operators in the same way that Dask array operators are wrappers around the corresponding numpy array operators.    The parallel work is done primarily by the local Pandas and numpy operators working simultaneously on the local blocks, and this is followed by the necessary data movement and computation required to knit the partial results together.  For example, suppose we have a dataframe, df, where each row is a record consisting of a name and a value and we would like to compute the sum of the values associated with each name.   We assume that names are repeated, so we need to group all records with the same name and then apply a sum operator.  We set this up on a system with three workers.  To see this computational graph we write the following.

df.groupby(['names']).sum().visualize()

groupby

Figure 4.  Dataframe groupby reduction
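For readers who want to reproduce something like this, a minimal sketch of how such a dataframe could be set up is shown below; the column names and partition count are simply the choices used in this toy description.

import pandas as pd
import dask.dataframe as dd

# a small Pandas dataframe of (name, value) records, repeated to give it some bulk
pdf = pd.DataFrame({'names': ['alice', 'bob', 'carol', 'alice'] * 1000,
                    'value': range(4000)})

# partition it by blocks of rows, one block per worker in our three-worker setup
df = dd.from_pandas(pdf, npartitions=3)
result = df.groupby(['names']).sum().compute()   # an ordinary Pandas dataframe comes back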

As stated earlier, one of the motivations of Dask is the ability to work with data collections that are far too large to load on your local machine.   For example, consider the problem of loading the New York City taxi data for an entire year.    It won’t fit on my laptop.   The data covers 245 million passenger rides and contains a wealth of information about each ride.  Though we can’t load this into our laptop, we can ask Dask to load it from a remote repository into our cloud deployment and automatically partition it using the read_csv function on the distributed dataframe object as shown below.

taxi1

Figure 5.  Processing Yellow Cab data for New York City

The persist method moves the dataframe into memory as a persistent object that can be reused without being recomputed.  (Note:  the read_csv method did not work on our Kubernetes clusters because of a missing module, s3fs, in the Dask container, but it did work on our large shared-memory VM, which has 200 GB of memory.)
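Figure 5 is a screenshot of the notebook; in code, the load-and-persist step looks roughly like the sketch below.  The S3 path is illustrative, and s3fs must be installed wherever the workers run.

import dask.dataframe as dd

# lazily read and partition a year's worth of CSV files directly from object storage
df = dd.read_csv('s3://example-taxi-bucket/yellow_tripdata_2015-*.csv')   # illustrative path

# materialize the partitions in cluster (or VM) memory so later queries don't re-read the files
df = df.persist()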

Having loaded the data we can now follow the dask demo example and compute the best hour to be a taxi driver based on the fraction of tip received for the ride.

taxi3

Figure 6.  New York City cab data analysis.
As you can see, it is best to be a taxi driver about 4 in the morning.

A more general distributed data structure is the Dask Bag, which can hold items of a less structured type than arrays and dataframes.   A nice example, http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html, illustrates using Dask bags to explore the Enron public email archive.

Dask Futures and Delayed

One of the more interesting Dask operators is one that implements a version of the old programming language concept of a future.   A related concept is lazy evaluation, and this is implemented with the dask.delayed function.   If you invoke a function with the delayed operator, it simply builds the graph but does not execute it.  Futures are different.    A future is a promise to deliver the result of a computation later.  The future computation begins executing, but the calling thread is handed a future object which can be passed around as a proxy for the result before the computation is finished.

The following example is a slightly modified version of one of the demo programs.   Suppose you have four functions

# each function does some "useless" math (summing a geometric series)
# so that the tasks take a measurable amount of time
def work(n=10000000):
    return sum(0.5**i for i in range(n))

def foo(x):
    return x + work()
def bar(x):
    return x + work()
def linear(x, y):
    return x + y + work()
def three(x, y, z):
    return x + y + z + work()

We will use the distributed scheduler to illustrate this example. We first must create a client for the scheduler. Running this on our Azure Kubernetes cluster we get the following.

 
from dask.distributed import Client
c = Client()
c

azure-scheduler

To illustrate the delayed interface, let us build a graph that composes our example functions

from dask import visualize, delayed

i = 3
x = delayed(foo)(i)
y = delayed(bar)(x)
z = delayed(linear)(x, y)
q = delayed(three)(x, y, z)
q.visualize(rankdir='LR')

In this example q is now a placeholder for the graph of a delayed computation.   As with the Dask array examples, we can visualize the graph (plotting it from left to right).

delayed-graph

Figure 7.  Graph of a delayed computation.

A call to compute will evaluate our graph.   Note that we have implemented the four functions so that each does about 1 second of useless computational math (computing the sum of a geometric series) so that we can measure some execution times.   Invoking compute on our delayed computation gives us

delayed_result

which shows us that there is no parallelism exploited here because the graph has serial dependences.

To create a future, we “submit” the function and its arguments to the scheduler client.  This immediately returns a reference to a future value and starts the computation.  When you need the result of the computation, the future has a method result() that can be invoked to make the calling thread wait until the computation is done.

Now let us consider the case where we need to evaluate this graph on 200 different values and then sum the results.   We can use futures to kick off a computation for each instance, wait for them to finish and sum the results.   Again, following the example in the Dask demos, we ran this on our Azure Kubernetes cluster.
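The code was roughly of the following form; this is a sketch reconstructed from the description, and the screenshot below shows the timings we measured.

futures = []
for i in range(200):
    x = c.submit(foo, i)            # each submit returns a future immediately
    y = c.submit(bar, x)            # futures can be passed as arguments to later tasks
    z = c.submit(linear, x, y)
    futures.append(c.submit(three, x, y, z))

total = c.submit(sum, futures)      # sum the 200 results on the cluster
total.result()                      # block until everything has finished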

futures-azure-result

Ignore the result of the computation (it is correct). The important result is the time. Calculating the time to run this sequentially (200*4.19 = 838 seconds) and dividing by the parallel execution time we get a parallel speed-up of about 2, which is not very impressive. Running the same computation on the AWS Kubernetes cluster we get a speed-up of 4. The Azure cluster has 6 cores and the AWS cluster has 12, so it is not surprising that it is twice as fast. The disappointment is that the speed-ups are not closer to 6 and 12 respectively.

aws48-future

Results with AWS Kubernetes Cluster

However, the results are much more impressive on our 48 core AWS virtual machine.

aws48-future2

Results with AWS 48-core VM

In this case we see a speed-up of 24.   The difference is the fact that the scheduling is using shared memory and threads.

Dask futures are a very powerful tool when used correctly.   In the example above, we spawned off 200 computations in less than a second.   If the work in the individual tasks is large, that execution time can mask much of the overhead of scheduler communication and the speed-ups can be much greater.

Dask Streams

Associated with Dask is a package called streamz that implements a basic streaming interface that allows you to compose graphs for stream processing.   We will just give the basic concepts here.   For a full tour look at https://streamz.readthedocs.io.   Streamz graphs have sources, operators and sinks.   We can start by defining some simple functions as we did for the futures case:

def inc(x):
    return x+13
def double(x):
    return 2*x
def fxy(x): #expects a tuple
    return x[0]+ x[1]
def add(x,y):
    return x+y

from streamz import Stream
source = Stream()

The next step will be to create a stream object and compose our graph.   We will describe the input to the stream later.   We use four special stream operators here.    Map is how we attach a function to the stream.   We can also merge two streams with a zip operator.   Zip waits until there is an available object on each stream and then creates a tuple that combines both into one object.   Our function fxy(x) above takes a tuple and adds its elements.   We can direct the output of a stream to a file, database, console output or another stream with the sink operator.  As shown below, our graph has two sink operators.

stream1

Figure 8.  Streamz stream processing pipeline.
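Since Figure 8 is a screenshot, here is a sketch of how a pipeline of that shape might be composed from the functions defined above; the exact wiring in the original notebook may differ.

s1 = source.map(inc)               # apply inc to each item arriving on the stream
s2 = source.map(double)            # a second branch applying double
pairs = s1.zip(s2)                 # wait for one item from each branch and pair them
summed = pairs.map(fxy)            # add the two elements of each pair
summed.sink(print)                 # first sink: print each per-item result
totals = summed.accumulate(add)    # carry a running total as state through the stream
totals.sink(print)                 # second sink: print the running total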

Visualizing the graph makes this clear.   Notice there is also an accumulate operator.   This allows state flowing through the stream to be captured and retained.   In this case we use it to create a running total.  To push something into the stream we can use the emit() operator as shown below.

stream2

The emit() operator is not the only way to send data into a stream. You can create the stream so that it takes events from Kafka, reads lines from a file, or monitors a file system directory looking for new items. To illustrate that, we created another stream to look at the home directory of our Kubernetes cluster on Azure. Then we started this file monitor. The names of the files that are there are printed. Next, we added another file “xx” and it picked it up. Next, we invoked the stream from above and then added another file “xxx”.

stream3

Handling Streams of Big Tasks

Of the five types of parallel programming, Dask covers two and a half:  many task parallelism, map-reduce and bulk synchronous parallelism, and part of graph dataflow.   Persistent microservices are not part of the picture.   However, Dask and Streamz can be used together to handle one of the use cases for microservices.  For example, suppose you have a stream of tasks and you need to do some processing on each task, but the arrival rate of tasks exceeds the rate at which you can process them.   We treated this case with microservices while processing image recognition with MXNet and the resnet-152 deep learning model (see this article).  One can use the Streamz sink operation to invoke a future to spawn the task on the Kubernetes cluster.   As the tasks finish, the results can be pushed to other processes for further work or to a table or other storage as illustrated below.

process-events

Figure 9.  Extracting parallelism from a stream.

In the picture we have a stream called Source which gathers the events from external sources.  We then map it to a function f() for initial processing. The result of that step is sent to a function called spawn_work, which creates a future around a function that does some deep processing and sends a final result to an AWS DynamoDB table.   (The function putintable(n) below shows an example.  It works by invoking a slow computation, then creating the appropriate DynamoDB metadata and putting the item in the table “dasktale”.)

def putintable(n):
    import boto3
    e = doexp(n*1000000)       # doexp is the slow demo computation described above
    # the '…' stands for the credential arguments, which were omitted in the original
    dyndb = boto3.resource('dynamodb', … , region_name='us-west-2' )
    item = {'daskstream':'str'+str(n), 'data': str(n), 'value': str(e)}
    table = dyndb.Table("dasktale")
    table.put_item(Item= item )
    return e

def spawn_work(n):
    x = cl.submit(putintable, n)   # cl is the distributed client created earlier

This example worked very well. Using futures allowed the input stream to work at full speed by exploiting the parallelism. (The only problem is that boto3 needs to be installed on all the Kubernetes cluster processes. Using the 48-core shared memory machine worked perfectly.)
Dask also has a queue mechanism so that results from futures can be pushed to a queue and another thread can pull these results out. We tried this as well, but the results were somewhat unreliable.

Conclusion

There are many more stream, futures, dataframe and bag operators described in the documentation.   While it is not clear if this stream processing tool will be robust enough to replace any of the other systems currently available, it is certainly a great, easy-to-use teaching tool.   In fact, this statement can be made about the entire collection of Dask-related tools.   I would not hesitate to use it in an undergraduate course on parallel programming.   And I believe that the Dask dataframe technology is very well suited to the challenge of big data analytics, as is Spark.

The example above that uses futures to extract parallelism from a stream is interesting because it is completely adaptive. However, it is essential to be able to launch arbitrary application containers from futures to make the system more widely applicable.   Some interesting initial work has been done on this at the San Diego Supercomputer Center using Singularity to launch jobs on their resources using Dask.   In addition, the UK Met Office is doing interesting things with autoscaling Dask clusters.   Dask and Streamz are still young.   I expect them to continue to evolve and mature in the year ahead.

CNTK Revisited. A New Deep Learning Toolkit Release from Microsoft

In a pair of articles from last winter (first article, second article) we looked at Microsoft’s “Computational Network Toolkit” and compared it to Google’s Tensorflow.   Microsoft has now released a major upgrade of the software and rebranded it as part of the Microsoft Cognitive Toolkit.  This release is a major improvement over the initial release.  Because these older articles still get a fair amount of web traffic we wanted to provide a proper update.

There are two major changes from the first release that you will see when you begin to look at the new release.   First is that CNTK now has a very nice Python API and, second, the documentation and examples are excellent.   The core concepts are the same as in the initial release.   The original programming model was based on configuration scripts and that is still there, but it has been improved and renamed as “Brain Script”.  Brain Script is still an excellent way to build custom networks, but we will focus on the Python API which is very well documented.

Installing the software from the binary builds is very easy on both Ubuntu Linux and Windows.   The process is described on the CNTK GitHub site.    On a Linux machine, simply download the gzipped tar binary and execute the installer.

$wget https://cntk.ai/BinaryDrop/CNTK-2-0-beta2-0-Linux-64bit-CPU-Only.tar.gz
$tar -xf CNTK-2-0-beta2-0-Linux-64bit-CPU-Only.tar.gz
$cd cntk/Scripts/linux
$./install-cntk.sh

This will install everything including a new version of Continuum’s Anaconda Python distribution.  It will also create a directory called “repos”.   To start Jupyter in the correct conda environment, do the following.

$source "your-path-to-cntk/activate-cntk"
$cd ~/repos/cntk/bindings/python/tutorials
$jupyter notebook

A very similar set of commands will install CNTK on your Windows 10 box. (If you are running Jupyter on a virtual machine or in the cloud, you will need additional arguments to the Jupyter notebook command, such as “--ip=0.0.0.0 --no-browser”, and then you can point your host browser to the VM IP address and port 8888. Of course, if it is a remote VM you should add a password.) What you will see is an excellent set of tutorials as shown in Figure 1.

jupyter-cntk

Figure 1.   CNTK tutorial Jupyter notebooks.

CNTK Python API

CNTK is a tool for building networks and the Python and Brain Script bindings are very similar in this regard.   You use the Python program to construct a network of tensors and then train and test that network through special operations which take advantage of underlying parallelism in the hardware such as multiple cores or multiple GPUs.   You can load data into the network through Python Numpy arrays or files.

The concept of constructing a computation graph for later execution is not new.   In fact, it is an established programming paradigm used in Spark, Tensorflow, and Python Dask.   To illustrate this in CNTK, consider the following code fragment that creates three input variables and constructs a trivial graph that does matrix-vector multiplication and vector addition.  We begin by creating three tensors that will hold the input values to the graph and then tie them to the matrix multiply operator and vector addition.

import numpy as np
import cntk
X = cntk.input_variable((1,2))
M = cntk.input_variable((2,3))
B = cntk.input_variable((1,3))
Y = cntk.times(X,M)+B

In this, X is a 1×2 dimensional tensor, i.e. a vector of length 2, M is a matrix that is 2×3 and B is a vector of length 3. The expression Y=X*M+B yields a vector of length 3. However, no computation has taken place. We have only constructed a graph of the computation. To invoke the graph we input values for X, B and M and then apply the “eval” operator on Y. We use numpy arrays to initialize the tensors and supply a dictionary of bindings to the eval operator as follows

x = [[ np.asarray([[40,50]]) ]]
m = [[ np.asarray([[1, 2, 3], [4, 5, 6]]) ]]
b = [[ np.asarray([1., 1., 1.])]]

print(Y.eval({X:x, M: m, B: b}))
array([[[[ 241.,  331.,  421.]]]], dtype=float32)

CNTK has several other important tensor containers but two important ones are

  • Constant(value=None, shape=None, dtype=None, device=None, name=''): a scalar, vector or other multi-dimensional tensor.
  • Parameter(shape=None, init=None, dtype=None, device=None, name=''): a variable whose value is modified during network training.

There are many more tensor operators and we are not going to go into them here.   However, one very important class is the set of operators that can be used to build multilevel neural networks.   Called the “Layers Library”, they form a critical part of CNTK.    One of the most basic is the Dense(dim) layer, which creates a fully connected layer of output dimension dim, as shown in Figure 2.

cntk-dense

Figure 2.   A fully connected layer created by the Dense operator with an implicit 3×6 matrix and a 1×6 vector of parameters labeled here M and B.   The input dimension is taken from the input vector V.  The activation here is the default (none), but it could be set to ReLU, Sigmoid or another function.
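A minimal sketch of applying Dense to an input variable follows; the layer sizes are arbitrary.

import cntk
from cntk.layers import Dense

V = cntk.input_variable(3)                        # a length-3 input vector
layer = Dense(6)(V)                               # implicit 3x6 weights and length-6 bias, no activation
relu_layer = Dense(6, activation=cntk.relu)(V)    # the same layer with a ReLU activation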

There are many standard layer types including Convolutional, MaxPooling, AveragePooling and LSTM. Layers can also be stacked with a very simple operator called “Sequential”. Two examples taken directly from the documentation are shown below. The first is a standard 4-level image recognition network based on convolutional layers.

with default_options(activation=relu):
    conv_net = Sequential ([
        # 3 layers of convolution and dimension reduction by pooling
        Convolution((5,5), 32, pad=True), MaxPooling((3,3), strides=(2,2)),
        Convolution((5,5), 32, pad=True), MaxPooling((3,3), strides=(2,2)),
        Convolution((5,5), 64, pad=True), MaxPooling((3,3), strides=(2,2)),
        # 2 dense layers for classification
        Dense(64),
        Dense(10, activation=None)
    ])

The other fun example is a slot tagger based on a recurrent LSTM network.

tagging_model = Sequential ([
    Embedding(150),         # embed into a 150-dimensional vector
    Recurrence(LSTM(300)),  # forward LSTM
    Dense(labelDim)         # word-wise classification
])

The Sequential operator can be thought of as a concatenation of the layers in the given sequence.   In the case of the slot tagger network, we see two additional important operators: Embedding and Recurrence.

Embedding is used for word embeddings where the inputs are sparse vectors of size equal to the word vocabulary (item i = 1 if the word is the i-th element of the vocabulary and 0 otherwise) and the embedding matrix is of size vocabulary-dimension by, in this case, 150.     The embedding matrix may be passed as a parameter or learned as part of training.

The Recurrence operator is used to wrap the correct LSTM output back to the input for the next input to the network.

A Closer Look at One of the Tutorials

The paragraphs above are intended to give you the basic feel of what CNTK looks like with its new Python interface.  The best way to learn more is to study the excellent example tutorials.

CNTK 203: Reinforcement Learning Basics

CNTK version 1 had several excellent tutorials, but version 2 has the Python notebook versions of these plus a few new ones.  One of the newest demos is an example of reinforcement learning.   This application of neural nets was first described in the paper Human-level control through deep reinforcement learning, by the Google DeepMind group.  The idea has proven to be very successful in systems that learn to play games.  This topic has received a lot of attention, so we were happy to see this tutorial included in CNTK. The example is a very simple game that involves balancing a stick.   More specifically, it uses the cart-pole configuration from the OpenAI Gym.   As shown in Figure 3, the system state can be described by a 4-tuple: position of the cart, its velocity, the angle of the pole and the angular velocity.   The idea of the game is simple.  You either push the cart to the left or the right and see if you can keep the stick vertical.   If you drift too far off course or the pole angle goes beyond 15 degrees, the game is over.   Your score is the total number of steps you take before failure. The full example is in the GitHub repository and we are not going to go through all the code here.  The Jupyter notebook for this example is excellent, but if you are new to this topic you may find some additional explanation of value in case you decide to dig into it.

cntk-cart-pole

Figure 3. Cart-Pole game configuration.

The part of reinforcement learning used here is called a Deep Q-Network. It uses a neural network to predict the best move when the cart is in a given state. This is done by implicitly modeling a function Q(s,a), which is the optimal future reward given state s and action a, where the immediate reward is r. They approximate Q(s,a) using the Bellman equation, which describes how to choose action a in a given state s to maximize the accumulated reward over time, based inductively on the same function applied to the following state s’.

cntk-bellmann
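Transcribed into standard notation, the equation in the image is the Q-learning form of the Bellman equation:

$$ Q(s,a) \;=\; r + \gamma \max_{a'} Q(s',a') $$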

The parameter gamma is a damping factor that guarantees the recurrence converges. (Another excellent reference for this topic is the blog by Jaromír Janisch.) The CNTK team approached this problem as follows. There are three classes.

  • Class Brain.    This holds our neural net and trainer.  There are three methods:
    • Create(), which is called at initialization.   It creates the network.   There are two tensor parameters: observation, which is used to hold the input state, and q_target, which is a tensor used for training.   The network is nice and simple:
      l1 = Dense(64, activation=relu)
      l2 = Dense(2)
      unbound_model = Sequential([l1, l2])
      model = unbound_model(observation)
      

      The training is by the usual stochastic gradient descent based on a loss measure.

      loss = reduce_mean(square(model - q_target), axis=0)
      meas = reduce_mean(square(model - q_target), axis=0)
      learner = sgd(model.parameters, lr,    
             gradient_clipping_threshold_per_sample=10)
      trainer = Trainer(model, loss, meas, learner)
      
    • Train(x, y)  which calls the trainer for batches of states x and predicted outcomes y which we will describe below
    • Predict(s) which invokes the model for state ‘s’ and returns a pair of optimal rewards given a left or right move.
  • Class Memory. This holds a record of recent moves.   This is used by the system to create training batches.  There are two methods:
    • Add(sample configuration) – adds a four-tuple consisting of a starting state, an action, a reward and a resulting state to the memory.
    • Sample(n) – returns a random sample of n configurations from the memory.
  • Class Agent which is the actor that picks the moves and uses the memory to train the network.  There are three methods here.
    • Act(state) returns a 0 or 1 (left move or right move) that will give the best reward for the given state.     At first it just makes random guesses, but as time passes it uses the Predict method of the Brain class to select the best move for the given state.
    • Observe(sample configuration) records a configuration in the memory and keeps track of the time step and another parameter used by Act.
    • Replay() is the main function for doing the learning.    This is the hardest part to understand in this tutorial. It works by grabbing a random batch of memorized configurations from memory.   What we do is use the current model to predict an optimal outcome and use that as the next step in training the model.  More specifically, for each tuple in the batch we want to turn it into a training sample so that the network behaves like the Bellman equation.  A tuple consists of the start state, the action, the reward and the following state.   We can apply our current model to predict the reward for the start state and also for the resulting state.  We can use this information to create a new reward tuple for the given action and start state that models the Bellman recurrence.   Our training example is the pair consisting of the start state and this newly predicted reward.  At first this is a pretty poor approximation, but amazingly over time it begins to converge. The pseudo code is shown below.
      x = numpy.zeros((batchLen, 4)).astype(np.float32)
      y = numpy.zeros((batchLen, 2)).astype(np.float32)

      for i in range(batchLen):
          s, a, r, s_ = batch[i]
          # s  = the original state (4-tuple)
          # a  = the action that was taken
          # r  = the reward that was given
          # s_ = the resulting state
          t = self.brain.predict(s)            # rewards the current network predicts for s
          r_ = self.brain.predict(s_)          # rewards it predicts for the resulting state s_
          t[a] = r + gamma * numpy.amax(r_)    # this step emulates the Bellman equation
          # (the full notebook also handles the terminal case where s_ is None)
          x[i] = s
          y[i] = t
      self.brain.train(x, y)
      

The final part of the program is now very simple. We have an environment object that returns a new state and a done flag for each action the agent takes. We simply run our agent until it falls out of bounds (the environment object returns done=True). If the step succeeded, we increment our score. The function to run the agent and to keep score is shown below.

def run(agent):
    s = env.reset()
    R = 0
    while True:
        a = agent.act(s)
        s_, r, done, info = env.step(a)
        if done: # terminal state
            s_ = None
        agent.observe((s, a, r, s_))
        agent.replay() #learn from the past
        s = s_
        R += r
        if done:
            return R

Each time we run “run” it learns a bit more.   After about 7000 runs it will take over 600 steps without failure.

The text above is no substitute for a careful study of the actual code in the notebook.  Also, as it is a notebook, you can have some fun experimenting with it.  We did.

Final Thoughts

CNTK is now as easy to use as any of the other deep learning toolkits.   While we have not benchmarked its performance, they claim it is extremely fast and that it makes good use of multiple GPUs and even clusters of servers.    We are certain that the user community will enjoy using and contributing to its success.

Citation.

The team that first created CNTK should be cited.   I know there are likely many others that have contributed to the open source release in one way or another, but the following is the master citation.

Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, T. Ryan Hoens, Xuedong Huang, Zhiheng Huang, Vladimir Ivanov, Alexey Kamenev, Philipp Kranen, Oleksii Kuchaiev, Wolfgang Manousek, Avner May, Bhaskar Mitra, Olivier Nano, Gaizka Navarro, Alexey Orlov, Hari Parthasarathi, Baolin Peng, Marko Radmilac, Alexey Reznichenko, Frank Seide, Michael L. Seltzer, Malcolm Slaney, Andreas Stolcke, Huaming Wang, Yongqiang Wang, Kaisheng Yao, Dong Yu, Yu Zhang, Geoffrey Zweig (in alphabetical order), “An Introduction to Computational Networks and the Computational Network Toolkit“, Microsoft Technical Report MSR-TR-2014-112, 2014.

A Brief Look at Google’s Cloud Datalab

Google recently released a beta version of a new tool for data analysis using the cloud called Datalab.  In the following paragraphs we take a brief look at it through some very simple examples.  While there are many nice features of Datalab, the easiest way to describe it would be to say that it is a nice integration of the IPython Jupyter notebook system with Google’s BigQuery data warehouse.  It also integrates standard IPython libraries such as graphics and scikit-learn and Google’s own machine learning toolkit TensorFlow.

To use it you will need a Google cloud account.   The free account is sufficient if you are interested in just trying it out.   You may ask, why do I need a Google account when I can use Jupyter, IPython and TensorFlow on my own resources?    The answer is you can easily access BigQuery on non-trivial sized data collections directly from the notebook running on your laptop.  To get started go to the Datalab home page.   It will tell you that this is a beta version and give you two choices: you may either install the Datalab package locally on your machine or you may install it on a VM in the Google cloud.   We prefer the local version because it saves your notebooks locally.

 The Google public data sets that are hosted in the BigQuery warehouse are fun to explore.  They include

  • The names on all US social security cards for births after 1879.  (The table rows contain only the year of birth, state, first name, gender and number as long as it is greater than 5.  No social security numbers.),
  • The New York City Taxi trips from 2009 to 2015,
  • All stories and comments from “Hacker News”,
  • The US Dept of Health weekly records of diseases reported from each city and state from 1888 to 2013,
  • The public data from the HathiTrust and the Internet Book Archive,
  • The global summary of the day (GSOD) weather data from the National Oceanic and Atmospheric Administration, covering 9000 weather stations between 1929 and 2016.

And more, including the 1000 genome database.

To run Datalab on your laptop you need to have Docker installed.   Once Docker is running and you have created a Google cloud account and a project, you can launch Datalab with a simple docker command as illustrated in their quick-start guide.  When the container is up and running you can view it at http://localhost:8081.  What you see at first is shown in Figure 1.  Keep in mind that this is beta release software, so you can expect it to change or go away completely.

 datalab-first-view

Figure 1.  Datalab Top level view.

Notice the icon in the upper right corner consisting of a box with an arrow.   Clicking this allows you to log in to the Google cloud, effectively authorizing your container to run against your gcloud account.

The view you see is the initial notebook hierarchy.   Inside docs is a directory called notebooks that contains many great tutorials and samples.

A Few Simple Examples of Using Datalab

As mentioned above, one of the public data collections is the list of first names from social security registrations.   Using Datalab we can look at a sample of this data by using one of the built-in BigQuery functions as shown in Figure 2.

datalab-names

Figure 2.   Sampling the names data.

 

This page gives us enough information about the schema that we can now formulate a query.

In modern America there is a movement toward “post-gender” names.   Typical examples cited on the web are “Dakota”, “Skyler” and “Tatum”.   A very simple SQL query can be formulated to see how the gender breakdown for these names shows up in the data.  In Datalab, we can formulate the query as shown in Figure 3.

datalab-dakotaplus2

Figure 3.   Breakdown by gender of three “post-gender” names.
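Figure 3 is a screenshot; the query behind it is of roughly the following form, issued here through Datalab’s Python BigQuery binding.  The public table name is our assumption and may differ from the copy of the data used in the notebook.

import gcp.bigquery as bq   # the module name used by the Datalab beta

names_query = bq.Query("""
  SELECT gender, SUM(number) AS total
  FROM [bigquery-public-data:usa_names.usa_1910_2013]
  WHERE name IN ('Dakota', 'Skyler', 'Tatum')
  GROUP BY gender
""")
names_query.to_dataframe()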

As we can see, this is very nearly gender balanced.  A closer inspection using each of the three names separately shows that “Skyler” tends to be ‘F’ and “Tatum” tends to be ‘M’. On the other hand, “Dakota” does seem to be truly post-gender, with 1052 ‘F’ and 1200 ‘M’ occurrences.

We can also consider the name “Billy” which, in the US, is almost gender neutral.   (Billy Mitchell was a famous World War I general, and there is also a contemporary jazz musician of the same name; both male. Billy Tipton and Billie Holiday were female musicians, though Billie Holiday was actually named Billie and Billy Tipton lived her life as a man, so perhaps they don’t count.)   We can ask how often Billy was used as a name associated with gender ‘F’ in the database.  It turns out it is most common in the southern US. We can then group these by state, create a count and show the top five.   The SQL command is easily inserted into the Datalab notebook as shown in Figure 4.

datalab-billy

Figure 4.   Search for Billy with gender ‘F’ and count and rank by state of birth.

Rubella in Washington and Indiana

A more interesting data collection is the Centers for Disease Control and Prevention dataset concerning diseases reported by state and city over a long period.   An interesting case is Rubella, a virus also known as the “German measles”.   Through our vaccination programs it has been eliminated in the U.S. except for those people who catch it in other countries where it still exists.  But in the 1960s it was a major problem, with an estimated 12 million cases in the US and a significant number of newborn deaths and birth defects.   The vaccine was introduced in 1969 and by 1975 the disease was almost gone.   The SQL script shown below is a slightly modified version of one from the Google BigQuery examples.   It has been modified to look for occurrences of Rubella in two states, Washington and Indiana, over the years 1970 and 1971.

%%sql --module rubella
SELECT
  *
FROM (
  SELECT
    *, MIN(z___rank) OVER (PARTITION BY cdc_reports_epi_week) AS z___min_rank
  FROM (
    SELECT
      *, RANK() OVER (PARTITION BY cdc_reports_state ORDER BY cdc_reports_epi_week ) AS z___rank
    FROM (
      SELECT
        cdc_reports.epi_week AS cdc_reports_epi_week,
        cdc_reports.state AS cdc_reports_state,
        COALESCE(CAST(SUM((FLOAT(cdc_reports.cases))) AS FLOAT),0) 
         AS cdc_reports_total_cases
      FROM
        [lookerdata:cdc.project_tycho_reports] AS cdc_reports
      WHERE
        (cdc_reports.disease = 'RUBELLA')
        AND (FLOOR(cdc_reports.epi_week/100) = 1970 
          OR FLOOR(cdc_reports.epi_week/100) = 1971)
        AND (cdc_reports.state = 'IN'
          OR cdc_reports.state = 'WA')
      GROUP EACH BY
        1, 2) ww ) aa ) xx
WHERE
  z___min_rank <= 500
LIMIT
  30000

We can now invoke this query as part of a python statement so we can capture its result as a pandas data frame and pull apart the time stamp fields and data values.

rubel = bq.Query(rubella).to_dataframe()
rubelIN = rubel[rubel['cdc_reports_state']=='IN'] \
                 .sort_values(by=['cdc_reports_epi_week'])
rubelWA = rubel[rubel['cdc_reports_state']=='WA'] \
                 .sort_values(by=['cdc_reports_epi_week'])
epiweekIN = rubelIN['cdc_reports_epi_week']
epiweekWA = rubelWA['cdc_reports_epi_week']
rubelINval = rubelIN['cdc_reports_total_cases']
rubelWAval = rubelWA['cdc_reports_total_cases']

At this point a small adjustment must be made to the time stamps.  The CDC reports times in epidemic weeks and there are 52 weeks in a year.    So the time stamp for the first week of 1970 is 197000 and the time stamp for the last week is 197051.  The next week is 197100.  To make these into timestamps that appear contiguous we need to make a small “time compression” as follows.

import numpy as np
# Map the epi-week codes (197000-197051, then 197100-197151) onto one contiguous week axis.
realweekI = np.empty([len(epiweekIN)])
realweekI[:] = epiweekIN[:]-197000
realweekI[51:] = realweekI[51:]-48

Doing the same thing with epiweekWA we now have the basis of something we can graph.  Figure 5 shows the progress of rubella in Washington and Indiana over two years.  Washington is the red line and Indiana is blue.   Note that the outbreaks occur about the same time in both states and that by late 1971 the disease is nearly gone.

datalab-rubella.png

Figure 5.   Progress of Rubella in Washington (red) and Indiana (blue) from 1970 through 1971.

Continuing the plot over 1972 and 1973 shows there are flare-ups of the disease each year, but their maximum size diminishes rapidly.

(Datalab has some very nice plotting functions, but we could not figure out how to do a double plot, so we used the matplotlib library with the “fivethirtyeight” style.)
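For the record, a minimal matplotlib version of that double plot is sketched below.  It assumes the epiweekWA series is compressed the same way as epiweekIN above, and the colors follow Figure 5.

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Apply the same week compression to the Washington series.
realweekW = np.empty([len(epiweekWA)])
realweekW[:] = epiweekWA[:]-197000
realweekW[51:] = realweekW[51:]-48

plt.plot(realweekW, rubelWAval, color='red', label='Washington')
plt.plot(realweekI, rubelINval, color='blue', label='Indiana')
plt.xlabel('epidemic week (1970-1971)')
plt.ylabel('reported Rubella cases')
plt.legend()
plt.show()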

 

A Look at the Weather

 

From the National Oceanic and Atmospheric Administration we have the Global Summary of the Day (GSOD) weather from 9000 weather stations between 1929 and 2016.   While not all of these stations were operating during that entire period, there is still a wealth of weather data here.   To illustrate it, we can use another variation on one of Google’s examples.  Let’s find the hottest spots in the state of Washington for 2015.   This was a particularly warm year that brought unusual droughts and fires to the state. The following query will list the hottest spots in the state for the year.

%%sql
SELECT
  max, (max-32)*5/9 celsius, mo, da, state, stn, name
FROM (
  SELECT
    max, mo, da, state, stn, name
  FROM
    [bigquery-public-data:noaa_gsod.gsod2015] a
  JOIN
    [bigquery-public-data:noaa_gsod.stations] b
  ON
    a.stn=b.usaf
    AND a.wban=b.wban
  WHERE
    state="WA"
    AND max < 500 )
ORDER BY
  max DESC
LIMIT
  10

 The data set ‘gsod2015’ is the table of data for the year 2015.  To get a list that also shows the name of the station we need to do a join with the ‘stations’ table over the corresponding station identifiers.  We order the results in descending order of the warmest recordings.    The resulting table is shown in Figure 6 for the top 10.

datalab-hotstations

Figure 6.   The top 10 hottest spots in Washington State for 2015

The results are what we would expect.   Walla Walla, Moses Lake and Tri Cities are in the eastern part of the state and summer was very hot there in 2015.   But  Skagit RGNL is in the Skagit Valley near Puget Sound.   Why is it 111 degrees F there in September?   If it is hot there what was the weather like in the nearby locations?   To find out which stations were nearby we can look at the stations on a map.   The query is simple but it took some trial and error.

%%sql --module stationsx
DEFINE QUERY locations
  SELECT FLOAT(lat/1000.0) AS lat, FLOAT(lon/1000.0) as lon, name
  FROM [bigquery-public-data:noaa_gsod.stations]
  WHERE state="WA" AND name != "SPOKANE NEXRAD"

It seems that the latitude and longitude for the Spokane NEXRAD station are incorrect and resolve to some point in Mongolia.  By removing it we get a good picture of the nearby stations as shown in Figure 7.

datalab-hotstations.png

Figure 7.   Location of weather stations in western Washington using the Bigquery chart map function.

 This is an interactive map, so we can get the names of the nearby stations.   There is one only a few miles away  called PADILLA BAY RESERVE and the next closest is BELLINGHAM INTL.   We can now compare the weather for 2015 at these three locations.

 To get the weather for each of these we need the station ID.   We can do that with a simple query.

%%sql
SELECT
  usaf, name
FROM [bigquery-public-data:noaa_gsod.stations] 
WHERE
    name="BELLINGHAM INTL" OR name="PADILLA BAY RESERVE" OR name = "SKAGIT RGNL"

Once we have our three station IDs we can use the following to build a parameterized Bigquery expression.

qry = "SELECT max AS temperature, \
       TIMESTAMP(STRING(year) + '-' + STRING(mo) + \
       '-' + STRING(da)) AS timestamp \
FROM [bigquery-public-data:noaa_gsod.gsod2015] \
WHERE stn = '%s' and max < 500 \
ORDER BY year DESC, mo DESC, da DESC"

stationlist = ['720272','727930', '727976']

dflist = [bq.Query(qry % station).to_dataframe() for station in stationlist]

 We can now render an image of the weather for our three stations as shown in Figure 8.
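The plotting step is a simple overlay of the three dataframes, roughly as sketched below; the mapping of station IDs to names (and hence to the colors in Figure 8) has to be matched against the result of the station query above.

import matplotlib.pyplot as plt

for df, station in zip(dflist, stationlist):
    plt.plot(df['timestamp'], df['temperature'], label='station ' + station)
plt.ylabel('max daily temperature (F)')
plt.legend()
plt.show()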

datalab-3stations-final.png

Figure  8.  Max daily temperatures for Skagit (blue), Padilla Bay (red) and Bellingham (yellow)

 We can clearly see the anomaly for Skagit in September, and it is also easy to spot another problem in March where the instruments seemed not to be recording.   Other than that, the readings align closely.

Conclusions

There are many features of Datalab that we have not demonstrated here.   The documentation gives an example of using Datalab with Tensorflow, and the charting capabilities are more extensive than demonstrated here.  (The Google Maps example was not reproducible in any notebook other than the sample demo, which we modified to run the code shown here.)  It is also easy to upload your own data to the warehouse and analyze it with Datalab.

 Using Datalab is almost addictive.  For every one of the data collections we demonstrated here there were many more questions we wanted to explore.  For example, where and when did the name “Dakota” start being used and how did its use spread?   Did the occurrence of Rubella outbreaks correspond to specific weather events?  Can we automate the process of detecting non-functioning weather instruments over the years where records exist?  These are all relatively standard data mining tasks, but the combination of Bigquery and IPython in the notebook format makes it fun.

 It should be noted that Datalab is certainly not the first use of the IPython notebook as a front-end to cloud hosted analysis tools.  The IPython notebook has been used frequently with Spark as we have previously described.  Those interested in an excellent overview of data science using Python should look at “Python Data Science Handbook” by Jake VanderPlas, which makes extensive use of IPython notebooks.     There are a variety of articles about using Jupyter on AWS and Azure for data analytics.  A good one is by Cathy Ye about deep learning using Jupyter in the cloud, where she gives detailed instructions for installing Jupyter on AWS and deploying Caffe there.


Kubernetes and the Google Cloud Container Service: Fun with Pods of Celery.

In a previous post I talked about using Mesosphere on Azure for scaling up many-tasks parallel jobs and I promised to return to Kubernetes when I figured out how to bring it up.   Google just made it all very simple with their new Google Cloud container services.   And, thanks to their good tutorials, I learned about a very elegant way to do remote procedure calls using another open source tool called Celery.

So let me set the stage with a variation on an example I have used in the past.   Suppose we have 10000 scientific documents that are stored in the cloud.   I would like to use a simple machine learning method to classify each of these by topic.    I would like to do this as quickly as possible and, because the analysis of each document is independent of the others, I can try to process as many as possible in parallel.  This is the basic “many task” parallel model and one of the most common uses of the cloud for scientific computing purposes.      To do this we will use the Celery distributed task queue mechanism to take a list of our documents and send each one to a work queue where the tasks will be parceled out to workers who will do the analysis and respond.

The Google Cloud Container Service and a few words about Kubernetes.

Before getting into the use of Celery and the analysis program, let’s describe the Google Cloud Container Service and a bit about Kubernetes.   Getting started is incredibly easy.   Google has a small free trial account which is sufficient to do the experiments described.  Go to http://cloud.google.com and sign in or create an account.  This will take you to the “console” portal. The first thing you need to do is to create a project. In doing so it will be assigned an id which is a string of the form “silicon-works-136723”.    There is a drop down menu on the left end of the blue banner at the top of the page.  (Look for three horizontal bars.) This allows you to select the type of service you want to work on.     Select the “Container Engine”.  On the “container clusters” page there is a link that will allow you to create a cluster.   With the free account you cannot make a very big cluster.   You are limited to about 4 dual core servers.   If you fill in the form and submit it, you will soon have a new cluster.  There is a special icon of the form “>_” in the blue banner.  Clicking on that icon will create an instance of a “Cloud Shell” that will be automatically authenticated to your account.   The page you will see should resemble Figure 1 below.   The next thing you need to do is to authenticate your cloud shell with your new cluster.   By selecting your container and clicking on the “connect” button to the right you will get the code to paste into the cloud shell.  The result should now look exactly like Figure 1.

google-container-engine

Figure 1.   Creating a Google cloud cluster and connecting the cloud shell to it.

Interacting with Kubernetes, which is now running on our small cluster, is through command lines which can be entered into the cloud shell.   Kubernetes has a different, and somewhat more interesting, architecture than other container management tools.   The basic unit of scheduling in Kubernetes is the pod. A pod consists of a set of one or more Docker-style containers together with a set of resources that are shared by the containers in that pod.  When launched, a pod resides on a single server or VM.   This has several advantages for the containers in that pod.   For example, because the containers in a pod are all running on the same VM, they all share the same IP and port space, so the containers can find each other through conventional means like “localhost”.   They can also share storage volumes that are local to the pod.

To start, let’s consider a simple single-container pod that runs the Jupyter notebook.  There is a standard Docker container that contains Jupyter and the scipy software stack.  Using the Kubernetes control command kubectl we can launch Jupyter and expose its port 8888 with the following statement.

$ kubectl run jupyter --image=jupyter/scipy-notebook --port=8888

To see that it is up and running we can issue the command “kubectl get pods”, which will return the status of all of our running pods.   Though we have launched Jupyter, it is still not visible outside the cluster.   To make it visible we will associate a load balancer with the pod.   This will expose port 8888 to the open Internet.

$ kubectl expose deployment jupyter --type=LoadBalancer

Once that has been run you can get the IP address for Jupyter from the “LoadBalancer Ingress:” field of the service description when you run the following.   If the address doesn’t appear at first, the load balancer is still being provisioned, so try again.

$ kubectl describe services jupyter

Once you have verified that it is working at that address on port 8888, you should shut it down immediately because, as you can see, there is no security with this deployment.  Deleting a deployment is easy.

$ kubectl delete deployment jupyter

 There is another point that one must be aware of when building containers that need to directly interact with the Google Cloud APIs.  To make this work you will need application default credentials to run in your container.   For example, if your container is going to interact with the storage services you will need this.   To get the default application credentials follow the instructions here.  We will say a few more words about this below.

The Analysis Example in detail.

Now let’s describe Celery and how to use it with Kubernetes in the many-task scenario described above.

To use Celery we start with our analysis program.   We have previously described the analysis algorithms in detail in another post, so we won’t duplicate that here.  Let’s start by assuming we have a function predict(doc) that takes a document as a string as an argument and returns a string containing the result from our trained machine learning classifiers.  Our categories are “Physics”, “Math”, “Bio”, “Computer Science” and “Finance” and the result from each classifier is simply the category that that classifier determines to be the most likely correct answer.

Celery is a distributed remote procedure call system for Python programs.   The Celery view of the world is you have a set of worker processes running on remote machines and a client process that is invoking functions that are executed on the remote machines.   The workers and the clients all coordinate through a message broker running somewhere else on the network.

Here we use a RabbitMQ service that is running on a Linux VM on the NSF JetStream cloud as illustrated in Figure 2.

kubernetes-jetstream-setup

Figure 2.  Experimental Configuration with Celery workers running on Kubernetes in the Google Cloud Container Service,  the RabbitMQ broker running in a VM on the NSF Jetstream Cloud and a client program running as a notebook on a laptop.

The code block below illustrates the basic Celery worker template.    Celery is initialized with a constructor that takes the name of the project and a link to the broker service which can be something like a Redis cache or MongoDB.   The main Celery magic is invoked with a special Python “decorator” associated with the Celery object as shown in the predictor.py file below.

from celery import Celery
app = Celery('predictor', backend='amqp')

#Now initialize and load all the data structure that will be constant 
#and reused for each analysis.  In our case this will include
#all the machine learning models that were trained on the data 
#previously. And create a main worker function to invoke the models.  
def invokeMLModels(statement):
    ....
	return analysis
	
#define the functions we will call remotely here
@app.task
def predict(statement):
	prediction = invokeMLModels(statement)
	return [prediction]

What this decorator accomplishes is to wrap the function in a manner that allows it to be invoked by a remote client.   To make this work we need to create a Celery worker from our predictor.py file with the command below, which registers a worker instance as a listener on the RabbitMQ queue.

>celery worker -A predictor -b 'amqp://guest@brokerIPaddr'

Creating a client program for our worker is very simple.   It is similar to the worker template except that our version of the predict( ) function does nothing, because we are going to invoke it with the special Celery apply_async( ) method that will push the argument to the broker queue and return control immediately to the client.   The object that is returned from this call is similar to what is sometimes called a “future” or a “promise” in the programming language literature.  It is a placeholder for the returned value.   Once we attempt to evaluate the get() method on this object our client will wait until a reply is returned from the remote worker that picked up the task.

from celery import Celery
app = Celery('predictor', broker='amqp://guest@brokerIPaddr', backend='amqp')

@app.task
def predict(statement):
	return ["stub call"]
	
res = predict.apply_async(["this is a science document ..."])

print res.get()

Now if we have 10000 documents to analyze we can send them in sequence to the queue as follows.

#load all the science abstracts into a list
documents = load_all_science_abstracts()

res = []
for doc in documents:
   res.append(predict.apply_async([doc]))

#now wait for them all to be done
predictions = [result.get() for result in res]
#now do an analysis of the predictions

Here we push each analysis task into the queue and save the async returned objects in a list.  Then we create a new list by waiting for each prediction value to be returned.   Our client can run anywhere there is Internet access.  For example, this one was debugged on a Jupyter instance running on a laptop.  All you need to do is “pip install celery” and run Jupyter.

There is much more to say about Celery and the interested reader should look at the Celery Project site for the definitive guide. Let us now turn to using this with Kubernetes.  We must first create a container to hold the analysis code and all the model data.   For that we will need a Docker file and a shell script to correctly launch Celery once the container is deployed.   For those actually interested in trying this, all the files and data are in OneDrive here.   The Docker file shown below has more than we need for this experiment.

# Version 0.1.0
FROM ipython/scipystack
MAINTAINER yourdockername "youremail"
RUN easy_install celery
RUN pip install -U Sphinx
RUN pip install Gcloud
RUN easy_install pattern
RUN easy_install nltk
RUN easy_install gensim
COPY bookproject-key.json /
COPY models /
COPY config /
COPY sciml_data_arxiv.p /
COPY predictor.py /
COPY script.sh /
ENTRYPOINT ["bash", "/script.sh"]

To build the image we first put all the machine learning configuration files in a directory called config and all the learned model files in a directory called models.  At the same level we have the predictor.py source.   For reasons we will explain later we will also include the full test data set: sciml_data_arxiv.p. The Docker build starts with the ipython/scipystack container.   We then use easy_install to install Celery as well as three packages used by the ML analyzers: pattern, nltk and gensim.   Though we are not going to use the Gcloud APIs here, we include them with a pip install.   But to make that work we need an updated copy of Sphinx.  To make the APIs work we would need our default client authentication keys.   They are stored in a json file called bookproject-key.json that was obtained from the Gcloud portal as described previously.   Finally we copy all of the files and directories to the root path ‘/’.  Note that the copy from a directory is a copy of all the contained files to the path ‘/’ and not to a new directory.   The ENTRYPOINT runs our script which is shown below.

cp /predictor.py .
export C_FORCE_ROOT='true'
export GOOGLE_APPLICATION_CREDENTIALS='/bookproject-key.json'
echo $C_FORCE_ROOT
celery worker -A predictor -b $1

Bash will run our script in a temp directory, so we need to copy our predictor.py file to that directory.  Because our bash is running as root, we need to convince Celery that it is o.k. to run that way, hence we export C_FORCE_ROOT as true.  Next, if we were using the Gcloud APIs we would need to export the application credentials.   Finally we invoke Celery, but this time we use the -b flag to indicate that we are going to provide the IP address of the RabbitMQ amqp broker as a parameter, and we remove it from the explicit reference in the predictor.py file.  When run, the predictor will look for all the model and configuration data in ‘/’.   We can now build the docker image with the command

>docker build -t "yourdockername/predictor" .

And we can test the container on our laptop with

>docker run -i -t "yourdockername/predictor" 'amqp://guest@rabbitserverIP'

Using “-i -t” allows you to see any error output from the container.   Once it seems to be working we can now push the image to the docker hub.   (to use our version directly, just pull dbgannon/predictor)

We can now return to our Google cloud shell and pull a version of the container there.   If we want to launch the predictor container on the cluster, we can do it one at a time with the “kubectl run” command.  However, Kubernetes has a better way to do this using a pod configuration file where we can specify the number of pod instances we want to create.  In the file below, which we will call predict-job.yaml, we specify a job name, the container image in the Docker hub, and the parameter to pass to the container’s shell script.   We also specify the number of pods to create.   In this case that is 6, as identified in the “parallelism” parameter.

apiVersion: batch/v1
kind: Job
metadata:
   name: predict-job
spec:
   parallelism: 6
   template:
      metadata:
         name: job-wq
      spec:
         containers:
         - name: c
           image: dbgannon/predictor
           args: ["amqp://guest@ipaddress_of_rabbitmq_server"]
         restartPolicy: OnFailure

One command in the cloud shell will now launch six pods each running our predictor container.

$kubectl create -f predict-job.yaml 

Some Basic Performance Observations.

When using a many-task system based on a distributed worker model there are always three primary questions about the performance of the system.

  1. What is the impact of wide-area distribution on the performance?
  2. How does performance scale with the number of worker containers that are deployed? More specifically, if we have N workers, how does the system speed up as N increases?    Is there a point of diminishing return?
  3. Is there a significant per/task overhead that the system imposes?  In other words, If the total workload is T and if it is possible to divide that workload into k tasks each  of size T/k , then what is the best value of k that will maximize performance?

Measuring the behavior of a Celery application as a function of the number of workers is complicated by a number of factors.   The first concern we had was the impact of widely distributing the computing resources on the overall performance.   Our message broker (RabbitMQ) was running in a virtual machine in Indiana on the JetStream cloud.   Our client was running a Jupyter notebook on a laptop, and the workers were primarily on the Midwest Google datacenter, with a few on other machines in the lab.   We compared this to a deployment where all the workers, the message broker and the client notebook were all running together on the Google datacenter.   Much to our surprise there was little difference in performance between the two deployments.   There are two ways to view this result.   One way is to say that the overhead of wide area distribution was not significant.   The other way to say this is that the overhead of wide area distribution was negligible compared to other performance problems.

A second factor that has an impact on performance as a function of the number of workers is the fact that a single Celery worker may have multiple threads that are responding to asynchronous function calls.   While we monitored the execution we noticed that the number of active threads in one worker could change over time. This made performance somewhat erratic.  Celery’s policy is that it will never have more threads than the number of available cores, so to limit the thread variability we ran workers in container pods on VMs with only one core.

Concerning the question of the granularity of the work partitioning, we configured the program so that a number of documents could be processed in one invocation and this number could be set remotely.    By taking a set of 1000 documents and a fixed set of workers, we divided the document set into blocks of size K, where K ranged from 1 to 100.   In general, larger blocks were better because the number of Celery invocations was smaller, but the difference was not great.   Another factor involved Celery’s scheduling for deciding which worker gets the next invocation.   For large blocks this was not the most efficient because it left holes in the execution schedule, where one worker was occasionally idle while another was over-scheduled.   For very small blocks these holes tended to be small.   We found that a value of K=2 gave reasonably consistent performance.
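On the client side the block-wise submission looks roughly like the sketch below, where predict_block is a hypothetical Celery task that accepts a list of K documents and returns a list of predictions (a small variation of the predict task above).

K = 2
blocks = [documents[i:i + K] for i in range(0, len(documents), K)]
async_results = [predict_block.apply_async([block]) for block in blocks]
predictions = [p for r in async_results for p in r.get()]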

Finally to test scalability we used three different programs.

  1. The document topic predictor described above where each invocation classified two documents.
  2. A simple worker program that does no computation but just sleeps for 10 seconds before returning a “hello world” string.
  3. A worker that computes part of the Euler sequence sum(1/i**2, i=1..n) where n = 10^9.   Each worker computes a block of 10^7 terms of the sequence and the 100 partial results are added together to get the final result, which approximates pi^2/6 to about 7 decimal places.  (A sketch of this worker appears after the list.)
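The Euler worker, in sketch form, looks like the following; the broker address and the task name are placeholders, following the same pattern as predictor.py above.

from celery import Celery

app = Celery('euler', broker='amqp://guest@brokerIPaddr', backend='amqp')

@app.task
def partial_euler_sum(start, count):
    # Sum count terms of 1/i**2 beginning at index start.
    total = 0.0
    i = start
    while i < start + count:
        total += 1.0/(i*i)
        i += 1
    return total

# Client side: 100 blocks of 10**7 terms each, summed to approximate pi**2/6.
# results = [partial_euler_sum.apply_async((k*10**7 + 1, 10**7)) for k in range(100)]
# print sum(r.get() for r in results)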

The document predictor is very computationally intensive and uses some rather large data matrices for the trained machine learning models.   The size of these arrays is about 150 megabytes total.  While this does fit in memory, the computation is going to involve a great deal of processor cache flushing and there may be memory paging effects.   The example that computes the Euler sum requires no data other than the starting point index and the size of the block to sum.   It is pure computation and it will have no cache flushing or memory paging effects, but it will keep the CPU very busy.   The “sleep” example leaves the memory and the CPU completely idle.

We ran all three with one to seven workers.   (Six workers used the 6 cores from the small Google demo account and one ran on another remote machine.)    To compare the results, we computed the time for each program on one worker and plotted the speed-up ratio for 2, 3, 4, 5, 6 and 7 workers.   The results are shown in the graphs below.

predictor-euler-sleep

Figure 3.  Performance as speed-up for each of the three applications with up to 7 workers.

As can be seen, the sleeper scales linearly in the number of workers.   In fact, when executed on multi-core machines it is almost super-linear because of the extra threads that can be used.  (It is very easy for a large number of threads to sleep in parallel.)   On the other hand, the predictor and the Euler examples reached a maximum speed-up with around five workers.  This was a surprise, as we expected all three experiments to scale well beyond seven workers.  Adding more worker pods to the servers did not show improvement because these applications are already very compute intensive.   When looking for the cause of this limited performance, we considered the possibility that the RabbitMQ broker was a bottleneck, but our previous experience with it has allowed us to scale applications to dozens of concurrent readers and writers.   We are also convinced that the Google Container Engine performed extremely well and was not the source of any of these performance limitations. We suspect (but could not prove) that the Celery work distribution and result gathering mechanisms have overheads that limit scalability as the number of available workers grows.

Conclusion

Google has made it very easy to deploy containerized applications using Kubernetes on their cloud container service.  Kubernetes has some excellent architectural features that allow multiple containers to be co-located on a single server within a pod.   We did not have time here to demonstrate this, but their documentation gives some excellent examples.

Celery is an extremely elegant way to do remote procedure calls in Python.  One only needs to define the function and annotate it with a Celery object.   It can then be remotely invoked with an asynchronous call that returns control to the caller.  A future-like object is returned.  By calling a special method on the returned object the caller will pause until the remote call completes and the value is provided to the caller.

Our experiments demonstrated that Celery has limited scalability if it is used without modification and with the RabbitMQ message broker.   However, Celery has many parameters and it may be possible that the right combinations will improve our results.  We will report any improved results we discover in a later version of this document.

The State of the Cloud: Evolving to Support Deep Learning and Streaming Data Analytics and Some Research Challenges

(Note:  This is an updated version on 7/21/2016.   The change relates to containers and HPC and it is discussed in the  research topics at the end.)

I was recently invited to serve on a panel for the 2016 IEEE Cloud Conference.  As part of that panel I was asked to put together 15 minutes on the state of cloud technology and pose a few research challenges.   Several people asked me if I had published any of what I said so I decided to post my annotated notes from that mini-talk here. The slide deck that goes along with this can be found here.  There were three others on the panel who each made some excellent points and this document does not necessarily reflect their views.

Cloud computing has been with us for fifteen years now and Amazon’s Web Services have been around for ten.   The cloud was originally created to support on-line services such as email, search and e-commerce.  Those activities generated vast amounts of data and the task of turning this data into value for the user has stimulated a revolution in data analytics and machine learning.  The result of this revolution has been powerful and accurate spoken language recognition, near real-time natural language translation, image and scene recognition and the emergence of a first generation of cloud-based digital assistants and “smart” services.  I want to touch on several aspects of cloud evolution related to these exciting changes.

Cloud Architecture

Cloud architectures have been rapidly evolving to support these computational and data intensive tasks.   The cloud data centers of 2005 were built with racks of off-the-shelf servers and standard networking gear, but the demands of the new workloads described above are pushing the cloud architects to consider some radically different approaches.   The first changes were the introduction of software defined networks that greatly improved bisection bandwidth.   This also allowed the data center to be rapidly reconfigured and repartitioned to support customer needs as well as higher throughput for parallel computing loads.   Amazon was the first large public cloud vendor to introduce GPUs to better support high-end computation in the cloud and the other providers have followed suit. To accelerate the web search ranking process, Microsoft introduced FPGA accelerators and an overlay mesh-like network which adds an extra dimension of parallelism to large cloud applications.

The advent of truly large scale data collections made it possible to train very deep neural networks and all of the architectural advances described above have been essential for making progress in this area.   Training deep neural nets requires vast amounts of linear algebra, and highly parallel clusters with multiple GPUs per node have become critical enablers.  Azure now supports on-demand clusters of nodes with multiple GPUs and dedicated InfiniBand networks. The FPGAs introduced for accelerating search in the Microsoft data centers have also proved to be great accelerators for training convolutional neural networks.   GPUs are great for training deep networks but Nervana has designed a custom ASIC that they claim to be a better accelerator.   Even Cray is now testing the waters of deep learning.   To me, all of these advances in the architecture of cloud data centers point to a convergence with the trends in supercomputer design.  The future exascale machines that are being designed for scientific computing may have a lot in common with the future cloud data centers.   Who knows?  They may be the same.

Cloud System Software

The software architecture of the cloud has gone through a related evolution.  Along with software defined networking we are seeing the emergence of software defined storage.   We have seen dramatic diversification in the types of storage systems available for the application developer.  Storage models have evolved from simple blob stores like Amazon’s S3 to sophisticated distributed, replicated NoSQL stores designed for big data analytics such as Google’s BigTable and Amazon’s DynamoDB.

Processor virtualization has been synonymous with cloud computing.   While this is largely still true, container technology like Docker has taken on a significant role because of its advantages in terms of management and speed of deployment.  (It is worth noting that Google never used traditional virtualization in their data centers until their recent introduction of IaaS in GCloud.)   Containers are used as a foundation for microservices; a style of building large distributed cloud applications from small, independently deployable components.   Microservices provide a way to partition an application along deployment and language boundaries and they are well suited to Dev-Ops style application development.

Many of the largest applications running on the cloud by Microsoft, Amazon and Google are composed of hundreds to thousands of microservices.   The major challenges presented by these applications are management and scalability.    Data center operating system tools have evolved to coordinate, monitor and attend to the life-cycle management of many concurrently executing applications, each of which is composed of vast swarms of containerized microservices.  One such system is Mesos from Mesosphere.

Cloud Machine Learning Tools

The data analytics needed to create the smart services of the future depend upon a combination of statistical and machine learning tools.  Bayesian methods, random forests and others have been growing in popularity and are widely available in open source tools.  For a long time, neural networks were limited to three levels of depth because the training methods failed to show improvements for deeper networks.  But very large data collections and some interesting advances in training algorithms have made it possible to build very accurate networks with hundreds of layers.  However, the computation involved in training a deep network can be massive.   The kernels of the computation involve the dense linear algebra for which GPUs are ideally suited, and the type of parallelism in the emerging cloud architecture is well matched to this task.   We now have a growing list of open source machine learning toolkits that have been recently released from the cloud computing research community.   These include Google’s TensorFlow, AzureML, Microsoft Research’s Computational Network Tool Kit (CNTK), Amazon’s Deep Scalable Sparse Tensor Network Engine (DSSTNE), and Nervana’s NEON.    Of course the academic research community has also been extremely productive in this area.  Theano is an important Python toolkit that has been built with contributions from over a dozen universities and institutes.

cloud-ml-layers

Figure 1. cloud ML tools and services stack

Not every customer of cloud-based data analytics wants to build and train ML models from scratch.   Often the use cases for commercial customers are similar, hence another layer of services has emerged based on pre-trained models.   The use cases include image and language recognition, specialized search,  and voice-driven intelligent assistants.   As illustrated in Figure 1, these new services include Cortana (and MSR project Oxford components), Google ML, Amazon Alexa Skills Kit, IBM Watson Services and (using a different style cloud stack) Sentient Aware.

Streaming Data Analytics Services

There are several “exponentials” that are driving the growth of cloud platforms and services.   These include Big Data, mobile apps, and the Internet of Things.   The ability to analyze and act on data in motion is extremely important for application areas including urban informatics, environmental and ecological monitoring and recovery, analysis of data from scientific experiments, and web and data center log analysis.   The cloud providers and the open source research community have developed a host of new infrastructure tools that can be used to manage massive streams of data from remote sources.  These tools can be used to filter data streams, do on-line analysis and use the backend cloud machine learning services.  The tools include Spark Streaming, Amazon Kinesis, Twitter Heron, Apache Flink, Google Dataflow/Apache Beam and the Azure Event Hub and Data Lake.   A more detailed analysis of these tools can be found here.

A Few Research Challenges

As was evident at the IEEE cloud conference, there is no shortage of excellent research going on, but as promised here are a few topics I find interesting.

  1. Cloud Data Center Architecture.  If you are interested in architecture research, the Open Compute Project has a number of challenging projects that are being undertaken by groups of researchers.  It was founded by people from companies including Facebook, Intel, Google, Apple, Microsoft, Rackspace, Ericsson, Cisco, Juniper Networks and more, and these members have contributed open data center designs.   And it is open, so anybody can participate.
  2. Cloud & Supercomputer convergence.   As the sophistication of the cloud data centers approach that of the new and proposed supercomputers it is interesting to look at what architectural convergence might look like.  For example, which modes of cloud application design will translate to supercomputers?   Is it possible that the current microservice based approach to interactive cloud services could be of value to supercomputer centers?   Can we engineer nanosecond inter-container messaging? Can we do a decent job of massive batch scheduling on the cloud with the same parallel efficiency as current supercomputers?
    Update:  It seems that there is already some great progress on this topic.    The San Diego Supercomputer Center has just announced deployment of Singularity on two of their big machines.   Singularity is a special container platform from Gregory M. Kurtzer of LBNL.  There is a great article by Jeff Layton that gives a nice overview of Singularity.
  3. Porting Deep Learning to Supercomputers. There is currently serious interest in doing large scale data analytics on large supercomputers such as those at the national centers.  Some believe that better algorithms will be available with these advanced parallel machines.   Can we compile TensorFlow/CNTK/DSSTNE using MPI for exascale class machines?  In general, are there better ways to parallelize NN training algorithms for HPC platforms?
  4. The current open source stream analytics platforms described above are designed to handle massive streams of events that are each relatively small. However, many scientific event streams are narrower and have event objects that may be massive blobs.   What would it take to modify the open source streaming tools to be broadly applicable to these “big science” use cases?

I welcome feedback on any of the items discussed here.   Many of you know more about these topics than I, so let me know where you think I have incorrectly or overstated any point.

Observations About Streaming Data Analytics for Science

I recently had the pleasure of attending two excellent workshops on the topic of streaming data analytics and science.  A goal of the workshops was to understand the state of the art of “big data” streaming applications in scientific research and, if possible, identify common themes and challenges.  Called Stream2015 and Stream2016, these meetings were organized by Geoffrey Fox, Lavanya Ramakrishnan and Shantenu Jha.   The talks at the workshop were from an excellent collection of scientists from universities and the national labs and professional software engineers who are building cloud-scale streaming data tools for the Internet industry.

First it is important to understand what we mean by streaming data analytics and why it has become so important.   Most scientific data analysis involves “data at rest”: data that was generated by a physical experiment or simulation and saved in files in some storage system.   That data is then analyzed, visualized and summarized by various researchers over a period of time.   The sizes of scientific data archives are growing and the number of disciplines creating new ones is expanding.    New organizations like the Research Data Alliance have been created to help coordinate the development and sharing of scientific data collections.   However not all data is “at rest” in this sense.   Sometimes data takes the form of an unbounded stream of information.   For example, the continuous stream of live instrument data from on-line sensors or other “internet of things” (IoT) devices.  Even computer system logs can produce large continuous streams.  Other examples include data from continuously running experiments or automated observatories such as radio telescopes or the output of DNA sequencers.

In some cases, the volume and rate of generation is so large that we cannot keep the data at rest for very long.  The observed data from the Square Kilometer Array (SKA) will be so large that it is too expensive to contemplate keeping it, and therefore it must be immediately processed into a reduced stream.  An important aspect of this large scale streaming scientific data analysis is computational steering: the need for a human or a smart process to analyze the data stream for quality or relevance and then to make rapid adjustments to the source instruments or simulations. The report from the first Streams workshop describes many of these cases.  For example, autonomous vehicles processing radar data streams for oil and gas exploration or modern avionics systems that have to recognize bad data in real-time.  Data coming from superconducting tokamak experiments must be managed and analyzed in real-time to adjust the control settings and prevent catastrophic events.

This article has two parts.   In this part we will look at the issue of streaming data in science and then present some of the lessons I gathered from the workshops.  The workshop organizers have not released their final report for Stream2016, so their conclusions may be vastly different from my own.   In the second part we take a deep dive into the cloud centric data analytics tools to try to understand the landscape of ideas and approaches that have evolved in recent years in this community.

There are many factors that determine when a particular technology is appropriate for a particular problem.  Streaming data analytics is an interesting case that illustrates how diverse challenges and requirements have led software designers to build vastly different solutions.   For example, the software built to manage the vast Twitter data streams just can’t handle the analytic problems encountered when steering high-end electron microscopy experiments.  It is worth trying to understand why this is the case.

We can divide the spectrum of streaming data scenarios into three basic categories

  1. The data streaming challenges that confront large enterprises when dealing with the data from millions of users of Internet enabled devices.   These might be the “click-streams” from browsers to search engines where it is critical to understand user sentiment or where to place advertisements based on previous user queries.  The stream may be the vast logs of the behavior of systems with tens of thousands of active machines that need to be constantly monitored, scaled and serviced.  In these cases, the individual events that make up the stream are often very small records of a few bytes in length.
  2. Large scale environmental or urban sensor networks such as wide-area earthquake sensor networks, the NSF Ocean Observatories Initiative, or urban sensor networks such as those proposed in Chicago’s Array of Things project.  These are very heterogeneous collections of data streams that may involve instruments with very different stream characteristics.   For example, some small sensors may generate a high rate of small messages while others may generate large bunches of Mbyte-size messages in bursts, such as you would see from a UAV surfacing and uploading many records.  They may require intermediate analysis at various stages but final analysis “downstream”.   Another good example is the stream of data from a swarm of robots that must be analyzed to avoid collision (see the paper by He, Kamburugamuve and Fox which describes this real-time challenge in detail.)
  3. The streams generated by very large experimental facilities like the Large Hadron Collider, the Square Kilometer Array, the Advanced Photon Source and massive supercomputer simulations.  These large scale scientific experiments can be extremely complex and involve large numbers of instruments or HPC simulations, multiple data analysis steps and a distributed set of collaborators. Most of the data analysis in these experiments is not like the pure streaming data challenges we see in items 1 and 2.   The data streams are often extremely large file objects that must move through complex laboratory networks.   The orchestration of the streaming activity more accurately resembles workflow than data flow, and often that workflow must allow a human in the loop.

While it is tempting to think that one solution paradigm can cover all the bases, this may not be the case.  Case 1 has led to an explosion of creativity in the open source community and several very significant Apache projects.  These include Spark Streaming, which has been derived from the Spark parallel data analysis system,  Twitter’s Storm system, which has been redesigned by Twitter as Heron, Apache Flink from the German Stratosphere project, and Google’s Dataflow (also see this article), which is becoming Apache Beam and will run on top of Flink and Spark.  Other university projects include Neptune and the Granules project at Colorado State.   In addition to Google Cloud Dataflow other cloud providers include Amazon Kinesis,  Azure Streaming and IBM Stream Analytics.   (Are you confused yet?   In the second part of this report we will describe many of these in much greater detail.)

It turns out that many of the tools described above for case 1 also apply to case 2 under certain conditions.  The challenges arise in two areas.  If the real-time demands of the application require very low latencies, such as is required for various UAV challenges, some cloud solutions can be lacking.  However, Kamburugamuve, Ekanayake, Pathirage and Fox demonstrate that Storm’s basic communication mechanisms can be vastly improved using collective communication that exploits shared memory and optimized routing to meet the demands of the robot swarm example mentioned above.  The second challenge arises if the size of the individual events in the stream is large (greater than a megabyte), such as you may find in many instruments that deal with image or sound objects; in that case it may not work at all with many of the systems designed with case 1 in mind.  Algorithmic methods can be used to reduce the size so that approximate methods can be used to identify events for deeper off-line analysis.  In many of these instrument streaming cases it is necessary to do more processing of the stream “near the edge”.   In other words, many small data sources can be “pre-analyzed” by processors very near the source.   For example, the Apache Quarks edge-analytics tools are designed to run in very small systems such as the Raspberry Pi.

Case 3 presents the greatest departure from the emerging open source tools.  The ATLAS experiment on the Large Hadron Collider (LHC) has a large Monte Carlo simulation component and they have converted the processing of the data into a relatively fine-grained event stream, called the Atlas Event Service.  A distributed workload manager, PanDA manages a global queue of analysis tasks that can be executed on a variety of platforms including Amazon and, in a specialized form called Yoda, on HPC systems.

At the other end of the application spectrum, a massively parallel simulation model running on an exascale computer can generate vast amounts of data.   Every few simulated time steps the program may generate a very large (50GB or more) data structure distributed over a thousand parallel processing elements.  You can save a few of these to a file system, but it is now preferable to create a stream of these big objects and let another analysis system consume them directly.  The state-of-the-art HPC I/O library ADIOS provides a very simple, standard-looking API to the application programmer, but the back-end of ADIOS can be adapted to a variety of storage or networking layers while taking full advantage of the parallel I/O capabilities of the host system.   One such back-end is facilitated by a networking layer, EVPath, that provides the flow and control needed to handle such a massive stream.   Another backend target for ADIOS is DataSpaces, a system for creating shared data structures between applications across distributed systems.  DataSpaces accomplishes this by mapping n-dimensional array objects to one dimension using a distributed hash table and Hilbert space filling curves.   Together these provide a variety of streaming abstractions to allow data to move from one HPC application to a variety of HPC data analysis and visualization tools as illustrated in Figure 1.

adios

Figure 1.  From “Stream Processing for Remote Collaborative Data Analysis” by Klasky, Chang, Choi, Churchill, Kurc, Parashar, Sim, Wolf and Wu.  ORNL, PPPL, Rutgers, GT, SBU, UTK, LBNL.  White paper Stream2016 workshop.  

At the Streams 2016 workshop Kerstin Kleese Van Dam made the important observation that the workflow systems managing the stream analytics of time-critical experiments can be complex, and that the success of the experiment depends upon reliable performance of the overall system. The use case she described is “in operando” catalysis experiments.   More specifically, this involves the steering of high end electron microscopy experiments where a beam of electrons is transmitted through an ultra-thin specimen, interacting with the specimen as it passes through. These experiments can generate atomic resolution diffraction patterns, images and spectra under wide ranging environmental conditions. They are in-situ observations, where physical, chemical or biological processes and phenomena are observed as they evolve.  The experiments generate from 10GB to tens of TB of data (e.g. at BNL) at rates ranging from 100 images/sec for basic instruments to 1600 images/sec for state of the art systems. To optimize the scientific outcome of such experiments it is essential to analyze and interpret the results as they are emerging.  It is essential that the workflow system reliably deliver optimal performance, especially in situations where time-critical decisions must be made or computing resources are limited.

The current systems in use include the Analysis in Motion framework developed by PNNL, but the challenge that is presented here is to enact the workflow in a way that yields reliable performance. The workflows are frequently composite applications built from loosely coupled parts, running on a loosely connected set of distributed and heterogeneous computational resources. Each workflow task may be designed for a different programming model and implemented in a different language, and most communicate via files sent over general purpose networks.  This research group currently has a DOE project to demonstrate “Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD)”.

Concluding Observations. 

The streaming data landscape is very new and evolving fast.  I have come to the conclusion that of the three application domains described above (1: Internet Data Analysis, 2: Array of Things Instruments, 3: Big Science) only 1 and 2 are starting to see convergence of lines of thought and models of computing while 3 will always be driven by very different requirements and needs.   The bleeding edge of science does not have the deep pockets of Google or Amazon when it comes to IT resources.   Their budgets are dominated by the massive experimental facilities and supercomputers and hence the solutions must be custom.  And each experimental domain is unique enough that few common tools beyond MPI exist. On the other hand, one can argue that Twitter, Google and all of the various Apache projects discussed here are also custom built for the problems they each are trying to solve.   This is a world of “bespoke” software systems.

Algorithms and Analysis

An area where there will be great opportunity for sharing is in the algorithmic techniques that will be used to analyze the data.  The Streams 2015 report observed that a variety of compelling research topics have emerged, including adaptive sampling, online clustering, sketching and approximation, that trade space and time complexity for accuracy.   Sketching reduces an element of the stream to a basic form that allows easy generation of approximate answers to important queries.  There are many forms of sketching. Szalay described an elegant way to do principal component analysis (PCA) in a streaming context.  This provides a way to reduce the spectral complexity of a stream of big events.  Machine learning classifiers can be, and are, used as part of stream analytics across application domains as diverse as tweet analysis and medical imaging.      With the growing capabilities of deep learning systems, more data, images and sounds can be analyzed and recognized in near real-time.   Skype can do near-real time natural language translation and face recognition from video streams.   Applying the same technology to sifting through streams of instrument data will lead to new tools to understand earthquakes, hurricanes and tornadoes. We anticipate a lot of great work emerging from this area.
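To make the idea of sketching concrete, the fragment below is a minimal count-min sketch, one of the simplest sketch data structures.  It is offered only as an illustration of the general idea of trading accuracy for space; it is not one of the specific techniques discussed in the workshop reports.

import random

class CountMinSketch:
    def __init__(self, width=1000, depth=5, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        # One random hash seed per row of the table.
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]

    def add(self, item):
        for row, s in zip(self.table, self.seeds):
            row[hash((s, item)) % self.width] += 1

    def estimate(self, item):
        # Collisions only inflate counts, so the minimum over rows is the best estimate.
        return min(row[hash((s, item)) % self.width]
                   for row, s in zip(self.table, self.seeds))

# Feed a stream of events and query approximate frequencies in constant space.
# cms = CountMinSketch()
# for event in event_stream: cms.add(event)
# print cms.estimate('sensor-42')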