
Raspberry meets Project Oxford: IoT & Judging a Book by Its Cover.

When I was (very) young I enjoyed building electronic devices from kits and from wiring diagrams I found in exciting journals like “Popular Electronics”.  Radio Shack was my friend and Heathkits were a joy when I could afford them.  Today I marvel at the devices kids can tinker with.  They can build Arduino-powered Lego robots that are light-years beyond my childhood Lincoln Logs and Erector Sets.  And they have access to another relatively recent, world-changing technology: they can create software that brings their creations to life.  They can add sensors (accelerometers, GPS, altimeters, thermometers, gyroscopes, vibration, humidity and more) to their projects as well as actuators and motors of various types.  And these creations can live on the Internet.  Devices like the Particle (formerly Spark) Photon and Electron include WiFi or cellular links so instruments can send data to the cloud.  They are programmed with the same C-like language used by the Arduino.  Photon devices are extremely small and very inexpensive, so they can be deployed in very large numbers.

Indeed, these devices are the atomic particles of the Internet of Things (IoT).  They are being turned into modest home safety sensors as well as sophisticated instruments that control highly sensitive scientific experiments.  They will be the tools city planners use to understand our complex urban environments.  For example, the Array of Things project at the University of Chicago and Argonne National Laboratory is deploying an amazing small sensor pack that will sit on light poles throughout the city.  (Check out this video.)  Charlie Catlett and Pete Beckman are leading the AoT project.  In discussions with Charlie he told me that he has built a full array of sensors in his home to monitor various devices.  He also pointed me to the MathWorks ThingSpeak product, which provides a lovely Matlab-based cloud platform for integrating and analyzing the streams of data from your instruments.

In a previous post I described ways in which event data could be streamed to the cloud, and I will return to that topic again in the future.  In this article I wanted to look at what could be done with a small device with a video camera when it is combined with analysis in the cloud.  I chose the Raspberry Pi 2 and the new Microsoft Research Project Oxford cloud services.

Phone Apps and Project Oxford

While the instruments described above will eventually dominate the Internet of Things, the billions of smart phones out there today give us a pretty good idea of the challenge of managing data from sensors.  Almost every truly useful “app” on a smart phone interacts with services in the cloud in order to function properly.  Each of these apps does part of its computation locally and part remotely.  Cortana and Siri can do the speech-to-text translation by invoking a speech model on the phone, but answering the query involves another set of models running in the cloud.  For any interesting computational task, including the tasks assigned to the Array of Things sensors, some computation must be done locally and some must be remote.  In some cases the local computation is needed to reduce the amount of time spent communicating with the back-end services, and in other cases, for example for privacy reasons, not all locally collected information should be transmitted to the cloud.  (What happens to my phone in Vegas should stay in Vegas.)  Indeed, according to Catlett and Beckman, protecting privacy is a prime directive of the Array of Things project.

Companies like Amazon, Google, IBM, Microsoft and Salesforce have been working hard to roll out data analytics services hosted on their cloud platforms.  More recently these and other companies have begun providing machine learning services in the cloud.  I have already talked about AzureML, a tool that allows you to build very powerful analysis tools, but for people who want to build smart apps there are now some very specialized services that do not require them to be ML domain experts.  A good example is IBM’s Watson services, which are the result of IBM’s work to bring its great Jeopardy-playing capability to a much broader class of applications.  Watson has a great collection of language and speech analysis and translation services.  In addition, it has an impressive collection of image analysis tools.

Project Oxford is another interesting collection of cloud services that cover various topics in computer vision, speech and language.  In the area of speech, they provide the APIs needed for speech-to-text and text-to-speech.  We are all familiar with the great strides taken in speech recognition with Siri, Cortana, Echo and others.  With these APIs one can build apps for iOS, Windows and Android that use speech recognition or speech generation.  The language tools include spell checkers and a nifty language understanding tool that you can train to recognize intent and action from utterances such as “set an alarm for 1:00 pm.”  The computer vision capabilities include scene analysis, face recognition and optical character recognition (OCR).  These are the services I will explore below.

Raspberry Pi2 and the OpenCV computer vision package.

The Raspberry Pi 2 is a very simple, credit card sized, $35 single board computer with a very versatile collection of interfaces.  It has a Broadcom VideoCore GPU, a quad-core ARMv7 processor and 1 GB of memory.  For a few extra dollars you can attach a five-megapixel camera.  (In other words, it is almost as powerful as a cellphone.)  The Pi 2 can run Windows 10 IoT Core or a distribution of Linux.  I installed Linux because it was dead easy.  To experiment with the camera, I needed the OpenCV computer vision tools.  Fortunately, Adrian Rosebrock has documented the complete installation process in remarkable detail.  (His book also provides many useful coding examples that I used to build my experiments.)

Object Tracking

One of the most obvious sensing challenges one can tackle with a small internet-connected device with a camera is object tracking.  With OpenCV there are some very sophisticated ways to do this, but for some tasks it is trivial.  Outside my window there is a small harbor and I decided to track the movements of the boats.  This was easy because they are all pleasure boats, so the vast majority of them are white.  By filtering the image to bring out white objects on a black background and using the OpenCV functions “findContours” and “minAreaRect”, it takes only a few lines of code to draw bounding boxes around suspected boats.  (Full source code for the examples here is in GitHub.)  With the Pi device and camera sitting near the window, I was able to capture the scene below in Figure 1.  Note that it took some scene-specific editing; specifically, it was necessary to ignore contours that were in the sky.  As can be seen, some objects that were not boats were also selected.  The next step is to filter based on the motion of the rectangles: those are the rectangles worth tracking.  Unfortunately, it being winter, these pleasure craft haven’t moved for a while, so I don’t have a video to show you.
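The core of the detection step, stripped of the scene-specific tweaks, looks roughly like the sketch below (the threshold value, the sky cut-off row and the minimum area are illustrative guesses, not the values used for Figure 1; note also that findContours and boxPoints differ slightly between OpenCV versions):

import cv2
import numpy as np

# Read a frame captured by the Pi camera (file name is illustrative).
frame = cv2.imread("harbor.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Threshold to bring out white objects (boats) on a black background.
_, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)

# Find the contours of the white blobs (OpenCV 2/3/4 return values differ).
found = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = found[0] if len(found) == 2 else found[1]

SKY_LINE = 150     # ignore contours above this row (scene-specific guess)
for c in contours:
    rect = cv2.minAreaRect(c)           # rotated bounding rectangle
    (cx, cy), (w, h), angle = rect
    if cy < SKY_LINE or w * h < 100:    # skip the sky and tiny blobs
        continue
    box = cv2.boxPoints(rect).astype(np.int32)
    cv2.drawContours(frame, [box], 0, (0, 255, 0), 2)

cv2.imwrite("harbor_boxes.jpg", frame)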


Figure 1.   Using OpenCV to find boats in a harbor

Face Recognition

By face recognition we mean identifying those elements in a picture that are human faces. This is a far lower bar than face identification (who is that person?), but it is still an interesting problem and potentially very useful.  For example, this could be used to approximate the number of people in a scene if you know what fraction of them are facing the camera.   Such an approximation may be of value if you only wanted to know the density of pedestrian traffic.  You could avoid privacy concerns because you would not need to save the captured images to non-volatile storage. It is possible to use OpenCV alone to recognize faces in a picture and never send the image outside the device.

If you do want to know more about a group of people, Project Oxford can do much better.  Given an image of a group it can identify the faces looking in the general direction of the camera, and for each face it can give a number of interesting features such as gender, estimated age and even a “smile” index.  This information may be of interest to a business, such as a movie theater owner who wants to know the gender makeup and average age of the patrons.  It could even tell, by counting smiles, whether they enjoyed the movie.  I used Project Oxford to do an analysis of two photos.  The first is our class of amazing students and staff from the 2014 MSR summer school in Moscow.  The vision service also returned a rectangle enclosing each face.  I used OpenCV to draw each rectangle with a pink border for males and a green border for females if they were smiling.  If they were not smiling, they got a black rectangle.  As can be seen, the reporting was very accurate for this staged group photo.
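The drawing step is straightforward once the face service has returned its JSON.  The sketch below assumes the response has already been parsed into a list like the faces variable shown; the exact field names and the smile threshold are my assumptions and should be checked against the Project Oxford face API documentation.

import cv2

img = cv2.imread("group_photo.jpg")

# Parsed response from the Project Oxford face service (structure assumed).
faces = [{"faceRectangle": {"left": 40, "top": 60, "width": 90, "height": 90},
          "attributes": {"gender": "male", "smile": 0.92, "age": 24.0}}]

for face in faces:
    r = face["faceRectangle"]
    attrs = face["attributes"]
    if attrs["smile"] < 0.5:             # not smiling -> black box
        color = (0, 0, 0)
    elif attrs["gender"] == "male":      # smiling male -> pink box (BGR)
        color = (203, 192, 255)
    else:                                # smiling female -> green box
        color = (0, 255, 0)
    cv2.rectangle(img, (r["left"], r["top"]),
                  (r["left"] + r["width"], r["top"] + r["height"]), color, 3)

cv2.imwrite("group_photo_out.jpg", img)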


Figure 2.   Project Oxford Face Recognition of MSR Summer School 2014 group photo taken at Yandex headquarters Moscow.  Pink = male, green = female, black = no smile

The system also allowed us to compute the estimated average age: approximately 21 for females and 24 for males.  (Some “senior” professors in the photo skewed the average for the males.)  Figure 3 below shows the result for a stock street scene.  Very few smiling faces, but that is not a surprise.  It also illustrates that the system is not very good at faces in partial or full profile.


Figure 3.   Many faces missing in this crowd, but one is smiling.

Text Recognition

Optical Character Recognition (OCR) has been around for use in document scanners for a long time, but it is much harder to read text when it is embedded in a photo with other objects around it.  For example, programming the Pi with a call to Oxford OCR and pointing the camera at the image in Figure 4 produced the output in the superimposed rectangle (superimposition done with PowerPoint … not OpenCV).


Figure 4.   Oxford OCR test from Pi with command line output in black rectangle.

As can be seen, this was an easy case and it made only one error.  However, it is not hard to push the OCR system beyond its limits with distance, lighting and the angle of the image.
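For reference, the call from the Pi to the Oxford OCR service amounts to an HTTP POST of the captured image.  The sketch below is only an outline: the endpoint URL, the request parameters and the layout of the JSON response are my recollection of the Project Oxford Vision API and should be checked against its documentation.

import requests   # assumes the requests package is installed on the Pi

OCR_URL = "https://api.projectoxford.ai/vision/v1/ocr"   # assumed endpoint
KEY = "your-project-oxford-key"

with open("capture.jpg", "rb") as f:
    image_bytes = f.read()

headers = {"Ocp-Apim-Subscription-Key": KEY,
           "Content-Type": "application/octet-stream"}
resp = requests.post(OCR_URL, headers=headers,
                     params={"language": "en"}, data=image_bytes)
result = resp.json()

# The response groups text into regions -> lines -> words (structure assumed).
words = []
for region in result.get("regions", []):
    for line in region.get("lines", []):
        for word in line.get("words", []):
            words.append(word["text"])
print(" ".join(words))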

To demonstrate the power of doing some computation on the local device and some of it remotely in the cloud, we can take a scene that is too hard for Oxford to understand directly and do some preprocessing with OpenCV.  In Figure 5 we have taken a scene with the same piece of paper with text and placed it far from the camera and at a funny angle.  This yielded no result with Oxford OCR directly.  But in this case we used the same filtering technique used in the boat tracking experiment and located the rectangle containing the paper.  Using another OpenCV transformation we transformed that rectangle into a 500 by 300 image with the rotation removed (as shown in the insert).  We sent that transformed image to Oxford and got a partial result (shown in the black rectangle).
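The rectification step is a standard OpenCV perspective warp.  A minimal sketch, reusing the white-object filtering from the boat example and assuming the largest white contour is the sheet of paper (a real version would also order the four corners consistently before the warp):

import cv2
import numpy as np

frame = cv2.imread("scene.jpg")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)

# Assume the largest white contour is the sheet of paper.
found = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contours = found[0] if len(found) == 2 else found[1]
paper = max(contours, key=cv2.contourArea)

rect = cv2.minAreaRect(paper)
corners = cv2.boxPoints(rect).astype(np.float32)   # 4 corners of the paper

# Map the paper onto a flat 500 x 300 image with the rotation removed.
dst = np.array([[0, 0], [500, 0], [500, 300], [0, 300]], dtype=np.float32)
M = cv2.getPerspectiveTransform(corners, dst)
flat = cv2.warpPerspective(frame, M, (500, 300))
cv2.imwrite("flat_text.jpg", flat)    # this image is then sent to Oxford OCR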


Figure 5.  Transformed image and output showing the coordinates of the bounding rectangle and output from Oxford OCR.   The green outline in the picture was inserted by OpenCV drawing functions using the bounding rectangle.

This experiment is obviously contrived.  How often do you want to read text at a distance printed on a white rectangle?  Let’s look at one final example below.

Judging a Book by Its Cover

An amusing app that can use the Oxford OCR is one that finds information about a book based on an image of its cover.  Of course, this is not a new idea.  Amazon has an app called “Flow” from its A9 Innovations subsidiary.  More on that later.  What I did was to integrate Oxford OCR with a call to Bing.  It works as follows.  The Pi is shown the image of a book, and the OCR app looks for text on the image such as the title or author.  That returned string is then sent to Bing via a web service call.  The top five results are put into an HTML document along with the image and served up by a tiny webserver on the Pi.  The first results are shown below.
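Stripped of detail, the pipeline is three steps: OCR the cover, search Bing with whatever text comes back, and format the top hits as HTML.  The glue code below is a hypothetical sketch: ocr_text() is the OCR call shown earlier wrapped in a function, bing_search() is a wrapper around the Bing Search REST API (whose endpoint and response format depend on the API version), and the Python 2 standard-library web server plays the role of the tiny webserver on the Pi.

# Hypothetical glue for the book-cover reader; ocr_text() and bing_search()
# are assumed helper functions, not part of any SDK.
import BaseHTTPServer    # Python 2 standard library

def build_page(image_file):
    title_guess = ocr_text(image_file)        # e.g. "THE FABRIC I scientist ..."
    hits = bing_search(title_guess)[:5]       # top five Bing results
    rows = "".join('<li><a href="%s">%s</a></li>' % (h["url"], h["name"])
                   for h in hits)
    return ("<html><body><img src='%s' width='300'/><p>OCR text: %s</p>"
            "<ol>%s</ol></body></html>") % (image_file, title_guess, rows)

class Handler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(build_page("bookcover.jpg"))

BaseHTTPServer.HTTPServer(("", 8080), Handler).serve_forever()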


Figure 6.   Output of the book cover reader app webserver on the PI.  The first result is correct.

The interesting thing about this is that Bing (and I am sure Google as well) is powerful enough to take partial results from the OCR read and get the right book on the top hit.  The correct text is

“The Fabric of Reality.  A leading scientist interweaves evolution, theoretical Physics and computer science to offer a new understanding of reality.”  (Font changes are reproduced as shown on the book cover.)

What the OCR system was able to see was “THE FABRIC I scientist evolution, physics. Computer science a new understanding.”   The author’s name is occluded by a keyboard cable.

Unfortunately, the reliability at a distance was not great.  I next decided to try to enhance the image using the “text on white rectangle” technique illustrated in Figure 5.  Of course this needs a white book, so it is not really practical.  However, it did improve the accuracy (for white books).  An example that the OCR-only version failed to recognize but that worked using the OpenCV-enhanced version is shown in Figure 7.


Figure 7.   Using OpenCV with Oxford and Bing.

As can be seen, the OCR saw “the black sivan nassim aicholas tal”, which is not very accurate.  Even with the lower right corner of the book out of the image, I would have expected it to do a bit better.  But Bing was easily able to figure it out.

I downloaded the Flow app and installed it on a little Android tablet.  It works very well up close, but it could not handle the distance shots illustrated above.  Of course this is not a proper scientific comparison.  Furthermore, I have only illustrated a few trivial features of OpenCV and I have not touched the state-of-the-art object recognition work.  Current object recognition systems based on deep learning are amazingly accurate.  For example, MSR’s Project Adam was used to identify thousands of different types of objects in images (including hundreds of breeds of dogs).  I expect that Amazon has images of tens of thousands of book covers.  A great way to build the app above would be to train a deep network to recognize those objects in an image.  I suspect that Flow may have done something like this.

Final Thoughts

The examples above are all trivial in scope, but they are intended to illustrate what one can do with a very simple little device, a few weeks of programming and access to tools like Project Oxford.  We started with a look at IoT technologies and moved on to simple “apps” that use computer vision.  This leads us to think about augmented reality, where our devices are able to tell us about everything and everyone around us.  The “everyone” part of this is where we have the most difficulty.  I once worked on a project called the “intelligent memory assistant”.  Inspired by a comment from Dave Patterson, the app would use a camera and the cloud to fill in the gaps in your memory, so that when you met a person but could not remember their name or where they were from, it would whisper into your ear and tell you “This is X and you knew him in 1990 …. “.  It is now possible to build this, but the privacy issues raise too many problems.  This is sometimes referred to as the “creepy factor”.  For example, people don’t always want to be “tagged” in a photo on Facebook.  On the other hand, uses like identifying lost children or amnesiacs are not bad.  And the non-personal components of augmented reality are coming fast and will be here to stay.

PS: Code for these examples is now in this GitHub repo.

An Encounter with Google’s TensorFlow

NOTE: This is a revised version of this blog that reflects much better ways to do some of the tensor algebra in the first example below.

Google has recently released some very interesting new tools to the open source community.  First came Kubernetes, their container microservice framework, and that was followed by two new programming systems based on dataflow concepts.  Dataflow is a very old idea that first appeared in the computer architecture community in the 1970s and 80s.  Dataflow was created as an alternative to the classical von Neumann computer design.  It was hoped that it would have a performance advantage because it would exploit much greater levels of parallelism than was thought possible with classical computers[1].  In dataflow systems computation can be visualized as a directed graph where the vertices of the graph are operators and data “flows” through the system along the edges of the graph.  As soon as data is available on all the input edges of an operator node, the operation is carried out and new data is put on the output edges of the node.  While only a few actual dataflow computers were built, the concept has been fundamental to distributed and parallel computing for a long time.  It shows up in applications like complex event processing, stream analytics and systems like the Microsoft AzureML programming model I described earlier.

Google’s newly released Cloud Dataflow is a programming system for scalable stream and batch computing.  I will return to Cloud Dataflow in another post, but for now, I will focus on the other dataflow system they released.  Called TensorFlow, it is a programming system designed to help researchers build deep neural networks[2] and it appears to be unrelated to Cloud Dataflow.  All the documentation and downloadable code for TensorFlow are on-line.  TensorFlow is also similar to Microsoft Research’s Computational Network Toolkit (CNTK) in several ways that I will describe later.

TensorFlow is designed to allow the programmer to easily “script” a dataflow computation where the basic units of computing are very large multi-dimensional arrays.  The scripts are written in Python or C++, and TensorFlow works very well with IPython/Jupyter notebooks.  In the following pages I will give a very light introduction to TensorFlow programming and illustrate it by building a bare-bones k-means clustering algorithm.  I will also briefly describe one of their examples of a convolutional neural network.

TensorFlow can be installed and run on your laptop, but as we shall see below, it really shines on bigger, more powerful hardware.  An interesting thing happened in the recent evolution of deep neural networks.  The most impressive early work on really large deep neural networks was done on large cloud-scale clusters.  However, the type of parallelism needed for deep network computation is really very large array math, which is better suited to execution on a GPU, or a bunch of GPUs on a massive-memory machine, or a cluster of massive-memory machines with a bunch of GPUs each.  For example, the Microsoft CNTK has achieved some remarkable results on 8-GPU systems and it will soon be available on the Azure GPU Lab.  (I also suspect that supercomputers such as the SDSC Comet with large-memory, multi-GPU, multi-core nodes would be ideal.)

TensorFlow: a shallow introduction.

There is a really nice white paper by the folks at Google Research with far more details about the TensorFlow system architecture than I give here.  What follows is a shallow introduction and an experiment.

There are two main concepts in TensorFlow.  The first is the idea of computing on objects that are very large multi-dimensional arrays called tensors.  The second is that the computations you build with tensors are compiled into graphs that are executed in a “dataflow” style.  We need to unpack both of these concepts.

Tensors

Let’s start with tensors.  First, these are not your great-great-grandfather’s tensors.  Those tensors have their roots in the differential geometry of Gauss and Riemann, where they were needed to provide metrics that could be used to describe things like the curvature of surfaces.  Einstein “popularized” tensors in his general theory of relativity.  Those tensors have a very formal algebra of covariant and contravariant forms.  Fortunately, we don’t have to go there to understand the use of tensors in TensorFlow’s machine learning applications.  In fact, if you understand Numpy arrays you are off to a good start, and Numpy arrays are just really efficient multidimensional arrays.  TensorFlow can be programmed in Python or C++; I will use the Python binding in the discussion below.

In TensorFlow tensors are created and stored in container objects that are one of three types: variables, placeholders and constants.   Let’s start with constants.  As the name implies constant tensors are initialized once and never changed.   Here are two different ways to create constant arrays that we will use in an example below.  One is a tensor of size Nx2 filled with zeros and the other is a tensor of the same shape filled with the value 1.

import numpy as np
import tensorflow as tf
N = 10000
X = np.zeros(shape=(N,2))
Xv = tf.constant(X, name="points")
dones = tf.fill([N,2], np.float64(1.))

We have used a Numpy array of values to create the TensorFlow constant.  Notice that we have given our tensor a “name”.  Naming tensors is optional but it comes in very handy when debugging.

Variables are holders of multidimensional arrays that persist across sessions and may be modified and even saved to disk.  The concept of a “session” is important in TensorFlow.  You can think of it as a context where actual TensorFlow calculations take place.  The first calculation involving a variable is to load it with initial values.  For example, let’s create a 2 by 3 tensor that, when initialized, contains the constant 1.0 in every element.  Then let’s convert that back to a Numpy array and print it.  To do that we need a session, and we will call the variable initializer in that session.  When working in the IPython notebook it is easiest to use an “InteractiveSession” so that we can easily edit and redo computations.

sess = tf.InteractiveSession()
myarray = tf.Variable(tf.constant(1.0, shape = [2,3]))
init =tf.initialize_all_variables()
sess.run(init)
mynumpy =myarray.eval()
print(mynumpy)
[[ 1.  1.  1.]
 [ 1.  1.  1.]]

As shown above, the standard way to initialize variables is to initialize them all at once. The process of converting a tensor back to a Numpy array requires evaluating the tensor with the “eval” method. As we shall see this is an important operator that we will describe in more detail below.

The final tensor container is a “placeholder”.   Creating a placeholder object only requires specifying a type and some information about its shape.   We don’t initialize a placeholder like a variable because its initial values will be provided later in an eval()-like operation.  Here is a placeholder we will use later.

x = tf.placeholder(tf.float32, [None, 784], name="x")

Notice that in this case the placeholder x is two dimensional but the first dimension is left unbound.  This allows us to supply it with a value that has any number of rows of vectors of length 784.  As we shall see, this turns out to be very handy for training neural nets.
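To make the unbound dimension concrete, here is a tiny standalone sketch (not part of the example code) that feeds batches of two different sizes through the same placeholder:

import numpy as np
import tensorflow as tf

sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32, [None, 784], name="x")
row_sums = tf.reduce_sum(x, 1)    # one value per row, however many rows we feed

# The same graph accepts a batch of 5 rows or a batch of 100 rows.
print(row_sums.eval(feed_dict={x: np.ones((5, 784), dtype=np.float32)}).shape)
print(row_sums.eval(feed_dict={x: np.ones((100, 784), dtype=np.float32)}).shape)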

Dataflow

The real magic of TensorFlow is in the dataflow execution of computations.  A critical idea behind TensorFlow is to keep the slow world of Python away from the fast world of parallel tensor algebra as much as possible.  Python is used only to describe the dataflow graph of the computation.  TensorFlow has a large library of very useful tensor operators.  Let’s describe a computation we will use in the k-means example.  Suppose we have 8 points in the plane in a vector called Xv and 4 special points in an array called kPoints.  I would like to label each of the 8 points with the index of the special point nearest to it.  This should give me 4 clusters of points.  I now want to find the centroid of each of these clusters.  Assume we have a tensor called “blocked” where row i gives the distance from each of our 8 points to the ith special point.  What we are looking for is the index of the smallest element in each column of “blocked”.
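The small table that originally illustrated “blocked” is easy to recreate with Numpy; the numbers below are made up just to show the shape of the computation:

import numpy as np

# blocked[i, j] = distance from point j to special point i  (4 x 8, made-up values)
blocked = np.array([[1.0, 5.0, 4.0, 9.0, 2.0, 6.0, 7.0, 3.0],
                    [8.0, 0.5, 6.0, 2.0, 9.0, 1.0, 4.0, 7.0],
                    [3.0, 7.0, 0.2, 8.0, 6.0, 9.0, 1.5, 5.0],
                    [9.0, 4.0, 8.0, 0.7, 3.0, 2.5, 6.0, 0.4]])

# The index of the smallest element in each column is the nearest special
# point for each of the 8 points; tf.argmin(blocked, 0) does the same thing.
print(np.argmin(blocked, axis=0))    # [0 1 2 3 0 1 2 3]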

To find the centroids I use this min-index vector to select the elements of Xv in each cluster.  We just add them up and divide by the number of points in that cluster.  The TensorFlow code to do this is shown below.  TensorFlow has a large and handy library of tensor operators, many of which are similar to their Numpy counterparts.  For example, argmin computes the index of the smallest element in an array given the dimension over which to search.  Unsorted_segment_sum computes a sum using another vector to define the segments.  The min-index vector nicely partitions the Xv vector into the appropriate “segments” for the unsorted_segment_sum to work.  The same unsorted_segment_sum operator can be used to count the number of elements in each cluster by adding up 1s.

mins = tf.argmin(blocked, 0, name="mins")
sums = tf.unsorted_segment_sum(Xv, mins, 4)
totals = tf.unsorted_segment_sum(dones, mins, 4, name="sums")
centroids = tf.div(sums, totals, name = "newcents")

The key idea is this.  When this Python code is executed it builds a dataflow graph with three inputs (blocked, Xv, dones) and outputs a tensor called centroids, as shown below[3][4].

The computation does not start until we attempt to evaluate the result with a call to centroids.eval().  This sort of lazy evaluation means that we can string together as many tensor operators as needed; they will all be executed outside the Python REPL.


Figure 1.  Computational flow graph

TensorFlow has a very sophisticated implementation.  The computing environment upon which it runs is a collection of computational devices.  These devices may be CPU cores, GPUs or other processing engines.  The TensorFlow compilation and runtime system maps subgraphs of the flow graph to these devices and manages the communication of tensor values between the devices.  The communication may be through shared memory buffers or send-receive message pairs inserted in the graph at strategically located points.  Just as the dataflow computer architects learned, dataflow by itself is not always efficient.  Sometimes control flow is needed to simplify execution.  The TensorFlow system also inserts needed control-flow edges as part of its optimizations.  As was noted above, this dataflow style of computation is also at the heart of CNTK.  Another very important property of these dataflow graphs is that it is relatively easy to automatically compute derivatives by applying the chain rule while working backwards through the graph.  This makes it possible to automate the construction of gradients that are important for finding optimal parameters for learning algorithms.  I won’t get into this here, but it is very important.

A TensorFlow K-means clustering algorithm

With the kernel above we now have the basis for the simple k-means clustering algorithm shown in the code below.  Our initial centroid array is a 4 by 2 Numpy array we shall call kPoints.  What is missing is the construction of the distance array blocked.  Xv holds all N points as a constant tensor.  My original version of this program used Python code to construct blocked, but there is an excellent improvement on the computation of “mins” published on GitHub by Shawn Simister.  Shawn’s version is better documented and about 20% faster when N is in the millions.  Simister’s computation does not need my blocked array; instead it expands the Xv and centroids vectors and uses a reduction to get the distances.  Very nice.  (This is the revision to the blog post that I referred to above.)

N = 10000000
k = 4
#X is a numpy array initialized with N (x,y) points in the plane
Xv = tf.constant(X, name="points")
kPoints = [[2., 1.], [0., 2.], [-1., 0], [0, -1.]]
dones = tf.fill([N,2], np.float64(1.))

centroids = tf.Variable(kPoints, name="centroids")
oldcents = tf.Variable(kPoints)
initvals = tf.constant(kPoints, tf.float64)

for i in range(20):
    oldcents = centroids
    #This is the Simister mins computation
    expanded_vectors = tf.expand_dims(Xv, 0)
    expanded_centroids = tf.expand_dims(centroids, 1)
    distances = tf.reduce_sum( tf.square(
               tf.sub(expanded_vectors, expanded_centroids)), 2)
    mins = tf.argmin(distances, 0)
    #compute the new centroids as the mean of the points nearest
    sums = tf.unsorted_segment_sum(Xv, mins, k)
    totals = tf.unsorted_segment_sum(dones, mins, k, name="sums")
    centroids = tf.div(sums, totals, name = "newcents")
    #compute the distance the centroids have moved since the last iteration
    dist = centroids - oldcents
    sqdist = tf.reduce_mean(dist*dist, name="accuracy")
    print(np.sqrt(sqdist.eval()))
    kPoints = centroids.eval()

However, this version is still very inefficient.  Notice that we construct a new execution graph for each iteration.  A better solution is to pull the graph construction out of the loop, construct it once and reuse it.  This is nicely done in Simister’s version of k-means on his GitHub site.

It is worth asking how fast TensorFlow is compared with a standard Numpy version of the same algorithm.  Unfortunately, I do not have a big machine with fancy GPUs, but I do have a virtual machine in the cloud with 16 processors.  I wrote a version using Numpy and executed them both from an IPython notebook.  The speed-up results are in Figure 2.  Comparing these we see that simple Python Numpy is faster than TensorFlow for values of N less than about 20,000.  But for very large N we see that TensorFlow can make excellent use of the extra cores available and exploits the parallelism in the tensor operators very well.


Figure 2.  Speed-up of TensorFlow on 16 cores over a simple Numpy implementation.  The horizontal axis is log10(N).
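For reference, the Numpy-only version I compared against does the same per-iteration work as the TensorFlow graph.  A minimal sketch (not the exact notebook code) looks like this:

import numpy as np

def kmeans_step(X, centroids):
    # X: (N, 2) points; centroids: (k, 2) current centers.
    diffs = X[np.newaxis, :, :] - centroids[:, np.newaxis, :]   # (k, N, 2)
    distances = np.sum(diffs * diffs, axis=2)                   # (k, N)
    mins = np.argmin(distances, axis=0)        # nearest centroid for each point
    # New centroid = mean of the points assigned to it.
    return np.array([X[mins == i].mean(axis=0) for i in range(len(centroids))])

X = np.random.randn(10000, 2)
centroids = np.array([[2., 1.], [0., 2.], [-1., 0.], [0., -1.]])
for _ in range(20):
    centroids = kmeans_step(X, centroids)
print(centroids)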

I have placed the source for this notebook in Github. Also an improved version based on Simister’s formulation can be found there.

TensorFlow also has a very interesting web tool called TensorBoard that lets you look at various aspects of the logs of your execution.  One of the more interesting TensorBoard displays is the dataflow graph for your computation.  Figure 3 below shows the display generated for the k-means computational graph.


Figure 3.   TensorBoard graph display of the k-means algorithm.

As previously mentioned, this is far more complex than my diagram in Figure 1.  Without magnification it is almost impossible to read.  Figure 4 contains a close-up of the top of the graph where the variable newcents is created.  Because this is an iteration, the dataflow graph actually contains multiple instances of this computation, and clicking on the bubble for the variable will show you the details, as in the close-up.


Figure 4.   A close-up of part of the graph showing expanded view of the multiple instances of the computation of the variable newcents.

A brief look at a neural network example

The k-means example above is really just an exercise in tensor gymnastics.  TensorFlow is really about building and training neural networks.  The TensorFlow documents have a number of great examples, but to give you the basic idea I’ll give you a brief look at their convolutional deep net for the MNIST handwritten digit example.  I’ll try to keep this short because I will not be adding much new here.  The example is the well-known “hello world” of machine learning image recognition.  The setup is as follows.  You have thousands of 28 by 28 black and white images of handwritten digits, and the job of the system you build is to identify each.  The TensorFlow example uses a two-layer convolutional neural net followed by a large, densely connected layer and a readout layer.

If you are not familiar with convolutional neural nets, here is the basic idea.   Images are strings of bits but they also have a lot of local 2d structure such as edges or holes or other patterns.  What we are going to do is look at 5×5 windows to try to “find” these patterns.  To do this we will train the system to build a 5×5 template W array (and a scalar offset b) that will reduce each 5×5 window to a point in a new array conv by the formula

conv[i, j] = b + Σk=0..4 Σl=0..4 W[k, l] * image[i+k-2, j+l-2]

(the image is padded near the boundary points in the formula above so none of the indices are out of bounds)

We next modify the conv array by applying the “ReLU” function max(0, x) to each x in the conv array so it has no negative values.  The final step is “max pooling”.  This step simply computes the maximum value in each 2×2 block and assigns it to a smaller 14×14 array.  The most interesting part of the convolutional network is the fact that we do not use one 5×5 template but 32 of them in parallel, producing 32 14×14 result “images” as illustrated in Figure 5 below.

Figure 5.   The first layer convolutional network.

When the network is fully trained, each of the 32 5×5 templates in W is somehow different and each selects for a different set of features in the original image.  One can think of the resulting stack of 32 14×14 arrays (called h_pool1) as a type of transform of the original image, much as a Fourier transform can separate a signal in space and time and transform it into frequency space.  This is not what is going on here, but I find the analogy helpful.

We next apply a second convolutional layer to the h_pool1 tensor, but this time we apply 64 sets of 5×5 filters to each of the 32 h_pool1 layers (adding up the results) to give us 64 new 14×14 arrays, which we reduce with max pooling to 64 7×7 arrays called h_pool2.

Rather than provide the whole code, which is in the TensorFlow tutorial, I will skip the training step and show you how you can load the variables from a previous training session and use them to make predictions on the test set.  (The code below is a modified version of the Google code found at GitHub and subject to their Apache license.)

Let’s start by creating some placeholders and variables.  We start with a few functions to initialize weights.  Next we create a placeholder for our image variable x, which is assumed to be a list of 28×28 = 784-element floating point vectors.  As described above, we don’t know how long this list is in advance.  We also define all the weights and biases described above.

def weight_variable(shape, names):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial, name=names)

def bias_variable(shape, names):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial, name=names)

x = tf.placeholder(tf.float32, [None, 784], name="x")

sess = tf.InteractiveSession()

W_conv1 = weight_variable([5, 5, 1, 32], "wconv")
b_conv1 = bias_variable([32], "bconv")
W_conv2 = weight_variable([5, 5, 32, 64], "wconv2")
b_conv2 = bias_variable([64], "bconv2")
W_fc1 = weight_variable([7 * 7 * 64, 1024], "wfc1")
b_fc1 = bias_variable([1024], "bfcl")
W_fc2 = weight_variable([1024, 10], "wfc2")
b_fc2 = bias_variable([10], "bfc2")

Next we will do the initialization by loading all the weight and bias variables that were saved in the training step.  We previously saved these values to a temp file using TensorFlow’s saver.

saver = tf.train.Saver()
init =tf.initialize_all_variables()
sess.run(init)
saver.restore(sess, "/tmp/model.ckpt")

We can now construct our deep neural network with a few lines of code.  We start with two functions to give us a bit of shorthand for the 2D convolution and max-pooling operators.  We first reshape the image vectors into the image array shape that is needed.  The construction of the flow graph is now straightforward.  The final result is the tensor we have called y_conv.

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

#first convolutional layer
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
#second convolutional layer
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
#final layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
y_conv=tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2)

Notice that we have not evaluated the flow graph even though we have the values for all of the weight and bias variables already loaded.  We need a value for our placeholder x.  Suppose tim is an image from the test set.  To see if our trained network can recognize it, all we need to do is supply it to the network and evaluate.  To supply the value to the network we use the eval() function with a special “feed_dict” dictionary argument.  This just lists the name of the placeholder and the value you wish to give it.

tim.shape = ((1,28*28))
y = y_conv.eval(feed_dict={x: tim})
label = np.argmax(y)
print(label)

You can read the TensorFlow tutorial to learn about the network training step.   They use a special function that does the stochastic backpropagation that makes it all look very simple.

How does this thing work?

Developing an intuition for how the convolutional neural network actually works is a real challenge.  The fact that it works at all is amazing.  The convolutional steps provide averaging to help create a bit of location and scale invariance, but there is more going on here.  Note that given a 28×28 image, the output of the second convolutional step is 64 7×7 arrays that represent feature activations generated by the weight templates.  It is tempting to look at these as images to see if one can detect anything.  Obviously the last fully connected layer can do a great job with these.  But it is easy to see what they look like.  If we apply h_pool2.eval(feed_dict={x: image}) we can look at the result as a set of 64 images.  Figure 6 does just that.  I picked two random 9 images, two 0s, two 7s and three 6s.  Each column in the figure depicts the first 9 of the 64 images generated by each.  If you stare at this long enough you can see some patterns there (such as the diagonals in the 7s and the cup shape of the 4s), but very little else.  On the other hand, taken together, each (complete) column is a sufficient signature for the last layer of the network to identify the figure with over 99% accuracy.  That is impressive.


Figure 6.  Images of h_pool2 given various images of handwritten digits.

I have put the code to generate these images on GitHub, along with the code to train the network and save the learned weights.  In both cases these are IPython notebooks.
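The display step itself is only a few lines.  A minimal sketch, assuming h_pool2, x and a test image tim as defined above:

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the second pooling layer for one test image.
tim.shape = ((1, 28*28))
features = h_pool2.eval(feed_dict={x: tim})     # shape (1, 7, 7, 64)
features = features.reshape(7, 7, 64)

# Show the first 9 of the 64 7x7 feature "images".
fig, axes = plt.subplots(3, 3)
for i, ax in enumerate(axes.flat):
    ax.imshow(features[:, :, i], cmap="gray", interpolation="nearest")
    ax.axis("off")
plt.show()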

A much better way to represent the data learned in the convolutional networks is to “de-convolve” the images back to the pixel layers.  In Visualizing and Understanding Convolutional Networks, Matthew Zeiler and Rob Fergus do this, and the results are fascinating.

Final Notes

I intended this note to be a gentle introduction to the programming style of TensorFlow and not an introduction to deep neural networks, but I confess I got carried away towards the end.  I will follow up this post with another that looks at recurrent networks and text processing, provided I can think of something original to say that is not already in their tutorial.

I should also note that the construction of the deep network is, in some ways, similar to Theano, the other great Python package for building deep networks.  I have not looked at how Theano scales or how well it can be used for tasks beyond deep neural networks, so don’t take this blog as an endorsement of TensorFlow over Theano or Torch.

On a final note, I must say that there is another interesting approach to the image recognition problem.  Antonio Criminisi has led a team that has developed Deep Neural Decision Forests, which combine ideas from decision forests and CNNs.

[1] This was before the explosion of massively parallel systems based on microprocessors changed the high-end computing world.

[2] If you are new to neural networks and deep learning, you may want to look at Michael Nielsen’s excellent new on-line book. There are many books that introduce deep neural networks, but Nielsen’s mathematical treatment is very clear and approachable.

[3] This is not a picture of the actual graph that is generated.   We will use another tool to visualize that later in this note.

[4] The observant reader will note that we have a possible divide by zero here if one of the special points is a serious outlier.   As I said, this is a bare-bones implementation.

Cloud SaaS: Machine Learning with Azure ML

The large public cloud vendors, including Amazon, Microsoft and Google, have invested heavily to provide Infrastructure as a Service (IaaS) to their customers.  The result of this intense competition has been a race to the bottom in pricing of the basic compute and storage services.  Great news for the customers.  We are also beginning to see a similar convergence around Platform as a Service as more and more basic tooling for building cloud apps becomes standard.  Clearly each vendor has some unique capabilities, but the real differentiation between vendors is taking place at the level of cloud software-as-a-service (SaaS) offerings.  The most interesting areas for science-relevant services are machine learning, stream analytics and big data analysis tools.  Each of the three big cloud vendors has offerings in this space, and so do others like IBM, Salesforce and Cloudera.  The next few posts will be about some experience using these tools.  I will start with AzureML because I have access to it.

I decided to redo some of my streaming scientific text analysis projects (described earlier: part 1 and part 2) using Microsoft’s new AzureML.  I’ll return to the full streaming topic using Azure Stream Analytics in the next post.

If you are not familiar with AzureML, the machine learning toolkit for Microsoft Azure, you should give it a try.  In fact, you can try it for free: go to https://studio.azureml.net.  AzureML is based on a drag-and-drop component composition model where you build a solution to a machine learning problem by dragging parts of the solution from a palette of tools.  This post is not intended as a tutorial for AzureML; there are tons of good tutorials online, and I would start with the ones on the studio.azureml.net home page.  This post is a description of what I was able to do with AzureML for the basic task of classifying scientific documents.  More specifically, we have a collection of RSS feed documents that describe new scientific results and research papers from various sources, but the best material for our purposes comes from the Cornell University Library ArXiv RSS feed.  Each item in the collection is a tuple consisting of the article title, the abstract and a classification into one of the basic science disciplines including Physics, Mathematics, Computer Science, Biology and Finance.  Here is a sample.

['A Fast Direct Sampling Algorithm for Equilateral Closed Polygons. (arXiv:1510.02466v1 [cond-mat.stat-mech])',
'Sampling equilateral closed polygons is of interest in the statistical study of ring polymers. Over the past 30 years, previous authors have proposed a variety of simple Markov chain algorithms (but have not been able to show that they converge to the correct probability distribution) and complicated direct samplers (which require extended-precision arithmetic to evaluate numerically unstable polynomials). We present a simple direct sampler which is fast and numerically stable. ',
'Physics']

The challenge is to use only the abstract to predict the classification.  (As you can see from this example a reasonable guess might be Physics, Math or Computer Science, so it is not that easy.)

A typical AzureML solution looks like the one below, which has been configured to train a neural network to classify the items in our list of science abstracts.  It is easy to mistake this diagram for a dataflow graph, but it is really a workflow dependency diagram represented as a directed acyclic graph.  Each arrow indicates that the output of one task forms part of the input of the next.  Each box represents one of the analysis subtasks.  When you run the training experiment, the subtasks that complete get a green checkmark.  It is possible to inspect the result of any subtask by clicking on the tail of the result arrow.  Doing this presents you with several possible choices that include saving the result, visualizing it in tabular form or, in some cases, viewing it in an IPython (Jupyter) notebook.


Figure  1.  AzureML Studio workflow diagram for the Multiclass Neural network and the Python module for creating the studio version of the arxivdata data set.

To understand this workflow, it is best to start at the top, which is where the data source comes into the picture.  In this case we are going to take the data from an Azure blob storage public archive, http://esciencegroup.blob.core.windows.net/scimlpublic.  The dataset is sciml_data_arxiv.p, which is a Python pickle file.  A recent addition to AzureML that I was very pleased to see was the introduction of a way to build a new component from R or Python.  Hence it was easy to write a small Python preprocessing module that could read the data, clean it up a bit and present it to the rest of the AzureML studio.  The data interface between Python and AzureML is based on Pandas data frames.  The output of the Python module can be accessed on the output port labeled 1.  We could have fed this directly into the next stage in our workflow, but we can also save it to AzureML studio.  We have done that in this case, and we used a copy of that dataset as the box “arxivdata”.  The data set has three columns, and each row represents one document as a triple (classification, the abstract of the document, the title of the document).  As we move through the workflow we will add columns and, for various tasks, we will restrict attention to only a few columns.
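The preprocessing module follows the AzureML “Execute Python Script” convention of a single azureml_main function that returns a Pandas data frame.  A minimal sketch, assuming the blob URL above and that the pickle holds a list of (title, abstract, class) items as in the sample shown earlier:

import pickle
import urllib2           # Python 2
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Pull the pickled list of (title, abstract, class) items from the
    # public blob archive (URL and item ordering assumed).
    url = "http://esciencegroup.blob.core.windows.net/scimlpublic/sciml_data_arxiv.p"
    items = pickle.loads(urllib2.urlopen(url).read())

    # Reshape into the three columns used by the rest of the experiment.
    data = {"class":    [it[2] for it in items],
            "document": [it[1] for it in items],
            "title":    [it[0] for it in items]}
    return pd.DataFrame(data, columns=["class", "document", "title"]),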

The second box down is “Feature Hashing”.  This box builds a vectorizer based on the vocabulary in the document collection.  This version comes from the Vowpal Wabbit library, and its role is to convert each document into a numerical vector corresponding to the key words and phrases in the document collection.  This numeric representation is essential for the actual machine learning phase.  To create the vector, we tell the feature hasher to look only at the document text.  What happens on the output is that the vector of numeric values for the abstract text is appended to the tuple for each document.  Our table now has a very large number of columns: class, the document text, the title, vector[0], …, vector[n], where n is the number of “features”.
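To get a feel for what the Feature Hashing box does, here is the same idea expressed with scikit-learn’s hashing vectorizer (this is not the Vowpal Wabbit implementation AzureML uses; it just illustrates how each document becomes a fixed-length numeric vector):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["Sampling equilateral closed polygons is of interest ...",
        "We prove a new bound on the mixing time of a Markov chain ..."]

# Hash each word and bigram into one of 2**10 buckets; every document becomes
# a vector of that fixed length regardless of the vocabulary size.
vectorizer = HashingVectorizer(n_features=2**10, ngram_range=(1, 2))
vectors = vectorizer.transform(docs)
print(vectors.shape)    # (2, 1024)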

In the next box “Split Data” we split the resulting table into a training set and a test set.   In this case we have configured the Split Data box to put 75% into the training set and the remainder in the test set.

For the machine learning we need to select a machine learning algorithm and some columns of the data to use for the training.  To select the columns for the training we use a “Project Columns” task, and we select the “class” column and the feature vector components.  (We don’t need the document text or title.)  AzureML has a reasonably large number of the standard machine learning modules.  We found three that were good, but by a small margin “Multiclass Neural Network” was the best performer.  Each machine learning module has various parameters that can be selected to tune the method.  For all the experiments described here, we just used the default parameter settings[i].  The “Train Model” component accepts as one input a binding to one of the ML methods (recall this is not a dataflow graph) and as the other input the projected training data.  The output of the Train Model task is not data per se but a trained model that may also be saved for later use.  This trained model can now be used to classify our test data, and that is what we do with the “Score Model” component.  The Score Model component appends another new column to our table called Scored Labels, which is the classification predicted by the trained model for each row.

Finally we can see how we did by using the “Evaluate Model” component, which computes a confusion matrix.  Each row of the matrix tells us how the documents in that class were classified.  For this experiment the confusion matrix is shown in Figure 2 below.


Figure 2.  Confusion matrix for the ArXiv data set, which includes some duplicates of bio and math documents.

There are several points worth noting here.  First, bio and finance documents were recognized with high accuracy.  This is somewhat artificial because documents in those categories were each repeated twice in the original data.  Hence after the splitting (¾ training, ¼ test) a large fraction of the test set for these documents (about 75%) will also appear in the training set, so they are easily recognized.  We have a more recent collection of ArXiv documents which does not include any of the training set items.  Figure 3 below shows the confusion matrix for this case.  It is clear that the classifier had a hard time distinguishing Physics, Math and Computer Science.  We have no doubt that we could achieve better results if we fine-tuned the neural net parameters.


Figure 3.   Confusion matrix for the multiclass neural network classifier using the more recent ArXiv data

We will show a better classifier in the second half of this article.

[i] By not tuning the parameters of each ML algorithm we are doing them an injustice.  But it takes expert knowledge of what the algorithm does and lots of time to experiment to find the right settings.   I was surprised at how well the default parameters worked.

Creating a web service from our trained classifier

One of the most impressive features of AzureML is how easily it can convert a trained model like the one above into a functioning web service.  In fact, it is very cool.  One click on the palette button for creating a web service transforms the diagram in Figure 1 into the diagram in Figure 4.


Figure 4.   A web service that was automatically created from the experiment in Figure 1.

We can test this web service from the studio or we can go ahead and deploy it to the web.  Once it has been deployed, AzureML will even generate the C#, Python or R code you can use to invoke it.  In this case the code for Python is

import urllib2
import json
data =  {
        "Inputs": {
                "input1":
                {
                    "ColumnNames": ["class","document","title"],
                    "Values": [["value","value","value"], ["value","value","value"], ]
                },        },
            "GlobalParameters": {
	}
    }
body = str.encode(json.dumps(data))
url = 'https://ussouthcentral.services.azureml.net/
       workspaces/5adbae48fb21484b84238008b25d95cb/services/
       9888e0b134724d0c834f009574275f65/execute?api-version=2.0&details=true'
api_key = 'abc123' # Replace this with the API key for the web service
headers = {'Content-Type':'application/json',
           'Authorization':('Bearer '+ api_key)}
req = urllib2.Request(url, body, headers)
try:
    response = urllib2.urlopen(req)
    result = response.read()
    print(result) 
except urllib2.HTTPError, error:
    print("The request failed with status code: " + str(error.code))
    print(error.info())
    print(json.loads(error.read()))                 

The input is defined by the data template, where you can supply one or more of the ArXiv tuples.  A copy of an IPython notebook that invokes the web service and computes the confusion matrix is linked here.
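Computing the confusion matrix from the service output is a one-liner with Pandas once the responses are in a data frame.  A sketch with made-up rows, assuming the “class” and “Scored Labels” columns returned by the service:

import pandas as pd

# One row per test document: the true class and the service's prediction.
results = pd.DataFrame({
    "class":         ["Physics", "Math", "Physics", "Bio"],
    "Scored Labels": ["Physics", "Physics", "Physics", "Bio"]})

confusion = pd.crosstab(results["class"], results["Scored Labels"],
                        rownames=["actual"], colnames=["predicted"])
print(confusion)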

Creating a more interesting classifier.

The example so far does not fully illustrate the power of the AzureML studio.  I decided to try to build a classifier that uses three different machine learning algorithms all trained on the same data.  Then I would use a majority vote to select the winning choice.  I have argued in previous posts that picking a single choice for this science data is not a good idea because science is very multidisciplinary.  The example cited above illustrates this point.  So I trained two additional ML algorithms: a boosted decision tree and a two-class support vector machine (converted to a multiclass algorithm using a “one vs all multiclass” module I found in the studio palette).  I saved the trained models for each.  Then I started with the diagram in Figure 4 and began adding things.  The result is in Figure 5.


Figure 5.   A best of three classifier service created by modifying the service in Figure 4.

You will notice that this is almost like three copies of the service in Figure 4.  The difference is that I needed to reduce the output to a smaller set of tuples to give to a simple Python module that does the majority voting.  This was done with column projections.  The leftmost project selects the “class” and “Scored Labels” columns (discarding the title, the document text and the doc vector), the second selects “class” and “Scored Labels”, and the third selects only the “Scored Labels”.  Then, using an “Add Column” component, we append the last column to the output of the second project.  By doing this we now have two inputs to a Python Script module (which is limited to two input dependencies).  The Python code inside the script module is shown below.  In this version we assume that if all three scored labels disagree we should pick the first (from the multiclass neural network classifier) as the main one and, arbitrarily, pick the second scored label as the “second choice”.  Otherwise, if any two agree, that becomes the first choice.

import pandas as pd

#   Param: a pandas.DataFrame
#   Param: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    tclass = dataframe1["class"]
    scored1 = dataframe1["Scored Labels"]
    scored2 = dataframe2["Scored Labels"]
    scored3 = dataframe2["Scored Labels (2)"]
    scored = []
    second = []
    lclass = []
    for i in range(0, len(tclass)):
        lclass.extend([tclass[i]])
        if scored2[i] == scored3[i]:
            scored.extend([scored2[i]])
            second.extend([scored1[i]])
        else:
            scored.extend([scored1[i]])
            second.extend([scored2[i]])
    data = {'class': lclass, 'Scored Labels': scored, 'second': second}
    df = pd.DataFrame(data, columns=['class', 'Scored Labels', 'second'])

    # Return value must be a sequence of pandas.DataFrame
    return df,

The web service now returns three values: the original class designation from ArXiv, our best-of-three choice and a second choice (which may be the same as the best-of-three).  If we now look at the confusion matrix for the original arxivdata data (including the training examples) we get


Figure 6.   Best-of-three with original data.

Figure 7 shows the result when we use the dataset arxiv-11-1-15, which contains no documents from the training set.


Figure 7.  Best of three with arxiv-11-1-15 data.

The improvement over the original method shown in Figure 3 is about 15%.   Of course we are now giving the classifier two chances to get the answer right.

As mentioned above, we could probably do much better by tuning the ML parameters.  But the point of this post was to show what is possible with a very modest effort with AzureML.  In the next post we will look at performance issues, scalability and streaming examples.

Science Gateways and Sustaining Science Software

I recently had the pleasure of attending the 10th Gateway Computing Environments (GCE) Workshop.  The GCE program was created about 10 years ago as part of the NSF TeraGrid project.  Nancy Wilkins-Diehr of the San Diego Supercomputer Center has been the leader of GCE from day one, and she continues to do an amazing job with the project and with organizing the workshops.  GCE has become a large and very productive collaboration in the U.S., and there is now also an international branch with meetings in Europe and Australia.

If you are unfamiliar with the science gateway concept, a one-paragraph tutorial is in order.  A science gateway is a web portal or app that is tailored to the research needs of a specific scientific community.  The gateway can give its users access to advanced applications and data collections.  For example, the Cyberinfrastructure for Phylogenetic Research (CIPRES) gateway (http://www.philo.org) explores the evolutionary relationships among various biological species.  It allows its users to run dozens of computationally intensive phylogenetic tree inference tools (e.g., GARLI, RAxML, MrBayes) on NSF’s XSEDE supercomputers.  Another great example is nanoHUB, which was built at Purdue University to provide education and research tools for nanotechnology.  In the genomics research world, GenomeSpace is a gateway for a large collection of tools, and it represents a broad community of developers and users.

I was around for the early days of GCE and I also attended the International event in Dublin in 2014. I have been amazed by the progress this community has made. Ten years ago the science gateway program in TeraGrid was a small project that was seen by the big computing centers as “mostly harmless”[1]. However it is now a big deal. NanoHub has evolved into Hubzero.org and it supports many additional domains including cancer research, pharmaceutical manufacturing, earthquake mitigation and STEM education.   It had 1.8 million users in 2014.   CIPRES is now one of XSEDE’s largest users, consuming 18.7M compute core hours in 2014.   Of course there are many big individual users doing great science at scale on the NSF supercomputer centers, but CIPRES and some other gateways represent about one half of all XSEDE users. The bottom line is that the nation’s investment in supercomputing is paying off for a very wide spectrum of scientists.

The workshop had some excellent talks.   There was a fantastic Keynote by Alexandra Swanson from Oxford and the Zooniverse project. Zooniverse is a rather unusual gateway.   Rather than being supercomputer powered it is “people powered”.     It started with the Galaxy Zoo project but now supports just about any research project that needs large teams of volunteers to help sift through vast piles of (mostly) image data and help with classification and discovery. It is a bit like the Amazon Mechanical Turk but much more science focused. Ali’s talk (slides here) really stressed the challenge of building and designing the portal technology needed to allow a user to build a productive Zooniverse project.

The slides for all the presentations are on the program page for the workshop. They were all good but two of them were very interesting to me. Pankaj Saha from Binghamton University gave a great talk about how they are now integrating Docker, Marathon and Mesos with the Apache Airavata project. Airavata is a framework for orchestrating the computational jobs and workflows of a science gateway onto supercomputers, local clusters and commercial clouds.   It is highly component based and a natural fit for Docker, Marathon and Mesos.   The talk was only a snapshot of work in progress and I look forward to seeing how well they are able to exploit these tools to make Airavata extremely easy to deploy and manage.

Shava Smallen of the SDSC gave a fascinating talk about how machine learning can be used to predict failures and manage resources for the CIPRES gateway.   My experience with one science gateway many years ago was that various parts were always crashing. (At the time, I preferred to blame the users and not our software.)   This SDSC project is really interesting. I am convinced that this is a great application for ML.   Distributed system software like Zookeeper, Mesosphere and Marathon along with extremely reliable messaging services like RabbitMQ have made huge strides in our ability to design robust systems.   But supercomputing platforms are designed to push the edge in terms of performance and are not very friendly to long running systems like science gateways.   Doing more of this sort of “smart” monitoring and failure prediction is very important.

Sustainable Open Source Scientific Software

The other Keynote at the event was by Dan Katz from Chicago and NSF.   While his talk was not about science gateways he did raise an issue that is of great concern to the gateway community.   Specifically how do we maintain software when the grant funding that paid for its construction runs dry?   This question has been around for as long as scientists have been writing code.

I think we have to ask the following question. What makes open source software sustainable over long periods of time? In other words, who pays the salary of the programmers tasked with maintaining and improving it? I believe there are several answers to these questions.   Fundamentally, when a piece of software becomes a critical component of the infrastructure of an entity or entities with deep pockets, those entities will make sure that the software is sustained.   There are a number of big tech companies that depend upon various bits of open source and they pay their people to make sure it continues to work.   I know that Microsoft is an active contributor to several open source platforms. Companies like IBM, Facebook, Twitter, Amazon, and Google are all supporting open source projects. Often these same companies will take something that was developed internally and open source it because they see value in attracting other contributors. Good examples are Hadoop (from Yahoo) and Kubernetes (from Google). Other sustaining entities are organizations like CERN, the National Institutes of Health, the Department of Energy and NASA.   But in these cases the software being sustained is critical to their specific missions. The NSF has a greater challenge here.   Its budget is already loaded with supporting more critical infrastructure than it can afford if it is to maintain its mission to support cutting edge science across a vast spectrum of disciplines.

The other way open source software originating in universities is sustained is to spin off a new start-up company. But to make this happen you need angel or venture funding.   That funding will not appear if there is no perceived value or business model. Granted, in today’s start-up frenzy value and business model are rather fuzzy things.   Often these companies will find markets or they will be bought up by larger companies that depend upon their products.   UC Berkeley has been extremely impressive in this regard.   DataBricks.com has spun out of the successful Spark data analysis system and Mesosphere.com is a spin-off from the Mesos project.   Both of these are very exciting and have a bright future.

The third way open source survives is the labor of love model.   Linux was maintained by a dedicated cadre of professionals, many of whom contribute in the off-hours after their day jobs.   The Python community is another remarkable example and the best example of sustaining open source scientific software. Companies like Enthought and Continuum Analytics have emerged from the community to support Python for technical computing. Enthought supports SciPy.org, the great resource for scientific Python tools.   Continuum provides the peerless free Anaconda Python distribution. And the Numfocus foundation has been created by the community to channel financial sponsorship to many of the most significant science projects of interest to the Python user community.

Looking at the success of Python, Hadoop, Mesos and Spark, one can see critical common threads that are missing in many research software projects.  First, there must be a user community that is more than casual.   It must be passionate. Second, it must be of value to more than these passionate scientists. It must help solve business problems.   And this is the challenge facing the majority of science software, which is often too narrowly focused on a specific science discipline.   I have often seen groups of researchers that have the critical mass to make a difference, but they are not sufficiently collaborating.   How many different scientific workflow systems do we need? All too often the need for academic or professional promotion gets in the way of collaboration and the result is duplication.

I believe the Science Gateway community is doing the right thing. I see more consolidation of good ideas into common systems and that is great. Perhaps they should create a Numfocus-like foundation? And they need to solve some business problems. 🙂

[1] A description borrowed from the Hitchhiker’s Guide to the Galaxy. The full description for the planet Earth was “mostly harmless”.   This was the revised extended entry.   The earlier entry for the planet Earth was “harmless”.

Data and Code Available for Science Doc ML and microservice experiment

The data and code used in the Azure Python streaming data experiments described in the blog posts on microservice performance architecture and Processing Scholarly Event Streams in the Cloud are now available. The code has been rewritten to be free of all Azure dependencies with the exception of the use of Azure Tables for the final storage from the table web service. It is certainly possible to rewrite this to use another database.

There are four types of data

  1. The arxiv configuration files. They take the form of config_name.json where name can be all4 (the top level), bio (the arxiv q-bio objects), compsci (computer science), math, phy (Physics), finance (finance).
  2. The machine learning model files (as generated by doc_analysis_final.py described below)
  3. The raw data from the streams. There are three of these. sciml_data_arxiv is the original data set from arxiv; a portion of this was used for the training set. sciml_data_arxiv_new_9_28_15 is a recent snapshot of arxiv data that was not used in the training. sciml_data_scimags is the rss data from the various science mags.
  4. The output of the main (top level) classifier. This was used to push events directly to the message broker for use in the performance analysis. This takes the form dump_all_subtopic_name where name is one of q-bio, Physics, compsci, math or physics. (note these are not the same as the names on the config files.)

The data is stored in two places.

  1. The configuration data, the rss feed input data and model data are stored on a publicly readable OneDrive site. The URL for this is http://1drv.ms/1PCOT8l (cut this link and paste it into your browser.)
  2. The OneDrive files are ok for download from the browser, but not as efficient for program level access. So the programs here read the files from a public, read-only account “http://esciencegroup.blob.core.windows.net/scimlpublic”. The code for reading the files is included in the source codes.
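For program-level access, a minimal sketch of fetching one of the files over plain HTTP from that public container might look like the following; the file name used here is only a hypothetical example of the config_name.json naming convention described above.

import json
import urllib.request

BASE = "http://esciencegroup.blob.core.windows.net/scimlpublic"

def fetch_json(name):
    # The container is public and read-only, so a simple GET is enough.
    with urllib.request.urlopen(BASE + "/" + name) as resp:
        return json.loads(resp.read().decode("utf-8"))

# hypothetical file name following the config_name.json convention above
config = fetch_json("config_all4.json")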

The code is now on GitHub at https://github.com/dbgannon/sciml

It is not the prettiest code.   It wouldn’t pass a code review at MS, but I am moving on to something else now.

Two New Research News Portals

I have always been a fan and avid reader of on-line news portals such as those from the New York Times and the Washington Post, but I am always looking for better sources of science news.   Of course there are several stalwarts, such as Nature on-line and Scientific American. Nature is the “go to” site for life science related news. SA is a property of Nature and, like Nature, only available on a paid subscription basis.   In the same high quality league is the AAAS on-line version of Science Magazine. PLOS, the Public Library of Science, has a vast collection of research articles but it is not really a “news” site. However it does have an interesting set of blogs. The Society for Science & the Public (SSP) has sciencenews, which has a strong science education mission.   Other sites that are more focused on bringing science to the people include Discovery and, to some extent, Wired.

There are a number of sites that are curated news aggregators, i.e. there is little original content, but rather a well curated selection of links to other useful sites.   One good example is Science Central.   One of the sites they point to is LiveScience which is an online publication of a media content company called Purch.

These sites are all valuable and they can definitely help elevate the low level of science knowledge among the general population (though I fear most US Presidential candidates are immune to scientific thought or discoveries). But what I have always wanted to see are more news sites related to my interests in computer science, high performance computing and applications.   So far the best have been the ACM and IEEE portals such as the ACM Communications site and, for HPC news, the venerable HPCwire. The Communications site gives a great preview of the ACM print version and it also does a good job of news aggregation for the computer science community.   HPCwire is very focused on the supercomputing community supported by DOE and NSF.   It is a great way to keep up on who is doing what.   For example there is a nice piece by Paul Messina on code optimization steps being taken to get the most out of the new Aurora system.

Now we have two additional computer and computational science news sites that are certainly nice new resources.   One is from my old friends at Microsoft Research called Research News.   This site is still in Beta but it is worth a look.   It is a very well done news aggregator that is focused on computer science and applications.   It also has a reasonable event calendar and well curated spotlight and feature selections. The editorial advisors are all from MSR and I thought I would see more MSR-related content, but it was not as much as I expected. (I did learn about Prajna, a Spark-like data analysis tool done in F#, and I will follow up more on that later.) I was disappointed to find no search function. Where is the Cortana and Bing tie-in?   I expect they are working on that one.

The other new site is ScienceNode.org.   Science Node is a redesign and rethink of iSGTW (International Science Grid This Week) and it is much improved.   (Full disclosure: Science Node is now largely led by other former colleagues at Indiana University, but I only discovered that later when I decided to write about it.) In many ways ScienceNode is a nice complement to ResearchNews.   It is an aggregation site, but it is also nicely edited with some original content. For example there is a nice piece about the 3D digital sketching work by Julie Dorsey. They do have a search function, but more interesting is that they encourage community involvement by allowing readers to post links to blogs or to pitch a story.

In summary, I am excited to see ResearchNews and ScienceNode appear.   They are both great contributions and I look forward to seeing how they evolve.

Processing Scholarly Event Streams in the Cloud

Every day scientists publish research results. Most often the resulting technical papers are sent to the scientific journals and often the abstracts of these papers make their way onto the Internet as a stream of news items that one can subscribe to through RSS feeds[1]. A major source of high quality streamed science data is the Cornell University Library ArXiv.org, which is a collection of over one million open-access documents.   Other sources include the Public Library of Science (PLOS ONE), Science and Nature.   In addition, when science reporters from the commercial press hear about new advances, they often publish articles about these discoveries that also appear as part of the RSS stream. For example Science Daily, Scientific American, New Scientist, Science Magazine and the New York Times all have science news feeds.

Suppose you are doing research on some topic like “black holes and quantum gravity” or “the impact of DNA mutation on evolution” or “the geometry of algebraic surfaces”.   Today you can subscribe to some of these topics through the RSS feed system, but suppose you had a virtual assistant that could alert you daily to any newly published progress or news reports related to your specific scientific interests. You may also want to see science results from disciplines outside your standard science domains that may be related to your interests.   What I would really like is a “Science Cortana” that can alert me with messages like “here are some interesting results from mathematics that were published today that relate to your science problem…” or “Yo! You know the problem you have been working on … well these people have solved it.”

Object classification using machine learning is a big topic these days.   Using the massive on-line data collections provided by user photos in the cloud and movies on YouTube, the folks at Google, Microsoft, Facebook and others have done some amazing work in scene recognition using deep neural networks and other advanced techniques. These systems can recognize dogs down to the breed … “this dog is a Dalmatian” … or street scenes with descriptions like “a man with a hat standing next to a fire truck”.   Given this progress, it should be an easy task to use conventional tools to classify scientific paper abstracts down to the scientific sub-discipline level.   For example, determine if a paper is about cellular biology or genomics or bio-molecular chemistry and then file it away under that heading. In this post I describe some experience with building such an application.

This is really the first of a two-part article dealing with cloud microservices and scientific document analysis.   The work began as I was trying to understand how to build a distributed microservice platform for processing events from Internet sources like twitter, on-line instruments, and RSS feeds.   Stream analytics is another very big topic and I am not attempting to address all the related and significant issues such as consistency and event-time correlation (see a recent blog post by Tyler Akidau that covers some interesting ideas not addressed here.) There are many excellent tools for stream analytics (such as Spark Streaming and Amazon Kinesis and Azure Stream Analytics) and I am not proposing a serious alternative here. This and the following post really only address the problem of how can a microservice architecture be used for this task.

The application: classifying scientific documents by research discipline

It turns out that classifying scientific documents down to the level of academic sub-discipline is harder than it looks.   There are some interesting reasons that this is the case.   First, science has become remarkably interdisciplinary with new sub-disciplines appearing every few years while other topics become less active as the remaining unsolved scientific problems become scarce and hard.   Second a good paper may actually span several sub-disciplines such as cellular biology and genomics.   Furthermore articles in the popular press are written to avoid the deeply technical science language that can help classify them.   For example, a story about the search for life on the moons of Jupiter may be astrophysics, evolution, geophysics or robotics.   Or all of the above.   If the language is vague it may not be possible to tell.   Finally there is the issue of the size of the available data collections. There are about 200 science articles published each day. As an event stream this is clearly not an avalanche.   The work here is based on a harvest of several months of the science feeds so it consists of about 10,000 abstracts.   This is NOT big data.   It is tiny data.

To approach the machine learning part of this problem we need two starting points.

  1. A list of disciplines and subtopics/sub-disciplines we will use as classification targets
  2. A training set.   That is a set of documents that have already been classified and labeled.

For reasons described below the major disciplines and specific sub-disciplines are

Physics
    Astrophysics and astronomy
    General Relativity, Gravitation, Quantum Gravity
    Condensed Matter Physics
    High Energy Physics
    Mathematical Physics
    Nuclear Physics
    Quantum Physics
Biology
    Neuroscience
    Cell Behavior
    Genomics
    Evolution
    Subcellular Organization
    Tissue/Organs
    Modeling
Computer Science
    Artificial Intelligence & Machine Learning
    Engineering & Math Software
    CS and Game Theory
    Data Structures, Algorithms & Theory
    Systems, Programming Languages, Soft Eng
    HCI & Graphics
    Databases & Information Retrieval
    Networks, Security and Soc Nets
Finance
    Computational Finance
    Economics
    Statistical Finance
    Portfolio Management
    Markets
    Risk Management
    General Finance
Math
    Algebra
    Topology
    Analysis
    Other

Table 1.   Semi-ArXiv based topic and sub-discipline categories.  (ArXiv has even finer grained categories that are aggregated here.)

Unfortunately, for many of our documents we have no discipline or subtopic labels, so they cannot be used for training.   However the ArXiv[2] collection is curated by both authors and topic experts, so each document has a label.   Consequently we have built our training algorithms to correspond to the discipline subtopics in ArXiv.   In some cases the number of discipline subtopics in ArXiv was so large it was impossible for our learning algorithm to distinguish between them. For example mathematics has 31 subtopics and computer science has 40. Unfortunately the size of the data sample in some subtopics was so tiny (less than a few dozen papers) that training was impossible. I decided to aggregate some related subtopics for the classification. (For mathematics I chose the four general areas that I had to pass as part of my Ph.D. qualification exam! A similar aggregation was used for computer science.)

ArXiv has another discipline category not represented here: Statistics. Unfortunately, the interdisciplinary problem was so strong here that most of Statistics looked like something else.   A large fraction of the statistics articles were in the category of machine learning, so I moved them to the large sub-discipline of CS called AI-ML.   The rest of statistics fell into the categories that looked like applied math and theory so they were moved there.

Using the ArXiv classification as training data worked out well but there was one major shortcoming. It does not cover all major science disciplines.   For example there is no geoscience, ecology or psychology. (PLOS covers a much broader list of disciplines and we will use it to update this study soon.)   This lack of earth sciences became very noticeable when classifying papers from the popular science press. I will return to this problem later.

The application is designed as three layers of microservices as shown in the diagram below (Figure 1). The first layer pulls recent docs from the RSS feeds and predicts which major category is the best fit for the document.   It then pushes this document record to second level services specific to that topic. In fact, if our classifier cannot agree on a single topic, it will pick the best two and send the document to both subtopic analyzers.  The second level service predicts which subtopics are the best fit and then pushes the document to a storage service where it is appended to the appropriate sub-discipline list.

[1] Technically RSS is not a “push” event technology. It needs another service to poll the sources and push the recent updates out as events.   That is what we do here.

[2] FYI: ArXiv is pronounced archive.   Very clever.


Figure 1.   Basic Microservice architecture for science document classifier (click to enlarge)

The output will be a webpage for each subtopic listing the most recent results in that sub-domain.

What follows is a rather long and detailed discussion of how one can use open source machine learning tools in Python to address this problem.   If you are not interested in these details, you may want to jump to the end of the post and read about the fun with non-labeled documents from the press and the concluding thoughts.   I will return to the details of the microservice architecture in the next post.

Building the Document Classifiers

The document analyzer is a simple adaptation of some standard text classifiers using a “Bag of Words” approach.   We build a list of all words in the ArXiv RSS feeds that we have harvested in the last few months. This collection is about 7100 ArXiv documents. The feed documents consist of a title, author list, abstract and URL for the full paper. The ArXiv topic label is appended to the title string.   Our training is done using only the abstract of the article and the topic label is used to grade the prediction.

The major topic predictor and the sub-discipline classifiers both use the same code base using standard Python libraries.   In fact we use five different methods in all.   It is a common belief that the best classifier is either a deep neural net or the Random Forest method.  We will look at deep neural nets in another post, but here we focus on Random Forest. This method works by fitting a large number of random binary decision trees to the training data.   Another interesting classifier in the Scikit-Learn Python library is the Gradient Tree Boosting algorithm. This method uses a bunch of fixed size trees (considered weak learners) but then applies gradient boosting to improve their accuracy. In both cases it is the behavior of the ensemble of trees that is used for the classification.

For any Bag-of-words method you first must convert the words in the training set into a dictionary of sorts so that each document is a sparse vector in a vector space whose dimension is equal to the number of words in the dictionary.   In this case we use a “count vectorizer” that produces a matrix of word token counts.   We then apply a Term Frequency – Inverse Document Frequency (TFIDF) normalization to a sparse matrix of occurrence counts. This is a sparse matrix where each row is a unit vector corresponding to one of the training set documents.  The size of the row vector (the number of columns) is the size of the vocabulary. The norm of the row vector is 1.0.   In the code, shown below, we then create the random forest classifier and the gradient tree boosting classifier and train each with the training data.


Figure 2.  Sample Python code for building the random forest and gradient tree boosting classifiers.
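For readers who want to reproduce something similar, here is a minimal sketch of the same steps using scikit-learn; the variable names (train_docs, train_labels, all_docs), the stand-in data and the parameter choices are assumptions rather than the original code from the figure.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# stand-ins for the harvested abstracts and their ArXiv labels
train_docs = ["black hole entropy and quantum gravity", "gene expression in single cells"]
train_labels = ["Physics", "Biology"]
all_docs = train_docs

# token counts followed by TF-IDF normalization (unit-length rows)
vectorizer = CountVectorizer(stop_words='english')
counts = vectorizer.fit_transform(train_docs)
tfidf = TfidfTransformer().fit(counts)
X_train = tfidf.transform(counts)

rf = RandomForestClassifier(n_estimators=100).fit(X_train, train_labels)
gb = GradientBoostingClassifier().fit(X_train.toarray(), train_labels)

# predict over the full document collection, as described below
X_all = tfidf.transform(vectorizer.transform(all_docs))
rf_pred = rf.predict(X_all)
gb_pred = gb.predict(X_all.toarray())

The rf_pred and gb_pred labels are then compared against the ArXiv labels to produce the success rates discussed below.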

Finally we make a prediction for the entire set of documents in the ArXiv collection (all_docs in the code above).   This is all pretty standard stuff.

The training set was a randomly selected subset of ArXiv data. Using a relatively small training set of no more than 400 of the documents from each category we got the following results. With five major categories that represents about 33% of all of the ArXiv documents.   This provided a correct classification for 80% of the data using only the Random Forest Classifier.   Unfortunately the distribution of documents among the various categories is not uniform. Physics and Math together constitute over half of the 5500 documents.   Using a larger training set (75% of the documents from each category) gave a stronger result. However, when we use both Random Forest (RF) and Gradient Boosting (GB) and consider the result correct if either one gets the answer right, we get a success rate of 93%. Looking more closely at the numbers, the success rate by topic is shown below.   We have also listed the relative false positive rate (100*number of incorrect predictions/the total size of the category).

Major topic % RF or GB correct % relative false positive rate
Physics 97.5 3.2
Math 92.3 14.6
Bio 88.9 1.6
Compsci 86.8 9.7
Finance 90.8 2.8

Table 2.   Basic top-level classifier results with success percentage if one of the two methods is correct.   The relative false positive rate is 100*(total incorrect predictions of a document belonging to domain-x)/sizeof(domain-x).
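For reference, the “RF or GB correct” number is just the fraction of documents for which at least one of the two predictions matches the ArXiv label; a small sketch, reusing the assumed names from the classifier sketch above (true_labels is an assumed array of the ArXiv labels for all_docs):

import numpy as np

either_correct = (np.array(rf_pred) == np.array(true_labels)) | \
                 (np.array(gb_pred) == np.array(true_labels))
print("RF or GB correct: %.1f%%" % (100 * np.mean(either_correct)))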

It is also worth noting that RF and GB agree only about 80% of the time and RF is only slightly more accurate than GB.   The false positive rates tell another story.   For some reason Mathematics is over predicted as a topic.   But there are additional more sociological factors at work here.

The choice of the ArXiv label often reflects a decision by the author and tacit agreement by the expert committee.   For example the article titled

“Randomized migration processes between two epidemic centers.”

is labeled in ArXiv as [q-bio.PE], which means quantitative biology, with the sub area “population and evolution”.   This paper illustrates one of the challenges of this document classification problem.   Science has become extremely multidisciplinary and this paper is an excellent example.   The authors are professors in a school of engineering and a department of mathematics. If you look at the paper it appears to be more applied math than quantitative biology. This is not a criticism of the paper or the ArXiv evaluators.   In fact this paper is an excellent example of something a quantitative biologist may wish to see. The Random Forest classifier correctly identified this as a biology paper, but the gradient boosting method said it was math. Our best-of-three (described below) method concluded it was either computer science or math.   (In addition to some rather sophisticated mathematics the paper contains algorithms and numerical results that are very computer science-like.)

All of this raises a question.   Given a science article abstract, is there a single “correct” topic prediction? Textbooks about machine learning use examples like recognizing handwritten numerals or recognizing types of irises where there is a definitive correct answer.   In our case the best we can do is to say “this document is similar to these others and most of them are in this category, but some are in this other category”.   Given that science is interdisciplinary (and becoming more so) that answer is fine.   The paper should go in both.

This point can be further illustrated by looking at the subtopic level where the distinction between sub-disciplines is even more subtle.   Physics is the most prolific science discipline in the ArXiv.   It has many sub-discipline topics.   For our purpose we have selected the set listed in Table 1.   Our classifier produced the following results when restricted to Physics, using a training set consisting of 75% of the items in each category but no more than 200 from each.   The overall score (either RF or GB correct) was 86%, but there were huge false positive rates for General Relativity (GR-QC) and High Energy Physics (HEP), where we restricted HEP to theoretical high energy physics (HEP-th).

Sub-discipline %correct 100*relative false positive rate
Astrophysics 88.5 6.24
General Relativity & Quantum Grav 89.4 28.33
Condensed Matter 74.7 0.0
High Energy Physics 77.8 21.4
Mathematical Physics 73.7 0.0
Nuclear Physics 66.7 0.0
Quantum 74.6 0.0

Table 3.  Classifier results for Physics based on RF or GB predicting the correct answer

In other words, the system predicted General Relativity in places where that was not the correct category, and High Energy Physics incorrectly in others.   These two subtopics tended to be over-predicted. A closer look at the details showed many HEP papers were labeled as GR-QC and vice-versa.   There is a great way to see why.   Below are the “word cloud” diagrams for HEP, GR-QC and Astrophysics.   These diagrams show the relative frequency of words in the document collection by the size of the font.  Big words are “important” and little ones less so.   As you can see, the HEP and GR-QC word clouds are nearly identical and both are very distinct from Astrophysics.


Figure 3.  Word cloud for documents in General-Relativity  and Quantum Gravity

Figure 4.  Word cloud for Theoretical High Energy Physics documents


Figure 5.  Word cloud for Astrophysics
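The post does not say which tool produced these diagrams; a minimal sketch using the third-party wordcloud package would look something like this (hep_abstracts is an assumed name for the list of abstracts in the HEP bucket).

from wordcloud import WordCloud

# concatenate all abstracts in one sub-discipline bucket into a single string
text = " ".join(hep_abstracts)

# font size in the rendered image reflects relative word frequency
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
wc.to_file("hep_wordcloud.png")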

(Restricting the High Energy Physics category to only Theoretical papers caused this problem and the results were not very valuable, so I added additional subcategories to the HEP area including various topics in atomic physics and experimental HEP. This change is reflected in the results below.)

A Clustering Based Classifier

Given the small size of the training set, the random forest method described above worked reasonably well for the labeled data. However, as we shall see later, it was not very good at the unlabeled data from the popular press.   So I tried one more experiment.

There are a variety of “unsupervised” classifiers based on clustering techniques.   One of the standards is the K-means method.   This method works by iterating over a space of points and dividing them into K subsets so that the members of each subset are closer to their “center” than they are to the center of any of the other clusters.   Another is called Latent Semantic Indexing (LSI) and it is based on a singular value decomposition of the document-term matrix.   And a third method is Latent Dirichlet Allocation (LDA), a probabilistic topic model that represents each document as a mixture of latent topics.

The way we use these three methods to build a trained classifier is a bit unconventional.   The basic idea is as follows.   Each of the three methods, K-means, LSI and LDA has a way to ask where a document fits in its clustering scheme.   In the case of K-means this is just asking which cluster is nearest to the point represented by the document.   We use the LSI and LDA implementations from the excellent gensim package. In the case of LSI and LDA, we can use a special index function which performs a similarity query.

As before we begin by converting documents into unit vectors in a very high dimension space.   We begin with the list of documents.

The list of words from these abstracts is then filtered to remove standard English stop words as well as a list of about 30 stop words that are common across all science articles and so are of no value for classification.   This includes words like “data”, “information”, “journal”, “experiment”, “method”, “case” etc. Using the excellent pattern Python package from the Computational Linguistics & Psycholinguistics Research Center at the University of Antwerp, we extract all nouns and noun phrases as these are important in science.   For example phrases like “black hole”, “abelian group”, “single cell”, “neural net”, “dark matter” have more meaning than “hole”, “group”, “cell”, “net” and “matter” alone.   Including “bi-grams” like these is relatively standard practice in the document analysis literature.

For each document we build a list called “texts” in the code below. texts[i] is a list of all the words in the i-th document.   From this we build a dictionary and then transform the Text list into another list called Corpus. Each element in Corpus is a list of the words in the corresponding document represented by their index in the dictionary.   From this corpus we can build an instance of the LSI or LDA model.   We next transform the corpus into tfidf format and then into the format LDA or LSI need.   From these we create an index object that can be used to “look up” documents.

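The original code appears as an image; here is a minimal sketch of those steps with gensim, using assumed names (texts for the per-document token lists and ntopics for the number of topics).

from gensim import corpora, models, similarities

# texts[i] is the list of filtered words in the i-th document
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words per document

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

ntopics = 10                                          # an assumed value
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=ntopics)
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=ntopics)

# similarity indexes used to "look up" documents
Index_lsi = similarities.MatrixSimilarity(lsi[corpus_tfidf])
Index_lda = similarities.MatrixSimilarity(lda[corpus])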

For example, suppose I have a new four-word document “dna genome evolution rna” and I want to find the abstracts in our collection that are most similar.
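Continuing the sketch above (not the original code), the similarity query looks roughly like this:

query = "dna genome evolution rna"
vec_bow = dictionary.doc2bow(query.lower().split())
vec_lsi = lsi[tfidf[vec_bow]]
sims = Index_lsi[vec_lsi]        # one similarity score per document in the corpus

# enumerate and sort to find the five most similar documents
top5 = sorted(enumerate(sims), key=lambda item: -item[1])[:5]
print(top5)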
Index_lsi (or Index_lda) returns a list of floating point numbers where the i-th number is the similarity in lsi space to our new document.   To find the top five most similar documents we enumerate and sort the list. Looking at the head of the list we have

(output listing: the five closest documents and their similarity scores)

Document 534 is one with title “Recombinant transfer in the basic genome of E. coli” and is indeed a good match for our rather short query document.

The best of three classifier

So now we have three ways to find documents similar to any given document.   Here is how we can use them to build a crude classifier we call “best-of-three”.

  1. Assume you have n target categories.   At the top level we have n = five: Physics, Biology, CS, Math and Finance.
  2. Sort the training set of documents into n buckets where each bucket consists of those documents that correspond to one of the categories. So bucket 1 contains all training documents from physics and bucket 2 has all training documents from Biology, etc.
  3. Given a new document x we compute the top 5 (or so) best matches from each of the three methods (KM, LSI and LDA) as shown above.
  4. We now have 15 (or so) potential matches. Unfortunately each of the three methods uses a different metric to judge them and we need to provide a single best choice. We take a very simple approach. We compute the cosine distance between the vector representing the document x and each of the 15 candidates.   The closest document to x is the winner and we look to see in which bucket that document lives; the label of that bucket is our classification. (A sketch of this final step follows the list.)
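A minimal sketch of that final decision step, with assumed names for the data structures:

import numpy as np

def best_of_three(x, candidates):
    # x: unit-length tf-idf vector for the new document
    # candidates: list of (doc_vector, bucket_label) pairs pooled from the
    # KM, LSI and LDA matches; all vectors are unit length, so the dot
    # product is the cosine similarity
    best_label, best_sim = None, -1.0
    for vec, label in candidates:
        sim = float(np.dot(x, vec))
        if sim > best_sim:
            best_sim, best_label = sim, label
    return best_label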

The Centroid classifier

Unit length vectors are points on a high dimensional sphere. The cosine distance between two vectors of unit length is directly related to the distance between those points.  Because the best-of-three method uses the cosine distance as the final arbiter in a decision process we can envision an even simpler (and very dumb) method.   Why not “summarize” each bucket of training set items with a single vector computed as the “centroid” of that set of training set documents? (By centroid here we mean the normalized sum of all the vectors in training set bucket k.)  We can then classify document x by simply asking which centroid is closest to x.   This turns out to be a rather weak classifier. But we can improve it slightly using the best-of-three method.   For each bucket k of documents with training label k, we can ask KM, LDA and LSI which other documents in the training set are close to those elements in k.   Suppose we pick 5 of the nearby documents for each element of the bucket and add these to an extended bucket for label k as follows.

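A minimal sketch of the bucket extension and the resulting centroid classifier, with assumed names and data structures:

import numpy as np

def extend_buckets(buckets, neighbors, k=5):
    # buckets: label -> list of training document ids
    # neighbors(doc_id, k): assumed helper returning the k nearest document
    # ids pooled from the KM, LSI and LDA indexes
    extended = {}
    for label, docs in buckets.items():
        members = set(docs)
        for d in docs:
            members.update(neighbors(d, k))
        extended[label] = members
    return extended

def make_centroids(extended, doc_vectors):
    # doc_vectors: document id -> unit-length numpy vector
    cents = {}
    for label, members in extended.items():
        c = np.sum([doc_vectors[d] for d in members], axis=0)
        cents[label] = c / np.linalg.norm(c)
    return cents

def centroid_classify(x, cents):
    # the label of the centroid with the largest dot product (smallest cosine distance)
    return max(cents, key=lambda label: float(np.dot(x, cents[label])))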

We now have buckets that are larger than the original and the buckets from different labels will overlap (have non-trivial intersections).   Some physics documents may now also be in computer science and some math documents will now also be in biology, but this is not so bad.   If it is labeled as being in physics, but it looks like computer science then let it be there too.

The Results

We now compare the methods RandomForest (rf), Best-of-Three (best) and the new centroid classifier (cent) against each other for all the labeled data.   The results are complicated by two factors related to the selection of the training sets.   The balance between the numbers of papers in each subcategory is not very good.   At the top level, physics has thousands of papers and finance has only a few hundred. Similar imbalances exist in all of the individual disciplines.   Consequently if you take a random selection of X% of the labeled data as the training set, then the prolific disciplines will be heavily represented compared to the small disciplines.  The result is that the false positive rates for the heavy disciplines can be extremely high (greater than 10% of all documents).  On the other hand, if you take a fixed number of papers from each discipline then the training set for the big disciplines will be very small. The compromise used here is to take a fixed percent (75%) of the documents from each discipline, up to a maximum number from each; with this we get a respectable balance, i.e. false positive rates below 10%. The table below is the result for the top-level classification.

Table 4.  Top-level results from 7107 labeled documents with a training set consisting of 63% (with a maximum of 1500 documents from each sub-discipline.)

(In Table 4 and the following tables we have calculated the “correct” identifications as the percent of items of domain x correctly identified as being x.  This is the same as “recall” in the ML literature. The “false pos” value is the percent of items incorrectly identified as being in category x.  If n is the number of documents in the physics category and m is the number of documents incorrectly identified as physics, then the table reports 100*m/n in the physics column.  This is not the same as the textbook definition of false positive rate.)

In the tables below we show the results for each of the disciplines using exactly the same algorithms (and code) to do the training and analysis.

Table 5.  Physics sub-domain consisting of 3115 documents and a training set of 53%.  (300 doc max per subcategory)


Table 6.  Math sub-domain consisting of 1561 documents and a training set size of 39% (200 doc max per subcategory)


Table 7.  Biology sub-domain consisting of 1561 documents and a 39% training set size (200 docs max size per subcategory)


Table 8.  Computer Science sub-domain consisting of 1367 documents and 56% training set size (200 doc max docs per subcategory)


Table 9.  Finance sub-domain consisting of 414 documents and a 64% training set size (50 documents max per subcategory)

As can be seen from the numbers above, Random Forest continues to be the most accurate, but the Best-of-Three method is not bad and in some cases outperformed Random Forest.   And, because we assume that science is very interdisciplinary, we can use both methods as a way to push documents through our system.   In other words, if rf and best agree that a document belongs in category X, then we put it there.   But if they disagree and one says category X and the other says category Y, then we put the document in both.

Non-labeled documents

Now we turn to the non-labeled data coming from the RSS feeds from Science Daily, Scientific American, New Scientist, Science Magazine and the New York Times.   The reader may wonder why we bothered to describe the centroid method and list its performance in the table above when it is clearly inferior to the other methods on the labeled ArXiv data.   The surprise is that it beats the Random Forest and Best-of-Three on the popular press data.

I do not have a good explanation for this result other than the fact that the popular press data is very different from the clean scientific writing in science abstracts. Furthermore the science press feeds are often very short.   In fact they often look more like tweets.   For example, one from Science Daily reads

“Scientists have found that graphene oxide’s inherent defects give rise to a surprising mechanical property caused by an unusual mechanochemical reaction.”

The subject here is materials science and the Best-of-Three algorithm came the closest and called this physics. In other cases all the algorithms failed. For example, from the Science Daily feed, this article

“While advances in technology have made multigene testing, or \’panel testing,\’ for genetic mutations that increase the risk of breast or other cancers an option, authors of a review say larger studies are needed in order to provide reliable risk estimates for counseling these patients.”

was determined to be “Finance” by all three algorithms.

And there were many posts that had no identifiable scientific content.   For example
“Casual relationships, bittersweet news about chocolate, artisanal lightbulbs and more (full text available to subscribers)”

or

“If social networks were countries, they’d be police states. To change that we may have to rebuild from the bottom up.”

or

“Footsteps of gods, underground dragons or UFOs? Rachel Nuwer joins the fellowship of the rings out to solve the enigma in the grassy Namibian desert.”

This does not mean these stories had no scientific or educational value.   It only means that the abstract extracted from the RSS feed was of no value in making a relevance decision.

Another major problem was that our scientific categories are way too narrow. A major fraction (perhaps 70%) of the popular press news feeds are about earth science, ecology, geophysics, medicine, psychology and general health science.   For example, the following three have clear scientific content, but they do not match our categories.

“Consumption of sugary drinks increases risk factors for cardiovascular disease in a dose-dependent manner — the more you drink, the greater the risk. The study is the first to demonstrate such a direct, dose-dependent relationship.”

“Nearly a third of the world’s 37 largest aquifers are being drained faster than water can be returned to them, threatening regions that support two billion people, a recent study found.”

“While the risk of suicide by offenders in prison has been identified as a priority for action, understanding and preventing suicides among offenders after release has received far less attention.”

By extracting a random sample of 200 postings and removing those that either had no scientific content or no relationship to any of our main disciplines, the results were that Centroid was the best classifier of the remainder, correctly identifying 56%.   Using Best together with Centroid, the success rate rose to 63%.   Random Forest only correctly identified 16% of the papers because it tended to identify many papers as mathematics if the abstract referred to relative quantities or had terms like “increased risk”.

Final Thoughts

The greatest weakness of the approaches I have outlined above is the fact that the methods do not provide a “confidence” metric for the selections being made. If the classifier has no confidence in its prediction, we may not want to consider the result to be of value.   Rather than putting the document in the wrong category it can be better to simply say “I don’t know”.   However it is possible to put a metric of “confidence” on Best and Centroid because each ranks the final decisions based on the cosine “distance” (actually, for the unit length vectors here, the cosine is really just the dot product of the two vectors).   The nearer the dot product is to 1.0 the closer two vectors are.

For the Centroid method we are really selecting the classification category based on the closest topic centroid to the document vector.   It turns out that for the majority of the public press documents the dot products with the centroids are all very near zero. In other words the document vector is nearly orthogonal to each of the topic centroids. This suggests that we may have very little confidence in the classification of these documents. If we simply label every document with a max dot product value less than 0.04 as “unclassifiable” we eliminate about 90% of the documents. This includes those documents with no clear scientific content and many of those that appear to be totally unrelated to our five theme areas.   However, if we now compute the true positive rate of the centroid classifier on the remainder we get 80% correct.   For Best or Centroid together we are now over 90% and RF is up to 35% (now slightly above a random guess).
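A minimal sketch of that cutoff rule, reusing the assumed centroid structures from the sketch above:

import numpy as np

def classify_with_confidence(x, cents, cutoff=0.04):
    # cents: label -> unit-length centroid vector; x: unit-length document vector
    label = max(cents, key=lambda lab: float(np.dot(x, cents[lab])))
    confidence = float(np.dot(x, cents[label]))   # max dot product over the centroids
    return label if confidence >= cutoff else "unclassifiable"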

We can now ask the question: what max dot product cutoff value will yield a true positive rate of 90% for the centroid method? The table below shows the true positive value and the fraction of the document collection that qualify for dot product values ranging from 0.04 to 0.15 (called the “confidence level” below).

(Table: true positive rate and fraction of documents qualifying, for confidence levels from 0.04 to 0.15.)

Another way of saying this is that if the maximum of the dot products of the document vector with the topic centroids is above 0.05 the chance of a correct classification by the Centroid method is above 90%.   Of course the downside is that the fraction of the popular press abstracts that meet this criterion is less than 7%.

Next Steps

The two great weaknesses of this little study are the small data set size and the narrowness of the classification topics.   The total number of documents used to train the classifiers was very small (about 6,000).   With 60,000 or 600,000 we expect we can do much better.   To better classify the popular press articles we need topics like geoscience, ecology, climate modeling, psychology and public health. These topics make up at least 70% of the popular press articles. In a future version of this study we will include those topics.

Another approach worth considering in the future is an analysis based on the images of equations and diagrams. Scientists trained in a specialty can recognize classes of scientific equations such as those for fluid dynamics, electromagnetism, general relativity, quantum mechanics or chemical reaction diagrams or biochemical pathways, etc.   These images are part of the language of science.   Image recognition using deep neural networks would be an ideal way to address this classification problem.

In the next post I will describe the implementation of the system as a network of microservices in the cloud.   I will also release the code used here for the document training and I will release the data collection as soon as I find the best public place to store it.

Programming the Cloud with Microservices, Docker and Mesosphere

Introduction

The cloud is designed to be a host for applications that are realized as scalable services, for example a web server or the backend of a mobile app. In this case the application accepts random connections from remote clients and, based on the client’s request, it does some computation and returns a response. If the number of clients that are concurrently requesting service grows too large for one server to handle, the system should automatically spawn new instances of the server to share the load. Another scenario is an application that processes events from remote sensors to inform a control system on how to respond.   For example, if geo-sensors detect ground motion tremors occurring in a significant pattern, the system should sound an earthquake warning.   In this case the cloud part of the application may involve multiple components: sensor signal decoders, pattern analysis integrators, database searches, and alarm system interfaces.   Another example is a large-scale data analytics system processing large amounts of text data, for example a search index for the World Wide Web.

Each of these cloud applications illustrates one of three basic paradigms of programming the cloud. They are

Basic web server gateway. A single server that responds to requests from a user. Scaling is accomplished by replication. We will discuss this further in another section on Dockerized Science Gateways.

Bulk Synchronous Parallel methods. This is a very old and respected model for parallel programming characterized by running computation in a sequence of phases where each phase involves a single task that is executed in parallel on multiple servers. When the tasks terminate, a barrier synchronization is used to permute the data so that the next phase of computation can begin.   For example, Spark commands from the Scala REPL or Python API operate on distributed data collections in this manner.   Hadoop Map Reduce is essentially a BSP program.
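A tiny illustration of the style in PySpark (assuming a local Spark installation; this example is not from the original post): each stage runs the same operation in parallel over a distributed collection, and the shuffle between stages plays the role of the barrier.

from pyspark import SparkContext

sc = SparkContext("local[4]", "bsp-example")
lines = sc.parallelize(["big data in the cloud", "data in the data center"])

counts = (lines.flatMap(lambda l: l.split())       # phase 1: map in parallel
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))   # shuffle, then phase 2: reduce

print(counts.collect())
sc.stop()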


An asynchronous swarm of communicating processes distributed over a virtual network in the cloud.   The individual processes may be stateless, such as a simple web service, or stateful, as in the actor programming model.


An excellent example of a tool for programming asynchronous swarms via actors is Orleans from Microsoft Research http://research.microsoft.com/en-us/projects/orleans/.   Orleans has been used internally at Microsoft on some very large scale projects including the game Halo Online and it is now available as open source.

Microservices

Recently this asynchronous swarm style has been rebranded with a new philosophy under the name “Microservices”. The problem addressed by microservices is this: how do we design and build a large, heterogeneous app so that it will be secure, maintainable, fault tolerant and scalable? This is particularly important for very large, on-line services that need to support thousands of concurrent users. The app must remain up twenty-four hours a day but still be maintained and upgraded. The software engineering approach to this problem is called “DevOps” and it integrates product delivery, quality testing, feature development, and maintenance releases in order to improve reliability and security and provide faster development and deployment cycles.   From the programmer’s perspective this is a “you built it, so now you run it” philosophy. Updates to parts of the system occur while the system is running, so there is a very tight integration of the developers with the IT-pros managing the system.

The microservice solution to this challenge is to partition the app into small, independent service components communicating with simple, lightweight mechanisms. The microservice paradigm design rules dictate that each microservice must be able to be managed, replicated/scaled, upgraded and deployed independently of the other microservices. Each microservice must have a single function and operate in a “bounded context”, i.e. having very limited responsibility and limited dependence on other services. When possible one should reuse existing trusted services such as databases, caches, directories, etc. All microservices should be designed for constant failure and recovery. The communication mechanisms used are varied and include REST web service calls, RPC mechanisms like SWIFT and Advanced Message Queuing Protocol (AMQP) systems that are common in IoT applications and are well supported in the commercial clouds.
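As a tiny illustration of that last kind of lightweight messaging, here is a hypothetical sketch of one microservice handing a document record to the next over AMQP; it assumes a local RabbitMQ broker and the pika client, neither of which is specified here, and the queue name is made up.

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="physics-subtopic")   # hypothetical queue name

doc = {"title": "an example abstract", "Scored Labels": "Physics"}
channel.basic_publish(exchange="", routing_key="physics-subtopic",
                      body=json.dumps(doc))
connection.close()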

The microservice philosophy has been adopted in various forms by many companies including Netflix, Google, Microsoft, Spotify, Amazon and others.

Docker, Swarm, Kubernetes and Mesosphere

To build a substantial microservice application one needs to build on a platform that can manage large collections of distributed, communicating services. Thanks to the open source movement we now have an excellent collection of tools for building and managing microservices.   These are

  1. Containers. This revolutionary technology gives us an alternative to deploying services as heavy-weight virtual machines.   “Containers” run on a host OS and use basic services such as kernel namespaces and resource isolation provided by that OS. Docker containers contain all the OS components and libraries needed by the app, various user-added files and apps and instructions for what process to run. Docker exploits the host OS namespaces for isolation and control groups for resource management.   It uses the Union File System so that containers are built as layers on an existing file system. The company Docker.com also provides a registry of dockerized tools and applications that can be downloaded. We show an example of using Docker and the registry in another section.
  2. CoreOS. As noted, the host OS for Docker need not be very heavy because each container will have its own OS features and libraries required by the application. CoreOS is a basic “stripped down” version of Linux that contains only those tools needed to support Docker containers.

The tools for managing clusters of servers hosting containerized microservices are many.   The ones we have tested are

  1. Swarm. Docker.com provides a tool for managing clusters of servers running Docker container instances. We tested Swarm on Azure using the client Docker-machine and it was relatively easy to bring up a cluster of machines and deploy apps.   Swarm shows great promise but it is still beta quality and we found the documentation still rather sparse.
  2. Kubernetes was recently released by Google and is based on the cluster management tools they use. We also found it relatively easy to deploy Kubernetes on Azure using another tool called Weave to support the networking.   Kubernetes has an important feature not present in the other microservice cluster management tools.   In Kubernetes every container lives inside a “pod”.   Pods can contain a single Docker container instance or several instances.   This is important because a single microservice instance may need to collaborate closely with another microservice instance of the same or a different type. For example a Dockerized web server may need a private instance of a Redis cache or SQL database.   These are relatively standard Docker components, so there is no reason to deploy them in your web server container.   Because each pod instance runs on a single server, the containers in the same pod share resources such as private directories. Communication within a server is going to be faster and more reliable than communication across the network.
  3. Mesosphere. One of the many products from the UC Berkeley AMP lab was a distributed computing operating system (dcos) called Mesos.   It has since been spun off as a startup called Mesosphere.com.   Mesosphere installed easily on Azure and it has proven to be extremely reliable.   Because we use it so extensively we have an additional page about it here.
  4. Microsoft Azure Service Fabric. Microsoft recently released a development kit for a microservice orchestration framework that they have been using internally for a while. The SDK is easy to install into Visual Studio and it comes with an emulator so you can start building apps. I have not yet had a chance to try the full deployment.   Mark Russinovich has a nice blog about this and an online video of a talk he gave describing the system.

In the next post, we will describe our experience building a microservice application to do document classification using machine learning tools.   However if you want to get a feeling for how one can use Mesosphere and Python to build simple microservices, then we have a page where we show you all the gory details.  It is called “Fun with Mesosphere and Microservices“.

A Cloud Gathering and Reflections on a Panel.

There are a lot of cloud computing conferences these days. IEEE Cloud 2015 will be held in New York on June 27th, the ACM Symposium on Cloud Computing will be held in August in Hawaii, and various regional conferences and workshops on cloud themes abound. The 2015 IEEE International Conference on Cloud Engineering was held this last week in Tempe, Arizona, USA. This is the third instance of the conference, and there is every indication that it is gaining a reputation as a strong, technically oriented event.

Environmental Sustainability and the Cloud

There were several interesting research trends apparent at this year’s event.   The first was a clear recognition that cloud engineering is much more than the mechanics of data center construction and optimization.   Highlights of the meeting included two panel sessions. One panel, led by HJ Siegel, looked at the issue of sustainability.   My assumption was that this was going to be a discussion of business models for cloud vendors, but the topic that was discussed was much more interesting. Can the expansion of cloud infrastructure keep pace with our exponentially growing demand for new apps and services? Can the planet sustain the growing energy requirements of the cloud?  Or is there a positive side to the story? The data center is just one part of the big cloud of interconnected devices and networks.    The public cloud vendors have been working for several years to build data centers driven by renewable energy such as wind, geothermal, hydroelectric and solar, and many of the newest data centers are carbon neutral.   But the data centers are only one part of the energy used by our ubiquitous, information-rich lives.   Getting data on how much energy our systems use is a non-trivial task. One fascinating report is provided by Greenpeace (April 2014). In terms of shares of the total pie of cloud-related energy consumption, the cloud data centers represent about 20% of the total, the network itself represents about 23%, and the remaining 57% is consumed by our devices.  In terms of total global energy demand the cloud ranks 6th in total use (behind China, the US, Japan, India and Russia, and ahead of Germany, Brazil, Canada, France and the UK). This world-wide cloud generates 2% of global emissions, which is apparently similar to the share produced by the global aviation sector.

There are some surprising observations about the way the cloud uses energy. A McAfee report, “The Carbon Footprint of Spam”, concludes that approximately 62 trillion spam emails are sent each year, consuming 33 billion kilowatt-hours of electric power, which translates into greenhouse gas emissions equal to putting 3.1 million cars on the road burning at least 2 billion gallons of gasoline. Streaming data sources also account for a large part of cloud traffic.   Netflix and YouTube together account for 47% of prime-time internet traffic. According to the Digital Power Group, streaming an hour of Netflix every week uses more power over a year than two refrigerators running for that year. And a growing number of data-collecting cell phone apps are sending vast streams of data back to the app developers. One participant at the meeting referred to this as the “dark matter” of the Internet.

Geoffrey Fox led a second panel at the conference that took up the topic of the Internet of Things (IoT).   It is anticipated that 50 billion devices will be on the network by 2020, a continuation of the exponential growth of the cloud, as shown in the figure below.

[Figure “iot_size”: projected exponential growth in the number of connected devices]

The potential for overloading the communication fabric and adding stress on the environment is there, but there is also great potential for the IoT to help us manage our planet and its resources.   Cities are becoming “smart”, using sensors and instruments to better manage and plan our urban environments, and urban science is growing as a data-intensive discipline. But the IoT goes well beyond our urban areas. Data-gathering instruments in sensitive ecological zones can monitor human impact and alert us when we must take action.   We are starting to see progress on an ocean observing network of sensors that can help us better understand the impact of climate change on aquatic life.   One observation that came up in the panel is that the cloud data centers cannot be the home for all of this IoT data.   We must continue to build data analysis facilities at the edge of the network to do initial analysis and response; large-scale analysis can then be done on summarized data streams in the big data centers.   Elizabeth Kolbert, in her new book, makes the argument that we are in the middle of the sixth great extinction and that human impact is the cause. Unless we can understand the planet, and our impact on it, as a single system, we have little hope of avoiding some very bad outcomes.   The IoT is becoming one of the key instruments needed to understand that system.

Back to Cloud Engineering.

Sustainability and the IoT were certainly not the only topics of the meeting. The conference had very interesting sessions on security, modeling and simulation, and other engineering topics.   In addition, there were co-located workshops on software-defined systems and container technology for clouds.   The conference concluded with a brilliant keynote talk by Douglas Terry on important and practical models of cloud data consistency.

The conference next year will be in Berlin.   It should be fun.

Docker and Containers of Science.

A few years ago we thought that the best way to encapsulate software for deployment in the cloud was to create a virtual machine image and store it in a repository such as Amazon’s image catalog or the Microsoft VM Depot. Then anybody could grab the image and spin it up in the appropriate data center. Unfortunately, virtualization tools are not all the same, so running a VM built for Amazon on Azure or another cloud was a real problem. I sat through endless debates with my science colleagues that usually went something like this: “This is just another form of evil vendor LOCK-IN!” We spent a great deal of time thinking about ways to address this evilness.

While this was going on, other folks realized that the Linux kernel had some cool features, like namespace isolation and control groups, that provide ways to bound the resource utilization of processes. By the middle of 2013 a little company called dotCloud had released a tool that provided a better way to deploy encapsulated applications, and this became Docker and Docker, Inc. Microsoft has also figured out how to do the same thing with Windows and is expected to release a Docker-capable version of Windows Server soon.

What is Docker and why does it change the game? Docker allows applications to be provisioned in “containers” that encapsulate all application dependencies.   The application sees a complete, private process space, file system and network interface that is isolated from applications in other containers on the same host operating system.   Once the basic Docker framework and library are installed on the host OS, loading and bringing up a container stored in the Docker registry requires one command and takes only as long as it takes to download the container image.   Because container images are so much smaller than VM images, the download is very fast, and starting a new instance of a cached container image takes only a few seconds. Docker isolation provides a great way to factor large applications, and there are simple ways for running containers to communicate. (This is a great way to avoid the “jar hell” that was so common in big Java applications when different app components required libraries or environments that were in conflict.) It took only a few mouse clicks to install Boot2Docker on my little Microsoft Surface and another minute to spin up a container with the IPython notebook. (If you try this at home, note that Boot2Docker conflicts with Hyper-V, so you need to turn Hyper-V off.) To make life even easier, another new company, CoreOS, has a stripped-down distribution of Linux that is designed for large-scale cloud deployment and comes with Docker built in.
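For those who prefer to script the notebook spin-up rather than click, here is a hedged sketch using the Docker SDK for Python; the image name jupyter/base-notebook and the port number are illustrative assumptions, not my actual setup.

```python
# Hedged sketch: start a notebook container with the Docker SDK for Python.
# The image name and port number are illustrative assumptions.
import docker

client = docker.from_env()
notebook = client.containers.run(
    "jupyter/base-notebook",        # an assumed containerized notebook image
    ports={"8888/tcp": 8888},       # map the notebook server port to the host
    detach=True,                    # return immediately; the container keeps running
)
print(notebook.short_id, notebook.status)
# The notebook should now be reachable at http://localhost:8888; check
# notebook.logs() for the access token if the image requires one.
```

Stopping the instance is just a call to notebook.stop(), and because the image is cached the next start takes only seconds.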

Docker is designed to support a variety of distributed applications, and it is now widely used in the “Internet industry” (including folks like Yelp, Spotify, Baidu, Yandex and eBay) and supported by the public cloud providers Google, Microsoft, Amazon and IBM.   Loading CoreOS as a VM on Azure was easy, and bringing up Docker instances on it was trivial.

Where’s the Science?

At this point many readers may say, “We know all this. Where have you been for the last year?” The point of this post is to call out the impressive collection of scientific applications that have been containerized. My exploration of this began with a nice chat I had with two bioinformatics researchers who were interested in using a public cloud for their work. I asked them which tools they were using, and when they told me I thought there must be Docker containers for that software. Sure enough, there were.   In fact, new examples are appearing every day.   Here is a short list that I found on the Docker registry.

  • Radio Astronomy tools, including containers for LOFAR, pyimager and meqtrees.
  • Chemical Informatics libraries from the chemtoolbox.
  • Bioinformatics. The ever-popular Galaxy toolkit is available in various forms, and the University of Hamburg genome toolkit is also available.
  • For Mathematics and Statistics there is, of course, R, and Python with NumPy and friends, in various combinations.
  • For Machine Learning there is a complete collection of ML algorithms written in Julia, many versions of Spark, the Vowpal Wabbit tools and the scikit-learn Python tools.
  • For Geospatial data there is a container with GeoServer.
  • For digital archival and data curation there are containers for DSpace and iRODS.
  • The NSF iPlant consortium has developed the Agave science-as-a-service platform, and its various components are now containerized.

Each of these allows you to spin up a running instance of the software in seconds on a Dockerized server such as CoreOS, Ubuntu (after a Docker install) or Windows with Boot2Docker.   So all the problems of science in the cloud are solved. Correct? Of course not. What if you want to run a cluster of Docker containers that share your large workloads? Or run a big Spark deployment?

The challenge with Docker is managing multiple Azure servers, each with multiple Docker containers.   There are four major open-source solutions to this problem.

  • Mesos is a distributed systems kernel for managing large numbers of servers.   Originally developed as part of the Berkeley AMPLab, it is now supported by a startup, Mesosphere.
  • Kubernetes is a technology released by Google and endorsed by Microsoft and others.   Kubernetes is designed to manage groups of containers. It allows users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on.
  • Weave is an overlay network for Docker that allows Docker clusters in different data centers, or even different cloud vendors, to be tied together.   Weave runs on Google and AWS, and it has been ported to Azure by Patrick Chanezon.
  • Swarm is another tool developed by the Docker team to manage a cluster of containers. We know of no ports of Swarm to Azure.

These tools can be used together in various combinations.   In our next post we will report on our experience using these tools to build a cluster application.

One final note.  Docker does not solve the entire vendor lock-in issue.  There is still the data.  Depending upon where you put big data collections, the laws of computational physics will draw your computing to them.  At least Docker makes it easy for that computing to move.

UPDATE:

Henrik Nordborg just pointed me to TheUberCloud, where a very large number of containerized science and engineering applications have been deployed on a number of cloud platforms.  More details later, when I have a better understanding of what they have done.  What I see now looks very impressive.