Category Archives: Uncategorized

Quantum Computing in the Cloud Part 2. A look at Quantum AI Algorithms

In Part 1 of this article we looked at how quantum computers are now beginning to appear as cloud hosted attached processors and we also looked at how one can go about programming them. The most important question that part 1 failed to address is why quantum computers are interesting?   The most well-known uses of a quantum computer are for building interesting quantum chemistry models and for factoring large numbers to break cryptographic codes.   The chemistry application was the original suggestion from Richard Feynman in 1981 for building a quantum computer. The factoring application is important, but it will only remain interesting until the cryptographers find better algorithms.   Another application area that has received a lot of attention lately is applying quantum computing to machine learning (ML). This is not surprising.   (Everything in AI is hot now, so quantum AI must be incandescent.)

Much of the published work on quantum ML is very theoretical and this article will try to provide pointers to some of the most interesting examples, but we will not go deep into the theory.   However there are a few algorithms that run on the existing machines and we will illustrate one that has been provided by the IBM community. This algorithm builds a Support Vector Machines for binary classification and it runs on the IBM-Q system.   More on that in the second half of this article.

The following is a list of papers and link that formed the source for the brief summary that follows.   They are listed in chronological order. Several of these are excellent surveys with lots of technical details.

  1. Lloyd, M. Mohseni, and P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning, https://arxiv.org/abs/1307.0411 (2013)
  2. Lloyd, M. Mohseni, and P. Rebentrost, Quantum principal component analysis, https://arxiv.org/abs/1307.0401 (2013)
  3. Nathan Wiebey, Ashish Kapoor and Krysta M. Svore, Quantum Algorithms for Nearest-Neighbor Methods for Supervised and Unsupervised Learning, https://arxiv.org/abs/1401.2142 2014
  4. Maria Schuld, Ilya Sinayskiy and Francesco Petruccione, An introduction to quantum machine learning, https://arxiv.org/abs/1409.3097v1 2014
  5. Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore, Quantum Deep Learning, https://arxiv.org/abs/1412.3489v2 2015
  6. Verdon, M. Broughton and J. Biamonte, A quantum algorithm to train neural networks using low-depth circuits, https://arxiv.org/pdf/1712.05304.pdf
  7. Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, Seth Lloyd, Quantum Machine Learning, https://arxiv.org/pdf/1611.09347v2.pdf 2018
  8. Seth Lloyd and Christian Weedbrook. Quantum generative adversarial learning, https://arxiv.org/pdf/1804.09139v1.pdf 2018
  9. Gao1, Z.-Y. Zhang, L.-M. Duan, A quantum machine learning algorithm based on generative models, http://advances.sciencemag.org/content/4/12/eaat9004.full, 2018
  10. Dawid Kopczyk “Quantum machine learning for data scientists” 2018

A Tour of Quantum ML Algorithms

The normal breakdown of machine learning algorithms is to divide it into 2 categories: supervised algorithms and unsupervised algorithms. Supervised algorithms include those where a quantity data and the “answers” are known, and we seek to build a model that will allow us to consider previously unseen data. These include simple regression, support vector machines (which we will consider in detail later), neural networks, restricted Boltzmann machines and deep learning classifiers. Unsupervised algorithms include clustering, dimensionality reduction and methods such as principle component analysis, and generative adversarial networks.

Because most of the work on quantum ML is theoretical the measure of success is based on complexity theory and the holy grail for quantum algorithms is a proof of exponential speed up of the quantum algorithm over the best sequential counterpart.  In the case of the classical ML algorithms that run in polynomial time (as a function of input data size), an exponential speedup for a quantum algorithm would be one that runs in polylogarithmic time or less.   But if the data input must be loaded into the quantum computer, then any sequential read of the data would require linear time and we cannot boast true exponential speed-up. One way around this is to assume the data has been preloaded into a “quantum random access memory” (qRAM). An alternative is to consider problems where the data is generated by an internal algorithm or a natural process.

A commentary in Nature by Scott Aaronson entitled “Read the fine print” (2015) points out that there are several difficulties with many quantum algorithms. These include the transformation of the data vector of dimension N into to log(N) qubits, the fact that the matrices must be well conditioned and the fact that the vector outputs may be difficult to read if they contain components that differ greatly in scale. In general, the “Quantum Supremacy” of a quantum algorithm may be theoretically true but it may be lost to fast classical algorithms in real applications if nasty constant factors dominate.

Looking at a Few Algorithms

A frequent assumption about data is that it is composed of vectors of dimension N=2n. These may be real or complex values.   Let v = {v1, v2, … , vN} be such a vector, then we assume we can imbed this into n qubits as

svm-01

If this can be done, it has some nice properties that can be used in some important quantum subroutines in the quantum ML algorithms.

K-Means Algorithm

K-means is one of the standard classic ML unsupervised algorithms.   It uses the vector-to-quantum vector map described above. If we define a second vector w in a similar way and then define

svm-eq02

A bit of algebra will verify that

svm-eqn03

This implies we can compute Euclidian distance if we can compute |<ϴ | φ > |2 .   It turns out there is a simple quantum circuit, called the swap-test that can help us estimate this value.

Given this distance algorithm and another standard subroutine called Grover’s algorithm (that allows us to pick the best from alternatives) we can build a K-means algorithm.   The quantum advantage of this algorithm only show up when N is large because the complexity of the classical algorithm with M data points is O(MN) and with the quantum algorithm it is O(M log(N)).

Principal Component Analysis

PCA is based on finding the eigenvectors of the covariance matrix of a set of input vectors.   The quantum version of this algorithm uses an important subroutine called quantum phase estimation, which is a method to find the eigenvalues of a unitary matrix.   For unitary matrices the eigenvalues are always of the formsvm-eq04

The algorithm computes the phase ϴ.

For PCA in general, we start with a set of vectors {vi , i = 1, M}. The jth component of vi can be thought of jth feature of the vector. To simplify the discussion let’s assume that mean of this feature is 0 for all M vectors. In otherwords, set vi = vi – 1/M*sum(vi, i=1,M).  We are looking for eigenvalues and eigenvectors of the matrix C = Vt * V where V is the NxM matrix consisting the vectors {vi , I = 1, M}. Another way to describe C is by the fact that it is NxN with Ci, j = sum( vki* vk j , k =1,M).   C is M times the covariance matrix for our set of normalized vectors. The eigenvectors of C form a new basis that diagonalizes C with the eigenvalues on the diagonal.

In the quantum algorithm we use the map of our vector data into the quantum space as abovesvm-01

Where we have also normalized each vector to unit length. Now we build the Hermitian operatorsvm-eq05

Expanding out the tensor product from the embedding representation we see this is precisely the quantum version of our covariance matrix. We can now turn this into a problem of finding the eigenvalues of a unitary matrix with the formula

svm-eq06

The algorithm now can use phase estimation and other tricks to get to the eigenvectors and values.   For the serious details see reference [7] and [10] above.

Neural Networks: RBMs and GANs

Paper [5] and [7] discuss approaches to quantum versions of restricted Boltzmann machines (RBMs).   Unlike the standard feed-forward deep neural network used for classification tasks, RBMs are a type of generative neural network. This means that they can be used to build a model of the statistical distribution of observed data. Sampling from this distribution can give you new examples of “similar” data objects. For example, a picture of a human face that is like others but different, or a potential chemical compound for drug discovery research. The simplest form of a RBM is a two layer network with one layer which is the visible nodes (v1, v2, .. vm) and the other is called the hidden layer of nodes (h1, h2, …, hn). Each visible node is connected to each hidden node as shown in the figure below.

rbm-1

We assume that all the data are binary vectors of length m. We use each data sample to predict values for the hidden state using a sigmoid function and a set of weights wi,j connecting visible node i to hidden node j and an offset ai. What results is a conditional probably for h given v.   Then given a sample of hidden states we project them back to the visible layer by a similar conditional probability function.

rbm-2

The joint probability for v and h is given by

rbm-4

Where the energy function is given by

rbm-3

Where Z is the normalizing sum.   This distribution is the Boltzmann or Gibbs distribution.   There are several quantum solutions.   One is to us simulated annealing which can be done on a machine like the D-wave system. The other approach is to exploit a technique known as Gibbs sampling. One can use the quantum computers can draw unbiased samples from the Gibbs distribution, thereby allowing the probabilities P(v,h) to be computed by quantum sampling. It seems that this can be done with a relatively small number of qubits, but the reader should consult [5] to get the full picture.

Paper [9] considers the problem of creating quantum generative systems by starting with factor graphs and then then construct a quantum state that represents the distribution associated with the graph. They then show that this is conceptually more powerful that then factor graph approach.

RBMs are one of the original forms of neural networks. (One survey article calls them the “model T” of networks.) RBM are not used very often because other generative methods seem to work better.   One such better method is the Generative Adversarial Network (GAN), in which two networks compete. We discussed GANS in a previous post. One network called the generator and its job is to generate sample data that will fool the other network, the discriminator. The discriminator’s job is to tell the real data samples from the fake.   When the system converges the discriminator is correct half the time and it is wrong half of the time.

In paper [8] the authors generalize this to the quantum case by proposing three variations: both the discriminator are quantum systems, the discriminator is quantum and the generator is classical and the configuration where the discriminator is classical and the generator is quantum.   They argue that for high dimensional data the quantum-quantum case may give exponential speed-up.

An Example: Support Vector Machines

To illustrate a simple quantum machine learning algorithm that runs on the IBM system we turn to a “classic” machine learning method known as Support Vector Machine (SVM).   In general, an SVM can be used as a binary classifier of measurements of experiments where each experiment is represented by its features rendered as vector or point in Rn.   For example, these might be measurements of cell images where we want to identify which cells are cancerous or it may be astronomical measurements of stars where we seek to determine which have planets.  To understand the quantum version of SVM it is necessary to review the derivation of the classical case.  In the simplest cases we can find a plane of dimension Rn-1 that separates the vectors (viewed as points) into two classes with the class on one side of the plane having the property we want and the class on the other side failing. These points with their label constitute the training set for the algorithm. In the case that n = 2, the plane is a line and the separator may look like figure 1a.   The hyper plan can be described in terms of a normal vector w and an offset b. To determine if a point X in Rn is one side or the other of the plane, we evaluate

svm-eqn1

If f(X) > 0 then the point is on one side and if f(X) < 0 the point is on the other side.

svm-pict

Figure 1. a) on the left is a linear separator between the red and the blue points. b) on the right is an example without a linear separator.

The name “support vector” comes from the optimal properties of the separating hyperplane.   The hyperplane is selected so that it maximized the distance from all the training examples. The distance of a point to the hyperplane is the minimum distance to the hyperplane. This minimum distance is on a vector from the plane to the point that is parallel to the normal vector of the plane.   Those points whose distance is the minimum are called the support vectors. Because the plane maximizes the distance from all points, those minima must be equal. A bit of algebra will show the normal distance between the nearest points on one side of the plane and the nearest point on the other side is 2/||w||. (Hence to maximize the distance we want to minimize ||w||2/2.) Let {xi , yi} I = 1,N be the training points where yi = -1 if xi is in the negative class (making f(xi) < 0) and yi = 1 if xi is in the positive class ( f(xi) > 0) .  To compute the values for w and b we first note that because f(X) is only + or -, we can scale both w and b by the same factor without changing the sign.   Hence, we can assume

svm-eq2

where equality happens only for the support vector points. To solve this problem of maximizing the minimum distance we are going to need to invoke a technique from math where we insert some new variables αi (known as Lagrange multipliers) and then compute

svm-eq3

As stated above we need minimize ||w||/2 so we can minimize its square.   Looking at the min term, we must compute

svm-eq4

Taking the derivative with respect to w and b we get the minimum when

svm-eq5

Substituting these into L we see we must compute

svm-eq6

A look at L shows that when xi is not a support vector then the max cannot happen unless αi = 0. Numerically solving for the remaining αi’s we can compute b from the fact that for the K support vectors we have

svm-eq6b

The matrix

svm-newk

Is called the kernel matrix. The function f(X) now takes the form

svm-eq8

The Nonlinear Case and the Kernel Trick

There are obviously many cases where a simple planar separator cannot be found. For example, Figure 1b contains two classes of points where the best separator is a circle.   However, that does not mean there is no interesting combination of features that can be used to separate the classes with a plane if viewed in the right way.   Let ϕ: Rn -> M be a mapping of out feature space into a space M of higher dimension. The goal is to find a function that spreads the data out so that we can apply the linear SVM case.   All we require of M is that it be a Hilbert space where we can compute inner products in the usual way. If we define the function K(X,Y) = (ϕ(X) . ϕ(Y)) then our function f(X) above becomes

svm-eq9

Where the ais and b are computed in the space M.   To illustrate this idea let ϕ: R2-> R3 be given by the mapping ϕ(X) = (X[0], x[1], 3 – 0.35*||X||2) and look at the data from figure 1b, but now in 3D (figure 2). As can be seen the mapping moves the data points onto a paraboloid where the points inside the circle are in the positive half-space and the rest are below. It is now relatively easy to find a planar separator in R3.

svm-3

Figure 2. Kernel mapping R2 to R3 providing a planar separator for the data in Figure 1b.

When does this kernel function trick work? It turns out that it almost always does provided you can find the right mapping function.   In fact the mapping function ϕ is not the important part.   It is the function K(X,Y).   As long as K is symmetric and positive semi-definite, then there is a function ϕ such that for every X and Y, K(X,Y) = (ϕ(X) . ϕ(Y)). But from the function f(X) above we see that we only need K and the inner product in M.   As we shall see below M may be derived from quantum states.

Quantum Support Vector Machines.

We will look at the results of using a quantum SVM algorithm that run on the IBM quantum hardware. The complete mathematical details are in a paper by Vojtech Havlicek, et. al. entitled “Supervised learning with quantum enhanced feature spaces“ (https://arxiv.org/abs/1804.11326). They describe two algorithms. One of the algorithms is called a variational method and the other is a direct estimation of a quantum version of a Kernel function K(x,z).   Because the later method is a tiny bit easier to explain we will follow that approach. There are two parts to this. We will work with examples with data in R2.

  1. First we construct a function φ from R2 into the space of positive semidefinite density matrices with unit trace.   We need this function to be hard to compute classically so we can preserve the “quantum advantage”. We can then create a 2 qubit function from R2 as
    svm-eq10

The transform Uφ(x) is the key to embedding out training and test data into 2 qubits. The exact definition involves selecting a set of nonlinear functions ϕS(x): R2 -> R where S ranges over the subsets of the set {1,2}. Given these functions, the unitary function Uφ(x) is defined as

svm-eq10a

Were Zi is the Pauli Z operator on the ith qubit.

To create our kernel function we look at

svm-eq11

  1. This is almost the kernel, but not quite. What we define as the K(x,z) is related to the fidelities between x and z. To get that from the transition amplitudes above we measure this R times and if R is sufficiently large it will give us a good estimate.

Once we have computed  Ki,j = K(xi, xj) for the training set we can now use the “classical” computer to calculate the support vector coefficients ai and offset b.   To make predictions of f(Z) we now use the quantum computer to calculate K(xi , Z) for each of the support vectors and plug that into the formula above.   The exact mathematical details for deriving φ are, of course, far more complicated and the reader should look at the full paper.

This example quantum support vector machine is available on-line for you to try out on IBM’s system.   The code is simple because the details of the algorithm are buried in the qiskit.aqua libraries. To run the algorithm, we create a very small sample dataset from their library “add_hoc_data” and extract training and testing files.

qsvm-code1

There are two ways to run the algorithms. One way is a 3-line invocation of the prepackaged algorithm Below is the “programmers” which shows a few more details. This shows that we are using the qasm_simulator with 2 qubits where qubit 0 is connected to qubit 1 on the simulated hardware (this reflects the actual hardware).   We create an instance and train it with 1024 “shots” (R above).

qsvm-code2

We can next print the kernel matrix and the results of the test.

qsvm-code3

Using the programming method, we can directly invoke the predict method on our trained model. We can use this to show a map of the regions of the quantum support “projected” to the 2-D plane of the sample data. We do this by running the predict function on 10000 points in the plane.   And plot this as a map and then add the training points.

qsvm-code4

The resulting image is below. As you can see, the dark blue areas capture the orange data points and the lighter orange areas capture the light blue data points.

qsvm-code5

It is interesting to compare this result to running this with the Scikit SVM library. Using the library is very simple.   Converting the data set from the quantum algorithm to one we can give to the Scikit library as vectors X and Y, we have

qsvm-code6

The Kernel function in this case is one of the standard kernels: RBF. Plotting the projection of the support surface wit the 2-D plain we see the image below.

qsvm-code7

The match to the training data is perfect, but in this case the accuracy is only 0.6. We tried two additional test.   The first was a simple linear partition along the line y = 0.6*x-0.2 with 20 points above the line and 20 below. In this case the quantum computation did not as well at first, but after several attempts with different data sets, it achieved a score of .95 and the Scipy RBF kernel also got a score of 0.95. The figure below illustrates the regions captured by both algorithms.   We also used another example from their data collection consisting of breast cancer cases.   The data was relatively high dimensional, so they projected it onto the 2 principal axes.   In this case the quantum and the Scipy RBF both achieved an accuracy of 0.9.

quantum-svm2

After numerous experiments we found that the quantum algorithm was unstable in that there were several cases that caused it to fail spectacularly (accuracy = 0.4). However, when it worked it worked very well. The experiments above were with very tiny data sets that were selected so the results could be easily visualized. A real test would require data sets 100 to 1000 time larger.

Conclusions

Quantum computers are now appearing as cloud-based resources and when used with algorithms that exploit the quantum subsystem and classical computer working together, real breakthroughs may be possible. In the area of AI and machine learning, the current work is primarily very theoretical, and we hope that we have given the interested reader pointers to some of the recent papers. We took a deep dive into one classical algorithm, support vector machines, and illustrated it with a code that runs on the IBM-Q system. Our results were impressive in terms of accuracy, but we did not see speed-up over the classical algorithm because of the small size and dimensionality of the data sets.

The current crop of algorithms will need improvements if they are to show substantial speed-up on real world problems. In particular, the mapping from real data to quantum states will need to improve. But this will be an area where we can expect to see substantial investment and research energy over the next few years.

Quantum Computing and the Cloud

Over the last five years cloud computing systems have evolved to be the home to more than racks of servers.   The need for specialized resources to for various classes of customers has driven vendors to add GPU server configurations as a relatively standard offering.   The rise of Deep Learning has seen the addition of special hardware to accelerate both the training and inference phases of machine learning.  This include the Google Tensor Processing hardware in the Google cloud and the FPGA arrays on Microsoft’s Azure.   Quantum computing has now moved from theory to reality.  Rigetti, IBM, D-Wave and Alibaba now all now have live quantum computing services in the cloud.   Google and Microsoft will follow soon.    In the paragraphs that follow we will dive a bit deeper into several of these services with illustrations of how they can be programmed and used.   In this article we will look at two different systems: the IBM-Q quantum computer and it software stack qiskit and the Microsoft Q# quantum software platform.  We could have discussed Rigetti which is similar to IBM and D-Wave but it is sufficiently different from the others to consider it elsewhere.   We don’t have enough information about Alibaba’s quantum project to discuss it.

The Most Basic Math

As our emphasis here will be on showing you what quantum programs look like, we will not go into the physics and quantum theory behind it, but it helps to have some background. This will be the shallowest of introductions.  (You will learn just enough to impress people at a party as long as that party is not a gathering of scientists.)  If you already know this stuff or you only want to see what the code looks like skip this section completely.

There are many good books that give excellent introductions to quantum computing.  (My favorite is “Quantum Computing: A Gentle Introduction” by Rieffel and Polak.)    We must begin with qubits: the basic unit of quantum information.   The standard misconception is that it is a probabilistic version of a binary digit (bit).   In fact, it is a two-dimensional object which is described as a complex linear combination of two basis vectors |0> and |1> defined as
eq1Using this basis, any qubit  |Ψ> can be described as a linear combination (called a superposition) of these basis vectors
eq2where α  and β are complex numbers whose square norms add up to 1.    Because they are complex this means the real dimension of the space is 4 but then the vector is  projected to complex projective space so that two qubit representatives |Ψ> and |г>  are the same qubit if there is a complex number c such that

eq3A slightly less algebraic representation is to see the qubit |Ψ> as projected onto the  sphere shown on the right spherewhere |Ψ> is defined by two angles where

   eq4

Qubits are strange things that live in a world where we cannot know what the parameters α and β are unless we attempt to measure the qubit.  Measurement can be thought of as projecting the qubit onto a special set of basis vectors and each device has its own set of basis vectors.  Here we will assume all our measurements are with respect to the standard basis |0> and |1>.   Measuring a qubit changes it and results in a classic bit 0 or 1.   However, the probability that it is a 0 is || α ||2  and the probability we get a 1 is || β ||2.   After a qubit has been measured it is projected into one of the basis vectors |0> and |1>.

Basic Qubit Operators

In addition to measurement, there are a number of operators that can transform qubits without doing too much damage to them.  (As we shall see, in the real world, qubits are fragile.  If you apply too many transformations, you can cause it to decohere: and it is no longer usable.   But this depends upon the physical mechanism used to render the qubit.)  The basic transformation on a single qubit can all be represented by 2×2 unitary matrices, but it is easy to just describe what the do to the basis vectors and extend that in the obvious way to linear combinations.

  • X is the “not” transform.   It takes |0> to |1> and |1> to |0>.   By linearity then X applied to any qubit is
    eq5
  • H the Hadamard transform takes |0> to 1/√2(|0> + |1>) and |1> to  1/√2(|0> – |1>).   This transform is used often in quantum programs to take an initial state such as |0> to a known mixed state (called superposition).

Multiple Qubits

Quantum computing gets interesting when multiple qubits interact.   The way we describe the state space of two qubits is with the tensor product.   For two qubits we can describe the state of the  system in terms of the basis that is just the tensor product of 1-qubit basis:

eq6

So that any state of the 2-qubit system can be describe as a linear combination of the form

eq6-5JPG

For three qubits the eight basis elements are |000> through  |111>.   The real dimension of this 3 qubit space is 23= 8.  There are deep mathematical reasons for why the tensor product is the correct formulation rather than the direct sum of vector spaces as in classical mechanics, but the result is profound.   The real dimension of an n-qubit system is 2n.  When n = 50 the amount of standard computer memory  required to store a single qubit is 8*250 = 16 petabytes (assuming 2 8-byte floating point values per complex coefficient).   Hence there are limits to the size of a quantum systems we can simulate on a classical computer.   We now have functional quantum computers with 20 qubits and we can simulate 32 qubits, but it may take 50 qubits or more to establish the more promising advantages of quantum computing.

For two qubits there is a standard operation is often used.  It is the

  • Cnot Controlled Not. This operation is so called because it can be thought of as using the first qubit (left most) in a pair to change the value of the second.   More specifically, if the first bit of the pair is 0 then the result is the identity op:
    Cnot(|0x>) = |0x>.   If the first bit is one then the second bit is flipped (not-ed):
    Cnot(|10>) = |11>,  Cnot(|11>) =   |10>.

One of the most interesting things we can do with these operations is to apply the H operator to the first qubit and then Cnot to the result.    In tensor product terms applying an operation H to the first qubit by not the other is to apply the operator H⊗Id to the pair, where Id is the identity operator.   We can now compute Cnot(H⊗Id)(|0>|0>) as

eq7

The result B =1/√2(|00> + |11>) is called a Bell state and it has some remarkable properties.   First it is not the product of two 1-qubit vectors.  (Some easy algebra can prove this claim.)  Consequently B is a qubit pair that is not the simple the co-occurrence of two independent entities.  The pair is said to be entangled.   Information we can derive from the first qubit can tell us about the second.   If we measure the first qubit of the pair we get 0 or 1 with equal likelihood.  But if it is 0 then M(B) is transformed to |00>.   If we get 1 M(B) becomes |11> .  If it is |00>, measuring the second bit will give 0 with 100% certainty.  If it is |11>, we will get 1 for the second bit.   As this is true even if the two qubits are physically separated after they have been entangled, the fact that measurement of one qubit determines the result of measuring the second leads to amusing arguments about action at a distance and quantum teleportation.

We now have enough of the math required to understand the programs that follow.

IBM-Q

The IBM system is real and deployed on the IBM cloud.  The core computational components are made up of superconducting Josephson Junctions, capacitors, coupling resonators, and readout resonators.  As shown in Figure 1, the induvial qubits are non-linear oscillators.  Tuned Superconducting resonator channels are used for readout.  Also, the qubits are tied together by additional superconducting channels.

IBM has several deployments each named after one of their research and development Labs.   They include:

  • IBM-Q Tokyo. 20 qubits available for IBM clients
  • IBM-Q Melbourne 14 qubits and available for public use.
  • IBM-Q Tenerife 5 qubits available for public use.
  • IBM-Q Yorktown 5 qubits available for public use.

ibm-q-chip

Figure 1.   A 5-qubit IBM-Q computational unit.  Source: IBM-Q website.

In addition, they have a very large simulator in the cloud, IBM-Q QASM_simulator, with 32 qubit capability.   There is much more to the architecture of a complete quantum system.  Two big challenges.  First the qubit devices must be cryogenically cooled and howare how do you connect a system running at 15 millikelvins to a room temperature computing environment and how do you minimize noise to reduce errors.   As shown in Figure 2, it takes several thermal layers and superconducting connections to make it happen.

Quantum_Leap_Supercomputer_Graphic_Online_Final_V11

Figure 2.  System Architecture of IBM-Q quantum architecture.  Source: IBM Research

Signing up to use the IBM-Q pubic systems is extremely easy.   Go to the qx community page and sign-up and you can get an access code.  The best way to use it is with Python Jupyter so download the most recent Anaconda distribution.

The IBM-Q qiskit software stack is how we program the system.   Ali Javadi-Abhari and Jay M. Gambetta have an excellent series of articles on Medium describing it.    The current state of the art in the hardware is what they call “noisy intermediate-scale quantum computers (NISQ)”.   The software stack is designed to allow researchers to explore several levels of NISQ computing.  There are for components.

  • Qiskit Terra. This is the core software platform containing all the Python APIs for describing quantum circuits and the interface to submit them to the hardware and simulators.   There are many qubit operators in Tera.    See this notebook for a good sample.
  • Qiskit Aqua. This is the high-level application layer.   It contains templates for building advanced applications in areas including chemistry, optimization and AI.   (We will discuss this in more depth in part 2 of this article.)
  • Qiskit Aer. Qiskit has several different simulators that run in the cloud or locally on your laptop.
    The unitary_simulator.   The standard operations on a quantum circuit are all unitary operations.   The unitary_simulator computes the result of computation and displays the result as a complex unitary matrix.
    The statevector_simulator allows you to initialize a mullti-qubit to an arbitray unit combination of the basis vectors and it will do the simulation in terms of the state vectors.
    The qasm_simulator provides a detailed device level simulator that takes into account the fact that the hardware is noisy.   We will illustrate this simulator below

A Quantum “Hello World” example.

Qiskit is a programming system based on compiling Python down to the basic assembly language of the IBM-Q hardware (or to the format needed by one of the simulators).   For a simple “hello world” example we will use the simple demonstration of entanglement describe in the mathematical introduction section above.   We start with two qubits in |0> state and apply the H transform to the first and then the controlled not (Cnot) operation to the pair.   We next measure the first qubit and then the second.  If they have become  properly entangle then if the first was measured at a 1 then the second will be a 1.  If the first is a 0, then the second will be measured as a zero.

This code is based on the sample by Jay Gambetta and Ismael Faro.   Using Jupyter we start by loading the libraries we will use and then load our account information (this was established earlier on the local machine earlier using the IBMQ.save_account(‘…. Key …’) operation.  With the account information we can inquire about the backend quantum systems we are allowed to use.    There are three: two are hardware  and one is the cloud-based qasm-simulator.

qiskit-jupyter1

In line [4] we ask for the least busy of machines and it is the 16 qubit system.

The next step is to define the program.   We declare an array of 2 qubit registers and 2 classical 1- bit registers.   As shown below, we create a circuit consisting these two resources.   We apply the H operator to the first qubit and the Cnot to the first and the second.   Finally, we measure both.

qiskit-jupyter2

We have not executed the circuit yet, but we can draw it.   There is a standard way quantum circuits are drawn which is similar to a musical score.   The registers (both quantum and classical) are drawn has horizontal lines and operators are placed on the lines in temporal order from left to right.

As you can see, the H gate is simply represented as a box.   The controlled not consists of a dot on the “control” qubit and a circle on the “controlled” instance.  Recall that the value of the control determines what happens to the second qubit.  Of course, we can’t know the result until we do the measurement and that is represented with a little dial and a line to the output classical bit.

We next execute our circuit on the ibmq_16_melbourne hardware.   We will run it 1000 times so we can get some interesting statistics.

qiskit-jupyter2-6

The execute command is a type of “future” call.  It returns a placeholder for the result.   The actual job will go into a queue and we can monitor the status.   When complete we can get the data. The data is always returned from the experiment as a Python dictionary with keys labeled 0x0 for the basis |00>, ox1 for the basis vector |01> and 0x2 for |10> and 0x3 for |11>.

As we can see from the results below, the hardware is indeed noisy.   The two bits are corelated 92.1% of the time.   We can also plot a histogram to see the results.

qiskit-jupyter3

Given that we should have created the Bell state 1/√2(|00> + |11>)  with our two qubit operations there should be no |01> or |10> components. But the systems are noisy, and the measurements of qubits produce results that defined by probability distributions, the outcome should not surprise us.  Even the initial state of the qubit may be |0> with only 99.9% accuracy.  That means it is occasionally |1>.

The IBM-Q qasm_simulator can provide a model of the execution at the device level.   To use it we extract data about the device we are going to simulate.     We can get very low-level details about the device and measured noise characteristics.   We can also get the coupling_map that tells us how the individual qubit cells are connected on the chip.

qiskit-jupyter4

Using this device data we can invoke the simulator.   As can be seen below, we get results that are very similar to the actual experiment.

qiskit-jupyter5

qiskit-jupyter6

If we had used one of the other simulators we would see only the theoretically perfect results.

Microsoft Q# Quantum Programming Toolkit

Perhaps the greatest challenge when building a quantum computer is designing it so that it is stable in the presence of noise.   Qubits that are too fragile will experience decoherence if they are subject to prolonged episodes of noise while they are undergoing the transformations required by a quantum algorithm.   The depth of a quantum algorithm is the count of the longest path of operations in the circuit.   Based on the intrinsic error characteristics of the devices and the noise there may be a limit of  a few tens of thousands in the circuit dept before decoherence is likely.   For some algorithms, such as those involving iteration, this limit may make it unusable.  One way to solve this is by introducing error correction through massive redundancy.

 Microsoft has been taking a different approach to building a qubit, one that, if successful, will yield a much more robust system without as much need for error correction.  Called a topological qubit, it Is based on different physics.

Topology is the branch of mathematics that is concerned with the properties of objects that are not changed when they are perturbed or distorted.   For example a torus is a 2-dimenional object that cannot be deformed into a sphere without ripping the surface of the torus.  But the torus can be deformed into various other shapes, such as the surface of a coffee mug with no such tearing.   Or consider points on a 1-D line as beads on a string.  We cannot change their  order unless we can move them from the one dimensional line to a second dimension and back to the line.  braidedThis braded structure is a topological constraints that is a global property and therefor very robust.  If you can make a qubit from this property it would impervious to minor noise.

In condensed matter physics the 2016 Nobel prize was awarded to David Thouless, Duncan Haldane and Michael Kosterlitz for their work understanding strange behavior of matter when restricted to thin films.  Their discovery demonstrated that the behavior had to do with the topology of 2-D surfaces.   A similar discovery had to do with chains of atoms on a thin (1-D) , superconducting wire.  The properties of the pair of objects at the ends of the chain were tied to the whole of the chain and not subject to minor local perturbations.  Microsoft uses a similar idea to construct their “topological qubits” made from spitting electrons to form “Majorana fermion quasi-particles”.   Situated at the opposite end of topological insulators they are highly noise resistant.   This implies that one does not need massive redundancy in the number of qubits required for error correction that is needed for many other qubit models.

Of course the above description does not tell us much about the exact nature of their process, but several interesting theory papers exist.

The Q# programming environment.

The first iteration of a quantum computing software platform from Microsoft was based on the F# functional programming language in 2014 and called  LIQUi|> (see Wecker and Svore).  The current version is based on C# and is nicely embedded in Visual Studio and Visual Studio Code.   You can  download it here for Windows 10, Mac and Linux.  The installation straightforward.  There is also a Python binding but we will look at the Visual Studio version here.

The Q# programming language is designed as a hybrid between quantum operations on qubits and classical procedural programming designed to operate on a digital computer that contains the quantum device as a co-processor.   Q# extends C# by introducing a number of new standard types including Qubit, Result (the result of a qubit measurement), unit (indicating an operator returns no result) and several additional operators.   A complete description of the operational semantics is defined on the Q# link above.   For our purposes here any Java or C# program should be able to follow the code.

There are two standard libraries and a set of research libraries.  The standard libraries include

  • Prelude: the collection of logic, libraries and runtime specific to a particular quantum computer architecture.
  • Cannon: The hardware independent library of primitive operator that can be used as part of quantum algorithm design.

There is also a set of excellent standard libraries that include important topics like amplitude magnification, quantum Fourier Transforms, iterative phase estimation and other topics more advanced than this article can cover.   There is also a set of research libraries with a focus on applications in Chemistry and Quantum  Chemistry.

 

Hello World in Microsoft’s Q#

We begin by creating a new C# project with the “file->new->project” menu.  If Q# has been correctly installed, you can select “Q# application” from the list of C# configurations and fill in the name for the project.  In our case we are using “BellTest” and the system now shows the following.

q#1

This is almost exactly what you will find in the introduction when you download the kit. The file called operations.qs contains the main part of our algorithm.  We will have two operations, one for initializing a quantum variable and one for the bulk of our algorithm.

q#2

The operation Set takes a qubit and a desired value (0 or 1) and sets the qubit to have that value.  It does this as follows.  First we measure the qubit.  Recall that measurement (in the standard basis) returns a 0 or a 1 and (in the standard basis)  projects the qubit to either |0> or |1>.   If measures 0 and a |1> is desired, we use the Not gate (X) to flip it.   If a 1 is measured and 1 is desired no change is made, etc.

The operation BellTest takes an iteration count and an initial value we will use for qubit 1.   The code is essentially identical to the example we described for qiskit.   We start with an array of 2 qubits.  We set qubit 0 to |0> and qubit 1 to |initial> .   Next we apply H to qubit 0 and Cnot to the pair as before.  We measure qubit 0 and compare it to the measurement of qubit 1.   If they agree we increment a counter.  If the measurement is 1 we also count that.  This process is repeated count time and returns our counter values.

q#3

The main program calls BellTest two different ways: once with the initial value for qubit 1 to be 0 and one with the value of qubit 1 to be 1.   In the case that both qubits are initialized to 0 we saw from the mathematical introduction that the state after the Cnot should be the Bell state 1/√2(|00> + |11>).  Consequently, after the measurements both qubits should always agree: if one is 0 the other is 0 and if one is 1 the other is 1.   However if qubit 1 is initialized to 1 the situation is different.  Evaluating the state mathematically we get

eq8

Hence, we should see that the two qubits never agree when measured.  The main program that drives this experiment is

q#4

which runs BellTest 1000 times with each configuration of qubit 1.   The results are below.

q#5

These results agree with the theoretical result.   The only probabilistic effect is the count in the number of zeros/ones measured.   Unlike the Qasm_simulator, there was no noise introduced into the initialization because the Q# simulator was not modeling a specific hardware configuration.

Conclusion

The near-term future of quantum computers will be as co-processors tightly integrated as cloud services.  While IBM is now ready to start selling there systems into private clouds,  others like Google and Microsoft will probably stay with a cloud service offering.

This short paper was intended to illustrate two approaches to programming quantum computers.  It is certainly not sufficient to begin any serious quantum algorithm development. Fortunately, there is a ton of great tutorial material out there.

There is a great deal of exciting research that remains to be done.  Here are a few topics.

  1. Quantum Compiler Optimzation. Given the problem of qubit decoherence over time, it is essential that quantum algorithms terminate in as few step as possible.  This is classic compiler optimization.
  2. Efficient error correction. If you have a great quantum algorithm that can solve an important  problem with 100 qubits, but the error correction requires 1000 qubits, the algorithm may not be runnable on near term machines.
  3. Breakthrough algorithmic demonstrations. “Quantum supremacy” refers to concrete demonstrations of significant problem solutions on a quantum computer that cannot be duplicated on a classical machine.   In some cases this is argued to be algorithms that are “exponentially” faster than the best classical algorithm.  However, good quantum algorithms may lead to the discovery of classical algorithms that are quantum inspired.  For example, here is one for recommendation systems.

Part 2 of this report will address the application of quantum computing to AI and machine learning.  This is both a controversial and fascinating topic.

Julia Distributed Computing in the Cloud

Abstract.

This brief note is intended to illustrate why the programming language Julia is so interesting to a growing number of computational and data scientists.  Julia is designed to deliver high performance on modern hardware while retaining the interactive capabilities that make it well suited for Jupyter-style scientific exploration. This paper illustrates, with some very unscientific examples, how Julia can be deployed and with Docker and Kubernetes in the cloud.

Introduction

In all our previous posts we have used Python to build applications that interact with cloud services.  We used Python because everybody knows it.  However, as many scientists have now discovered, for scientific applications the programming language Julia is a better alternative.  This note is not a Julia tutorial, but rather, it  is intended to illustrate why Julia is so interesting. Julia was launched in 2012 by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman and is now gaining a growing collection of users in the science community.  There is a great deal of Julia online material and a few books (and some of it is up to date).   In a previous blog post we looked at Python Dask for distributed computing in the cloud.   In this article we focus on how Julia can be used for those same parallel and distributed computing tasks in the Cloud. Before we get to the cloud some motivational background is in order.

1.    Julia is Fast!

There is no reason a language for scientific computing must be as dull as Fortran or C.   Languages with lots of features like Java, Python and Scala are a pleasure to use but they are slow.   Julia is dynamic (meaning it can run in a read-eval-print-loop in Jupyter and other interactive environments), it has a type system that support parametric polymorphism and multiple dispatch.   It is also garbage collected and extremely extensible.   It has a powerful package system, macro facilities and a growing collection of libraries.   It can call C and Python directly if it is needed.

And it generates fast code.   Julia uses a just-in-time compiler that optimized your program depending on how you use it.    To illustrate this point, consider a simple function of the form

function addem(x)
   # do some simple math with x
   x += x*x
return x
end

Because we have not specified the exact type of x, this defines a generic function: it will work with arguments of any type that has meaning for the math operations used.   But when we invoke it with a specific type, such as int or float, a version of the function is compiled on the fly that is specialized to that type.  This specialization takes a bit of time, but when we invoke the function a second time  with that argument type, we used the specialized version.   As illustrated in Figure 1, we called the function twice with integer arguments.  The second call is substantially faster than the first.  Calling it with a floating point argument is slow until the specialized version for floating point variables is created, then the function runs fast.

specializing-generics

Figure 1.   Using a timing macro to measure the speed of function calls.   In step 23 the generic version is called with an integer.   The second call uses the version optimized for integers.   In  steps 25 an 26 the specialized version for floating point numbers is generated and run.

The Julia team has put together a benchmark set of small examples written in several languages.  We extracted the benchmarks for C, Python and Julia and ran them.   The resulting execution time are shown below.   As you can see, the Julia code generation is arguably as good as  C (compiled with optimized gcc).  Relative to Python it is as much as 50 times faster.  Figure 2 makes this more explicit.  The only benchmark where Python is within a factor of two is  matrix multiply and that is because Python is using the optimized numpy.linalg libraries.

Benchmark python gcc – O julia
recursion_fibonacci 2.977 0.000 0.037
parse_integers 2.006 0.117 0.197
userfunc_mandelbrot 6.294 0.068 0.065
recursion_quicksort 12.284 0.363 0.364
iteration_pi_sum 558.068 22.604 22.802
matrix_statistics 82.952 11.200 10.431
matrix_multiply 70.496 41.561 41.322
print_to_file 72.481 17.466 8.100

speed-up

Figure 2.   Speed-up of Julia and C relative to Python

2.    Julia Math and Libraries

Doing basic linear algebra and matrix computation in Julia is trivial.  The following operations each take less than one second in Jupyter using the package LinearAlgebra.jl.

M = rand(1000, 1000); # generates a 1000 by 1000 matrix of random floats
N = M^-1                        # compute the inverse of M
C = M*M’                       # M’ is the transpose so the product is symmetric.
Q = svd(C)                     # computes the singular value decomposition of C.
Q.S                                 # are the singular values.

Sparse matrices are also supported along with a large number of other matrix creation and manipulation operations.

Differential Equations

The differential equation package built by a team led by Christopher Rackauckas is one of the most impressive.  To illustrate this we consider an example from their excellent tutorial.   The Lorenz attractor is a fascinating example of a chaotic solution to a system of differential equations that model convection.   The system involves the evolution in 3D of a system that is characterized by three parameters.  As shown below the equations are expressed exactly as you would describe them mathematically (dx is dx/dt) and the three parameters ar sigma, rho and beta.   The initial point is (1,0,0) and the region is integrated over [0,100].   The package automatically picks an appropriate solver and the output is plotted as shown in Figure 3.   Running this on a Mac mini took about 5 seconds.  We used another package “Plots” to render the image.

g = @ode_def LorenzExample begin
           dx = σ*(y-x)
dy = x*(ρ-z) – y
dz = x*y – β*z
end σ ρ β

u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(g,u0,tspan,p)
sol = solve(prob)
plot(sol)

lorentz

Figure 3.  Plot of the Lorenz attractor solution. (yes, Greek Unicode character names are valid.)

Note: Thomas Breloff provides a more explicit integration with a Gif animation in this Plots tutorial.  As with the above example, Breloff’s solution works well in Jupyter.

Julia Distributed Computing in the Cloud

Julia has several packages that support parallel computing.   These include a library for CUDA programming, OpenMP, Spark and  MPI.jl for MPI programming.  MPI is clearly the most reliably scalable parallel computing model for tasks involving 10000 or more cores.  It is based on low-latency, two-sided communication in which coordinated, synchronized send-receive pairs and collective operations are executed in parallel across large clusters equipped with specialized networking.  While Julia can be used with MPI, the natural model of communication in Julia is based on one-sided communication based on threads, tasks, futures and channels.  We will use the package Distributed.jl to demonstrate this.

The Julia distributed computing model is based on distributing tasks and arrays to a pool of workers.  Worker are either Julia processes running on the current host or on other hosts in your cluster.   (In the paragraphs that follow we show how to launch local workers, then workers on other VMs in the cloud and finally workers as docker containers running in a Kubernetes cluster.

Let’s begin with a trivial example.   Create a function which will flip a coin “n” times and count the number of heads.

countheads

Here we generated random Boolean values and converted them to 0 or 1 and added them up.   Now let’s add some workers.

First we include the “Distributed” package and count the current number of workers.   Next using the “addproc( )” function we added two new worker processes on the same server running this Jupyter notebook.  Workers that our notebook process knows about are identified with an integer and we can list them.   (The notebook process, called here the master, is not a worker)

The worker processes are running Julia but they don’t know about the things we have defined in the master process.  Julia has a very impressive macro facility that is used extensively in the distributed computing package to distributed code objects to the workers and  to launch remote computations.

When we create a function in the master that we want executed in the workers we must make sure it also gets defined in that worker.   We use the “@everywhere” macro to  make sure things we define locally are also defined in each worker.   We even must tell the workers we are using the Distributed package.  In the code below we created a new version of our count_heads function and distributed it.

Julia uses the concept of futures to launch computation on the workers.   The “@spawnat” macro takes two parameters: the ID of a worker and a computation to be launched.   What is returned is a future: a placeholder for the final result.   By “fetch( a)” we can grab the result of the future computation and return it to our calling environment.  (Some library functions like “printf()” when executed on the workers are automatically mapped back to the master.)

distributed_countheads

In the following example we create two local workers and define a function for them to execute.  Workers each have a unique integer ID and we print them.   Then we use @spawnat to launch the function on each worker in turn.

We can easily measure the cost of this remote function evaluation with the @time macro as follows. We enclose the call and fetch in a block and time the block.  (Notice we have increased the number of coins to 109 from 104.)

time-siingle-call

If we want both workers working together, we can compose a list of spawned invocations with a construct of the form

[ @spawn at i expr(i) for i in range]

This returns a list (technically, in Julia it is an array) of futures.   We can then grab the result of each future in turn.   The result is shown below.  This is run on a dual core desktop machine, so parallelism is limited but, in this case, the parallel version is 1.85 times faster than the individual call.   The reason that it is not exactly two time faster is partly due to the overhead in sequentially launching and resolving the futures.  But it is also due to communication delays and other OS scheduling delays.

parallel-call

A more compact and elegant way to write this parallel program is to use the Julia distributed parallel map function  pmap(f, args).  pmap takes a function and applies it to each element of a set of arguments and uses the available workers to do the work.   The results are returned in an array.

pmap

In this case count_headse did not need an argument so we constructed an anonymous function with one parameter to provide to the pmap function.  In this execution we were only concerned with dividing the work into two parts and then letting the system schedule and execute them with the available worker resources.   We chose 2 parts because we know there is two workers.   However, we could have divided into 10 parts and applied 10 argument values and the task would have been accomplished using the available workers.

Julia distributed across multiple host machines.

To create a worker instance on another host Julia uses secure shell (ssh) tunnels to talk to it.  Hence you need five things:  the IP address of the host, the port that secure shell uses, the identity of the “user” on that host and the private ssh key for that user.  The ssh key pair must be password-less.  The location of the Julia command on that host is also needed.

In this and the next example we simplify the task of deploying Julia on the remote host by deploying our Julia package as a docker container launched on that host.  To make the ssh connection work we have mapped the ssh port 22 on the docker container to port 3456 on the host.   (We describe the container construction and how it is launched in the next section.)

In the  previous section we provided “addprocs()” with a single integer representing the number of worker we wanted launched locally.   As shown below, the remote version requires a bit more.  We supply an array of tuples to addprocs() where each tuple provides the contact point and the number of workers we want there.  In this example we spawn 2 workers on one remote node and one worker on the other.  We also provide the local location of the private key (here called pubkey) in the sshflags argument.

We also want each worker to have the Distributed.jl package and another package called “PyCall” which enables calling python library functions.  We demonstrate the python call with a call to socket.hostname() in each worker.   Notice that the remote function invocation returns the “print” output to the master process.  The strange hostnames that are printed are the synthetic host names from the docker containers.

distributed-example

This example did not address performance.  We treat that subject briefly at the end of the next section.

Channels

In addition to spawning remote tasks Julia supports a remote channel mechanism.   This allows you to declare a channel that can carry messages of a given type and hold them for remote listeners to pickup.  In the example below we declare a remote channel that carries string messages and a function defined for the listeners to use.   The worker can use the “take!()” function to remove an item from the channel and the master uses “put!()” to fill the channel.  The message “-1”  tells the listener to stop listening.

channels1

channels2

Using the channel mechanism one can use Julia to have a group of workers respond to a queue of incoming messages.   In the Github site for this post we have put an example where the workers take the channel messages and put the results into an Azure table.

Julia Distributed Computing on Kubernetes

Depending upon your favorite flavor of cloud there are usually three or four ways to spin up a cluster of nodes to run a distributed computation.   The most basic way to do this is to launch a group of virtual machines where each has the needed resources for the problem you want to solve.   For Julia, the best thing to do is launch instances of a ‘data science VM” that is available on AWS or Azure.  There are two things your VM needs: an installation of Julia version 1.0.0 or later and the ability to ssh to it without using a password.

Here is how to do this on Azure.   First, run “ssh-keygen” and it will step you through the process of generating a key-pair and give you the option of having no password.  Then from the Azure portal select “create a resource” and search for Linux data science VM.   As you go through the installation process when it asks for a key, paste in the public key you just generated.   When the VM is up you can ssh to it to verify that it has the needed version of Julia installed by typing “Julia” at the command line.  If it can’t find Julia you will need to download and install it.   While you are logged and running Julia, you should also install some of the libraries your distributed program will need.  For example, Distributed.jl and PyCall.jl or any package specific to your application.   If you have a large cluster of VMs, this is obviously time consuming.

A better solution is to package your worker as a Docker container and then use Kubernetes to manage the scale-out step.   We have set up a docker container dbgannon/juliacloud2 in the docker hub that was be used for the following experiments.  This container is based on Jupyter/datascience-notebook so it has a version of Jupyter and Julia 1.0.0 already installed.    However to make it work has a Julia distributed worker it must be running the OpenSsh daemon sshd. A passwordless key pair has been preloaded into the appropriate ssh directory.  We have also installed the Jullia libraries Distributed.jl, PyCall.jl and IJulia.jl.   PyCall is needed because we need to call some python libraries and Ijulia.jl is critical for running Julia in Jupyter.  We have included all the files needed to build and test this container in Github.

Launching the container directly on your local network

The docker command to launch the container on your local network is

docker run -it -d -p 3456:22 -p 8888:8888  dbgannon/juliacloud2

This exposes the ssh daemon on port 3456 and, if you run the jupyter notebook that is on port 8888.  To connect to the server you will need the ssh key which is found on the Github  site.   (If you wish to use your own passwordless keypair you will need to rebuild the docker container using the files in Github. Just replace pubkey and pubkey.pub. and rebuild the container.)   To connect to the container and start Jupyter use the key pubkey.

ssh -i pubkey jovyan@localhost -p 3456
…
jovyan@bdfb4a7068e2:$ jupyter notebook

Notice that when you did this you were prompted to agree to add the ECDSA key fingerprint to your known hosts file.  Jupyter will come up on your local machine at http://localhost:8888 and the password is “julia”.  If you are running this remotely replace localhost with the appropriate IP and make sure port 8888 is open.    To launch worker instances run docker as above (but you don’t need “-p 8888:8888”. )  When they are up you will need to ssh to each from your instance running jupyter.   Doing this step is necessary to put the ECDSA key into the known hosts of your master jupyter instance.

Launching the containers from Kubernetes

Now that Kubernetes has become a cloud standard it is relatively easy to create a cluster from the web portal.   The experiments here were completed on a small 5 node cluster of dual core servers.  Once the cluster was up it was easy to launch the individual components from the Kubernetes command line.  Creating the Jupyter controller required two commands: the first to create the deployment and the second to expose it through a load balancer as a service.

kubectl run jupyter --image=dbgannon/jupyter --port=8888
kubectl expose deployment jupyter --type=LoadBalancer

Creating the worker deployment required one line.

kubectl run julcloud --image=dbgannon/juliacloud

(Note: this wad done with earlier versions of the Docker containers.  Juliacloud2 described above combines the capabilities of both in one container.)

One the two deployments were running, we used the Kubernetes web portal to scale the julcloud deployment to 5 different pods as shown in Figure 4 below.

kuberntes

Figure 4.  Kubernetes deployment of 5 worker pods and the Jupyter pod.

Unfortunately, we still needed to find the Ip address of each pod (which can be found on the Kubernetes portal) and ssh to each from the Python controller to add them to the known hosts file there.  (It should be possible to automate this step.) of Using this configuration we ran two experiments.   In the first we created a simple standard Monte Carlo simulation to compute Pi and ran it by allocating 2 workers per Kubernetes worker pod.  The code is shown below.

compute-pi

We scaled the number of pods to 10 and put two workers per pod and computed the speed up for N = 8*109 and N = 8*1010.  The highly unscientific results are shown in the chart below.  Notice that 10 pods and 20 workers is two workers per core, so we cannot realistically achieve speeds up much beyond  10, so a maximum speedup of 13.9 with 20 workers is good.  ( In fact, it reflects likely inefficiency seen when run with only one worker.)

speed-up-big

Figure 5.  Speedup relative to one worker when using up to 20 workers and 10 pods on five servers.

The second experiment was to use 10 workers to pull data from a distributed channel and have them push it to the Azure table service.   The source code for this experiment is in the Github  repository.

Conclusion

There is a great deal that has not been covered here.  One very important item missing from the above discussion is the Julia distributed array library.  This is a very important topic, but judging from the documentation it may still be a bit of a work-in-progress.   However I look forward to experimenting with it.

One of the applications of Julia that inspired me to take a closer look at Julia is the Celeste astronomy project that used 650,000 cores to reach petaflop performance at NERSC.  Julia computing is now a company devoted to supporting Julia.  Their website has many other great case studies.

There are many interesting Julia resources.  Juliacon is an annual conference that brings together the user community.   A brief look at the contents of the meeting videos (100 of them!) shows the diversity of Julia applications and technical directions.

Cloud Services for Transfer Learning on Deep Neural Networks

 

The breakthroughs in deep learning over the last decade have revolutionized computer image recognition.   The state-of-the-art deep neural networks have 10s of millions of parameters and they require training sets of similar size.   The training can take days on a large GPU cluster.   The most advanced deep learning models can recognize over 1000 different objects in images with surprising accuracy.   But suppose you have a computer vision task that requires that you classify a few dozen different objects.   For example, suppose you need to identify ten different subspecies of wolf, or different styles of ancient Korean pottery or paintings by Van Gogh?    These tasks are far too specific for any of the top-of-the-line pretrained models.  You could try to train an entire deep network from scratch but you if you only have a small number of images of each of your specialized classes this approach will not work.

Fortunately, there is a very clever technique that allows you to “retrain” one of the existing large vision models for your specific task.  It is called Transfer Learning and it has been around in various form from the mid 1990s.   Sebastien Ruder has an excellent blog that describes many aspect of transfer learning and it is well worth a read.

In this article we look at the progress that has been made turning transfer learning into easy-to-use cloud services.    Specifically, we will look at four different cloud services for building custom recognition systems.   Two of them are systems that have well developed on-line portal interfaces and require virtually no machine learning expertise.   They are the IBM Watson Visual Recognition Tool and Microsoft Azure Cognitive Services Custom Vision service.   The other two are tools that require a bit of programming skill and knowledge about deep networks.  These are the Google “Tensorflow for Poets” Transfer learning package and the Amazon Sagemaker toolkit.    To illustrate these four tools, we will apply each systems to the task of classifying images of galaxies.   The result is not deep from an astronomy perspective (because I am not even an amateur astronomer!), but it illustrates the power of the tools.   We will classify the galaxies into four types: barred spiral, elliptical, irregular and spiral as illustrated in Figure 1.     We will do the training with very small training sets:  19 images of each class that were gathered from Bing searches.

The classification task is not as completely trivial as one might assume.   Barred spiral galaxies are a subspecies of spiral galaxy that are distinguished by a “bar” of stars at the origin of the spirals.  Consequently, these two classes are easy to misidentify. Irregular galaxies can be very irregular.  (I like to think of them as galaxies that have not “got it together” enough to take on one of the other forms.)   And elliptical can often look like spiral or irregular galaxies.

galaxy-samples

Figure 1.   Two samples each of the four galaxy types.  The images were taken from Bing searches.

We have made these image files available at AWS S3 in two forms: a zip files barred, elliptical, irregular, spiral and test and in REC format as galaxies-train.rec and galaxies-test.rec.

Transfer Learning for DNNs

Before we launch into the examples, it is worth taking a dive into how transfer learning work with a pre-built deep learning vision model.   A good example, and one we will use, is the Inception-V3 model shown in Figure 2.

Inceptionv3

Figure 2.  Inception-V3 deep network schematic.   Image from the Google Research blog “Train your own image classifier with Inception in TensorFlow“.

In  Figure 2, each colored blob is a subnetwork with many parameters.  The remarkable thing about deep networks is how much of lower layers of convolution, pooling, concatenation seem to capture abstract qualities of images such as shapes and lines and regions.  Suppose the network has L layers  At the risk of greatly oversimplifying one can say that it is only at the last few layers that specific image classification takes place.   A simple way to do transfer learning is to replace the last two layers with two new ones and retrain the trained parameters of layers 0 to L-2 “constant” (or nearly so).

The paper https://arxiv.org/pdf/1512.00567.pdf  “Rethinking the Inception Architecture for Computer Vision” by  Szegedy et al. describes InceptionV3 in some detail.   The last two layers are a fully connected layer with 2024 inputs and 1000 softmax outputs.     To retrain it for 4 outputs  we replace last layers as illustrated in Figure 3 with two new layers. We now have only one matrix W of dimension 2024 by 4 of parameters we need to learntransfer-net

Figure 3.   Modified network for transfer learning.

If the training algorithm converges, it will be literally thousands of time faster than training the original.  A nice paper by Yosinski et al takes an in-depth look at feature transferability in deep networks.   There are other ways to do transfer learning on deep nets than just holding the L-2 layers fixed.   Instead one can allow some fine tuning of the top most layers with the new data.    There is much more that can be said on this subject, but our goal here is to evaluate some of the tools available.

The IBM Watson Visual Recognition tool.

This transfer learning service is incredibly easy to use.   There is an excellent drop-and-drag interface and a nodeJS API as well as a Python API.   To test it we clicked on the create classifier button and dragged the zip files for our four classes of galaxies onto the interface as shown below.

watson0

Figure 4. Visual recognition tool Interface with dragged zip files for the galaxy classes.

Within a few minutes we had a view of the classifier that we could test.   The figure below illustrates the results from dragging three examples from the training set to the classifier interface.  As you can see the interface returns the relative strength of membership in each of the classes.

To invoke the service, you need three things:

  1. your IBM bluemix api_key which you were given when you logged into the service the first time to build the model.
  2. Once your model has been built you need the classifier ID which is visible on the tool interface.
  3. you must install the watson_developer_cloud module with pip.

watson-final

Figure 5:  Three vertical panels show the result of dragging one of the training images onto the classifier interface.

import watson_developer_cloud
from watson_developer_cloud import VisualRecognitionV3
visual_recognition = VisualRecognitionV3( 
      api_key = '1fc969d38 your key here 7f7d3d27334', 
      version = '2016-05-20')
classifier_id = 'galaxies_1872954591'
image_url = https://s3-us-west-2.amazonaws.com/learn-galaxies/bigtest/t12.jpg
   
param = {"url": image_url, "classifier_ids":[classifier_id]}
visual_recognition.classify(parameters=json.dumps(param)

The key elements of the code are shown above.  (There may be other versions of the Python API.  This one was discovered by digging through the source code.   There is little other documentation.)  We built a Jupyter notebook that uses the api to compute the confusion matrix for our test set.  The Watson classifier will sometimes refuse to classify an image into one of our categories, so we had to create a “none” tag to identify these cases.  The results are very good, with the exception of the confusion of spiral and barred spiral galaxies.

watson-test-results

Figure 6: Results from the Watson classifier for our 40 image test set.

Computing the confusion matrix for the training set gives a perfect score as shown below.

watson-train-results

Figure 7: Confusion matrix given the training set as input.

The Jupyter notebook in HTML and IPYNB formats are available in S3.  One additional comment is needed.   Because this service is a black box, we have no idea what transfer learning service is use.

Microsoft Azure Cognitive Services Custom Vision.

The Microsoft Azure Custom Vison Service is another very well designed and easy to use system.   It is also a black box, so we have no idea how it works.  The assumption is that intended users don’t need to know and the designers are free to change the algorithm if they fine better ones.

Once you log in you create a new project as shown in Figure 8 below.   Then you can upload your training data using another panel in the interface.

azure-galexy

Figure 8.   The  left panel defines the galaxy name and type.  The right panel is for uploading the training set.

Once the training set is in place you can see your project with a view of some of your images as shown in Figure 9.   There is a button to click to start the training.   In this case it takes less than a minute to see the results (Figure 10).

azure-api

Figure 9.   The view of a sample of your training set.    The green button starts the training.

azure2

Figure 10.   The results from 2 iterations of the training.

If you are not pleased with the result of the training, you can try adding or removing images from the training set and train it again.

During the training with this data we made 3 iterations. the first was with the initial data. The system recognized that one of the elliptical galaxies was a duplicate, so the second iteration included an additional elliptical galaxy. The system will not allow a new iteration until you have modified the data, so the third iteration replaced a random spiral galaxy with another.  The results here are not great, but not bad for the small size of the training set.    As shown in Figure 11, the confusion matrix is better than the IBM case for distinguishing barred elliptical from elliptical but not as good at recognizing the irregular galaxies.

azure-test-results

Figure 11.  Confusion matrix for Azure Custom Vision test.

Using the training data to compute the confusion we get an almost perfect score, but one barred spiral galaxy is recognized as spiral.

azure-train-results

Figure 12.  Confusion Matrix for Azure Custom vison with training data

We looked at the case that confused the classifier and it can be seen to one that is on the border between barred spiral and spiral.   The image is contained in the full Jupyter notebook (html versionipynb version).

To use the notebook you need to have your prediction and training keys and the project id for the trained model.   You will also need to update your version of the Azure Python SDK.   The code below shows how to invoke the predictor.  The notebook gives the full details.

from azure.cognitiveservices.vision.customvision.prediction import prediction_endpoint
from azure.cognitiveservices.vision.customvision.training import training_api
training_key = 'aaab25your training key here 8a8b0' 
prediction_key = "09199your prediction key here b9ae" 
trainer = training_api.TrainingApi(training_key) 
project_id = 'fcbccf40-1bce-4bc4-b4ea-025d63f1014d' 
project = trainer.get_project(project_id)
iteration = trainer.get_iterations(project.id)[2]
image = “https://s3-us-west-2.amazonaws.com/learn-galaxies/bigtest/t5.jpg”
predictor = prediction_endpoint.PredictionEndpoint(prediction_key)
results = predictor.predict_image_url(project.id, iteration.id, url=image)
for prediction in results.predictions: 
  print("\t" + prediction.tag + ": {0:.2f}%".format(prediction.probability * 100)

The printed results give the name of each class and the probability that it fits the provided image.

Tensorflow transfer learning with Inception_v3

Google has built a nice package called Tensorflow For Poets that we will use for the next test.  This is part of their Google Developer Codelabs.

You need to clone the github repo with the command

git clone https://github.com/googlecodelabs/tensorflow-for-poets-2
cd tensorflow-for-poets-2

Next go to the subdirectory tf_files and create a new directory there called “galaxies” and put four subdirectories there: barredspiral, spiral, elliptical, irregular with each containing the corresponding training images. Next do

pip install --upgrade tensorflow

The Tensorflow code to do transfer learning and retrain a model is in the subdirectory scripts in a file retrain.py.   It follows the transfer learning method we described earlier by replacing the top to layers of the model with a new, smaller fully connected layer and a softmax layer.   We can’t go into the details here without a deep dive into Tenorflow code which is beyond the scope of this article.   Suffice it to say that it works very nicely.

The command to “retrain” the inception model is

python -m scripts.retrain \   
     --bottleneck_dir=tf_files/bottlenecks \   
     --how_many_training_steps=500 \   
     --model_dir=tf_files/models/ \   
     --summaries_dir=tf_files/training_summaries/"inception_v3" \   
     --output_graph=tf_files/retrained_graph.pb \   
     --output_labels=tf_files/retrained_labels.txt \   
     --architecture="inception_v3" \   
     --image_dir=tf_files/galaxies

If all goes well you will finally get the results that look like this

INFO:tensorflow:Final test accuracy = 88.9% (N=9)

Invoking the re-trained model is simple and you don’t need to know much Tensorflow to do it.  You essentially load the image as a tensor and load the model graph and invoke it with the input tensor.  The complete Python code for this in in the Jupyter notebook (in html and ipynb formats).

As with the other examples we have computed the confusion matrix for the test set and training set as shown below.

tensorflow-results-test

Figure 13.  Tensorflow test results.

tensorflow-results-train

Figure 14.  Tensorflow results on the training set

As can be seen the retrained model as the usual difficulty distinguishing between spiral and barred spiral and irregular sometimes looks like elliptical and sometimes spiral.   Otherwise the results are not too bad.

Amazon SageMaker

SageMaker is a very different system from the tools described above.  This article will not attempt to cover SageMaker thoroughly and we will devote a more complete article to it soon.

Briefly, it consists of a complete system for training and hosting ML models.  There is a web portal but the primary user interface is Jupyter notebooks.   Figure 15 illustrate the view of the portal after we created several experiments.  It nicely illustrates the phases of SageMaker execution.

  • You first create a Jupyter instance and a notebook. When you create a Jupyter notebook instance from the portal you are actually deploying a virtual machine on AWS.
  • You use the notebook to create ML training jobs. The training jobs take place on a dynamically allocated container cluster.
  • When training is complete you create a model which is stored and managed by SageMaker.
  • When you have a model you can create an endpoint that can be used to invoke the model from your application.

sagemaker.JPG

Figure 15.  SageMaker portal interface.

To train a new model you provide the name of an AWS S3 bucket where your data is stored and a bucket where the output is going to be placed.

When the Jupyter VM spins up you see it in your browser.   The first thing you discover is a large collection of demo notebooks covering a host of topics.   You are not restricted to these.  There is also a library of tools to use Apache Spark from SageMaker.  You can also upload your own notebooks with TensorFlow or MXNet models for training.   Our you can create a docker image with your own algorithms.

In the example are interested here we discovered a SageMaker example notebook, Image-classification-transfer-learning.ipynb and made a copy we called sagemaker-galaxy-predict that you can access (in html or in ipynb  format).   As with the IBM and Microsoft examples, the actual transfer learning algorithm used is a black box, but there are some hints and parameters you can adjust.

When you train a deep neural network, you are find values for the millions of parameters in the network.  (As we have described above there are many fewer parameters in transfer learning.)  But there are an additional set of parameters, called hyperparameters, that describe the network architecture and the learning process.   In the case of the transfer learning notebook you must specify the following hyperparameters:  the number of layers in the network, the training minibatch size, the training rate and the number of training epochs.   There are defaults for these based on the example that SageMaker provides, but they did poorly for the galaxy experiment.    This left us with a four-dimensional hyperparameter space to explore.   After spending about two hours trying different combinations we came up with the table below.

# The algorithm supports multiple network depth (number of layers). They are 18, 34, 50, 101, 152 and 200
num_layers = 101
# we need to specify the input image shape for the training data
image_shape = "3,224,224"
# we also need to specify the number of training samples in the training set
num_training_samples = 19*4
# specify the number of output classes
num_classes = 5
# batch size for training
mini_batch_size =  21
# number of epochs
epochs = 5
# learning rate
learning_rate = 0.0018
top_k=2
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights
use_pretrained_model = 1

 

We are absolutely certain that these are far from optimal.   Once again we computed a confusion matrix for the test set and the training set and they are shown in Figure 16 and 17 below.

sagemaker-test-results

Figure 16.   Confusion matrix for SageMaker test data.

sagemaker-train-results

Figure 17.  Confusion matrix for SageMaker on training data.

As can be seen, these are not as good as our other three examples.    The failure is largely due to poor choices for the hyperparameters.  It should be noted that the Amazon team is just now starting a hyperparameter optimization project.   We will return to this example after that capability is available.

Conclusion

In this report we examined four computer vision transfer learning service.   We did this study using a very tiny example to see how well each service performed.   We used the simple confusion matrix to give us a qualitative picture of performance.  Indeed, these matrices showed us that distinguishing the barred spiral galaxies from the non-barred spiral ones was often challenging and that irregular galaxies are easy to misclassify.   If we want a quantitative evaluation we can compute the accuracy of each method using the test data.  The results are Azure = 0.75, Watson = 0.72, Tensoflow = 0.67 and SageMaker = 0.6.   However, given the very small size of the data sets, we argue that it is surprising that we could get reasonable results with such little effort.

Building the best galaxy classifier was not our goal here.  Real astronomers can do a much better job building systems that can answer much more interesting questions the classification task posed here. The goal of this project has been to show what you can do with cloud transfer learning tools.   The IBM and Azure tools were extremely easy to use and, within a few minutes you had a model constructed. It was not hard to access and use these models from a Python client.  The Tensorflow example from Google allowed us to do the transfer learning on a laptop.  SageMaker was fun to use (if you like Jupyter), but tuning the hyperparameters is a challenge.   A follow-up article will look at additional SageMaker capabilities.

Finally,  if any reader can improve on any of these results for this small dataset, please let me know!


					

A Brief Survey of Cloud AI Services

The commercial clouds are in a race to see who can provide the most interesting and useful AI services on their cloud platform.   This work began in the research laboratories in universities and companies over the past 25 years, but the big breakthroughs came when deep learning models trained on massive data collections began to reach levels of human accuracy.  For some time now, the public cloud companies have provided custom virtual machines that make it ease for technically sophisticated customers to use state of the art ML and neural network tools like TensorFlow, CNTK and others.  (We described these in here.)  But the real competition is now to provide services for building smart applications that can be used by developers lacking advanced training in machine learning and AI. We now have speech recognition, language translation, image recognition capabilities that can be easily integrated into web and mobile applications.

In the following paragraphs we will look at the AI services provided by  IBM, Google, Microsoft and Amazon.  These are certainly not the only providers.  Salesforce has the myEinstein platform and small companies like Algorithmia and not-so-small Genpact  also provide services and consulting in this area.

What becomes abundantly clear when you study the details of the offerings is that they all cover the same basics.  This includes tools for building bots, natural language translations, speech-to-text and text-to-speech and unstructured document analysis.   But what one also discovers is that each provider has some services that standout as being a bit more innovative that that offered by the others.  We conclude with an overview of the trends we see and thoughts about the future of cloud AI services.

This is the first of a series that we will do on this topic.   Future articles will explore some of these capabilities in more technical depth.  For example, at the end of this article, we look at an example of doing text analysis with Amazon Comprehend.

IBM Cloud Watson Services

The IBM Watson services are organized into five groups.

  • Watson conversation provides a portal interface to build Bots.  The interface promps you to identify intents, entities and dialog flow.  Intents are the questions you expect your users to ask.  Entities are the components such as city names, times and other objects your bot will understand.   Dialog flow is the tree of intents and responses you anticipate in the dialog.   The result is a bot you can deploy and later improve.
  • The discovery service is a tool that allow you to quickly ingest and explore data collection.  A query language can be used to subset results and identify important features and anomalies.  Discovery news is a service to crawl news and blogs looking for patterns in sentiment, new concepts and relationships.   It allows you to see trends and key events.
  • The visual recognition service has been used to analyze aerial images to better understand drought and water use.   It can do image content analysis including detecting faces and making age and gender estimates.   If you have your own collection of labeled images the system can be easily trained to incorporate these into its model.
  • Speech. Watson has speed-to-text and text-to-speech services.   These services work reasonably well but the quality of the output speech does not seem as good  as  Amazon Poly.
  • The Watson natural language classifier is designed to classify intent of text passages such as deciding that a question about the weather is looking for current temperatures.   As with the other services it is update it with additional training data.
  • The Watson empathy services allow prediction of personality characteristics and emotions through text.

Google Cloud AI services

The Google cloud has an extensive set of AI services available.

  • AutoML is Google’s tool for training their vision models on your data. If you image data is labeled it will help create better labels.   If it is not labeled they will help label it.    It uses transfer learning which is a method to retrain a neural network to recognize new inputs.  By leaving many of the early layers in the previously trained network unchanged basic features such as edges and shapes can be used again and only the last few layers need to be relearned.  (This method is widely used by the other image services described here.)  Google also has a powerful vision api that is capable of recognizing thousands of categories of images.
  • Cloud Machine Learning Engine is a cloud service that help you manage a large cluster for very large ML tasks. It also allows you to use your trained algorithm with terabytes of data and thousands of concurrent users.
  • DilogFlow is Google’s tool for building bots and interfaces that support natural and rich interactions.
  • Video Intelligence. Suppose you have a large collection of videos and you want to be able to search for occurrences of specific words.  The Google cloud video intelligence API makes this possible.
  • Cloud Speech. Google has a long history with speech-to-text recognition that is widely used in their android product and Google search.   The Google cloud Speech API recognizes over 100 languages and variants.   It has context aware recognition that filters out lots of background noise.   (The also have a very nice speech recognition kit that works with a raspberry pi.   I have used it.   It is fun little project.)
  • Natural Language. Google’s text analysis is very good at parsing documents.   It is very good at entity recognition (tagging many phrases and words with Wikipedia articles).   It can also give you lists of relevant categories.   For syntax analysis is used a version of their parsyMcParseface parser that I used in my demo of building an application for Algorithmia described in this post.
  • Cloud Translation. Google had one of the earliest cloud translation services and it has become better over time.   It supports more than 100 languages.

Microsoft Azure

Azure’s machine learning services are divided into two categories: Artificial Intelligence and cognitive services.  There are currently three AI services:

  • ML services which is based on their machine learning workbench. The workbench is designed to guide you through the process of creating a data package from your data and then build a python/pyspark script to evaluate it.   You can invoke the script locally, on an azure vm or in a container.   The workbench has the capability to fill in the gaps of data cleaning and algorithm selection in generating a final solution.
  • Batch AI services consist of a set of tools to help you marshal GPU and CPU cluster resources for parallel machine learning training using CNTK, TensorFlow and Chainer.
  • Azure AI services include Bot Builder, an SDK for creating bots and a suite of bot template.

The cognitive services are divided into four main categories.

  • This includes a vision API for content analysis which works with a Jupyter notebook that allows you to upload images and return a description in terms of recognized entities.  It also provides a caption.  A content moderator service allows you to flag images that may have unwanted content.  The custom vision service allows you to quickly train a vision app to recognize images from classes you provide.   The classes can be small (30 images in each) but it does not recognize your images when they are embedded in more complex scenes.   However it does allow you to export the trained model as TensorFlow to run in offline applications.  Face and Emotion APIs allow you to detect faces in images and detect the mood of each.   The video indexer is impressive.  It  can provide audio transcription, face tracking and identification, speaker indexing, visual text recognition, sentiment analysis and language translation.
  • The speech to text and text to speech services are there but there is also a Custom Speech Service that allows you to add knowledge about specific jargon to the language model.  A Speaker Recognition API allows your apps to automatically verify and authenticate users using their voice and speech.  The Translator service is based on the work that was done for the skype realtime speech translation system.   It can recognize languages and translate the spoken sentences into the target language.
  • The Language Understanding Service allows your application to understand spoken commands like “Turn off the light” or home automation tasks.  The Linguistic Analysis API provides sentence separation, part-of-speech tagging and constituency parsing.   The Text Analysis Service provide sentiment analysis and key phrase extraction.  A Web Language Model is based on the Web N-Gram Corpus for analysis of Web documents.
  • The Custom Decision Service uses reinforcement learning algorithms to extract features from a set of candidates when ranking articles and images for automatic inclusion in a web site.  The Entity Linking Intelligence Service API provides a tool to understand when an word is uses as an actual entity rather than a part of speech or a general noun or verb.  This is done by looking at the context of the use of the word.  The Academic Knowledge API provides access to the Microsoft academic graph which is data mind from the Bing index.   The QnA Maker is a REST API that trains a machine learning system to help bots respond in a more natural way to user requests.

AWS AI services

Amazon’s web services cloud AI services has seven major APIs

  • Image and video Rekognition. The image recognition service allows the full set of computer vision features that are available anywhere.  Object, scene and activity detection is continuously learning.  It can recognize objects and scenes.  Text in images like street names or product names can be read.  If you have a private library of photos it can identify a people. When it is analyzing video it can identify certain activities happening in the frame.   Facial analysis recognizes age ranges and emotions.   When analyzing video it can track individual people as they go in and out of a frame.  Sending live or recorded video to a Kinesis Video Stream  can be routed to rekognition video and identified object can be sent to lambda functions that can react in near real time.   Alternatively, video can be periodically loaded into S3 buckets which trigger lambda functions that will invoke rekognition for analysis.
  • Amazon Lex is a tool for build bots with voice and text input and response. It is the same technology that powers Echo’s Alexa.    The Lex console allows you to build  a bot with ease.  Conversation flow is an import part of the Bot interaction.   Lex supports simple mechanisms to allow you to tailor the flow to your application.
  • Comprehend. There are two main components to Amazon Comprehend.   The first is a set of tools to extract named entities (“Person”, “Organization”, “Locations”, etc.) and key phrases from a document. The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see then classified into a set of N bins where N is a number you pick.   Comprehend will take you collection of documents and apply a Latent Dirichlet Allocation-based learning model to separate them into N bins with each bin defined by a set of key words it has discovered.   (At the end of this article we will demonstrate Amazon Comprehend.
  • Translate. This service provides real-time and batch language translation.   The service is protected by SSL encryption.
  • If you have a mp3 or wave video and you want to add subtitles, the transcribe service will render all of the voice audio to text and also insert timestamps for each word.   You can then use Translate to convert the audio to another language.   They say they are adding specific voice identification soon.
  • Poly is the Amazon text to speech API.   It is far from the robotic sounding speech generation we saw in the past.   It has 47 different voices spread over many languages.   (I have used it and it is both impressive and fun.)

Conclusions

If you need to build a bot that understands English, French and Mandarin and replies with spoken and correctly accented Italian that can help you identify your friends and celebrities in your Instagram photos and also mine your twitter feed, you are in luck.  The tools are there.  But if you are expecting emergent artificial intelligence, you are out of luck.  Alexa, Cortana and Seri are each good at fast facts but otherwise dumb as a post.

It is also now clear that this technology is also a boon to those with more nefarious goals.  If you are a government security agency with access to lots of cameras in public places, keeping track of your citizens is now a snap.   We see that social media is now swarming with bots that sell not only soap but also promote and propagate lies and propaganda.    Serious questions are being raised about the potential threat to modern democracies that these technologies enable.   The social media companies are aware of the challenge of eliminating the bots that skew our national discussions and we hope they are up to the cleanup task.

There is also much to be excited about.   The technology behind these AI services is also helping us use vision and sensing that can truly help mankind.   We can “see” the planet at an enlarged scale.  We can spot droughts, crop disease and the effects global warming is having on the planet in greater detail because of the speed and accuracy of image analysis.    We can monitor thousands of sensors in our environment that help us improve our quality of air and water and we can better predict potential problems before they occur.  The same sensor and vision technology help us scan x-ray and other medical images.

All of these AI advances are going to give us safer roads with driverless cars and robots magnify the power of the individual worker in almost every domain.   I look forward to the time Alexa and or Cortana can become a real research partner helping me scan and review scientific literature and point me to discoveries that I most certainly miss today.

Afterword

In the following paragraphs we look at one of the cloud services in depth.   In future articles we will examine other capabilities and applications.

Text Analysis with Amazon Comprehend

As with everything in AWS, their services can be accessed by the command line interface or the APIs.   However, the console provides a very simple way to use them.   We will test Amazon’s Comprehend using the named entity and key phrases interface.  The service is accessed via their API explorer.

We selected a paragraph about the discovery of DNA from Wikipedia and pasted it into the entity/key phrase extractor.    The results are shown in figures 1, 2 and 3.

comprehend1

Figure 1.  Inserting a paragraph into the API explorer.

Comprehend2

Figure 2.   The list of Entities

keyphrases.JPG

Figure 3. The key phrases

As can be seen the system does a very good job with both the entity and key phrase tasks.   In particular it does a great job of categorizing the named entities.

Topic Modeling

The more impressive capability of Amazon Comprehend is the topic modeling subsystem.   This is of interest if you have a large collection of documents and you would like to see then classified into a set of N bins where N is a number you pick.   To try this out, I  used the Arxiv science abstract collection I have used in previous document classifier experiments.   Each document is the text of an abstract of a scientific research paper.   To use comprehend you put the documents in an AWS S3 bucket.   I have 7108 documents and they are in  the bucket https://s3-us-west-2.amazonaws.com/scimlcomprehend.  (If you are interested, the individual files can be accessed by this url https://s3-us-west-2.amazonaws.com/scimlcomprehend/*arxiv  where * is a an integer between 0 and 7108.)

Invoking the topic modeler from the console is trivial.  You simply fill in a form.  The form for my experiment is shown below in Figure 4.

Comprehend4

Figure 4.   Invoking the Topic modeler.

In this test case the topic modeler ran in about five minutes and produce a pair of CSV files.   One file contained a set of tuples for each document.  Each tuple is a triple consisting of the document name, the name of a topic bin and a score for fit for that bin.   For example, here is the first 11 tuples.  The abstract documents are drawn from five fields of science: physics, biology, computer science, math and finance. We have added a fourth column that provides the science category for the listed document.

Document no. Topic Score Actual topic
0 0 0.242696 compsci
0 5 0.757304 compsci
1 1 1 math
2 0 0.546125 Physics
2 4 0.438275 Physics
2 5 0.015599 Physics
3 1 1 math
4 8 1 Physics
5 0 0.139652 Physics
5 3 0.245669 Physics
5 5 0.614679 Physics

As can be seen, document 0 is computer science and scores in topic 0 and highly in topic 5.    Documents 1 and 3 are math and squarely land in topic 1.  Documents 2, 4 and 5 are physics and are distributed over topics 0,3,4,5 and 8.   The algorithm used in the topic modeler is described in Amazon’s documentation as follows.

“Amazon Comprehend uses a Latent Dirichlet Allocation-based learning model to determine the topics in a set of documents. It examines each document to determine the context and meaning of a word. The set of words that frequently belong to the same context across the entire document set make up a topic.”

If we look at the corpus as a whole we can see how well the topic modeler did in relation to the known topics.    The result is in Figure 5 below which gives the percent of papers in each science area that had scores in  each modeler topic.

figure5

Figure 5.  Topics selected by the model for each science discipline

As can be seen the modeler topic 000 did not differentiate very well between physics, bio and compsci.   To look closer at this we can look at the other csv file generated by the modeler.   This file lists the key words the modeler used to define each topic.   In the case of topic 000 the words were:

'system','timar','quantum','dynamic','cell','control','network','information,'dynamical','scheme'

As can be seen these are words that one would expect to see in many articles from those three areas.  If we look beyond topic 000, we see physics is strong in topic 3 which is defined by the words

'field','energy','matt','dark','scalar','scale','magnetic','cosmological','couple','theory'

This topic is clearly physics.  Looking at computer science, we see the papers score strongly is topics 005 and 007.   These words are

'datar','method','analysis','learn','base','network','approach','value','regression','gene','algorithm','graph','problem','numb','propose','optimal','set','edge','vertex','low'.

We included machine learning in the computer science topics so this result is also reasonable.   For math the strong topics were 001 and 006 and the corresponding words were

'distribution','method','function','sample','estimator','estimate','process','parameter','random','rate','space','prove','mathbb','group','graph', 'algebra’,'theorem','finite','operator','set'

which constitutes a nice collection of words we would expect to see in math papers.  For finance topic 009 stands out with the following words.

'market', 'price', 'risk', 'optimal', 'problem', 'function', 'measure', 'financial', 'strategy', 'option'.

The only area where the topic modeler failed to be very clear was in the area of biology where topics 004 and 005 were the best.   Those words were not very indicative of biology papers:

'model', 'parameter', 'data', 'propose', 'distribution', 'inference', 'simulation', 'bayesian', 'fit', 'variable' , 'method', 'analysis', 'learn', 'base', 'network', 'approach', 'value', 'regression', 'gene'.

As an unsupervised document classifier, the Amazon Comprehend modeler is impressive.   Classifying these science abstracts is not easy because science is very multidisciplinary and many documents cross the boundary between fields.   We have looked at this problem in a previous post Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud and in our book Cloud Computing for Science and Engineering where we describe many of the challenges.   One short coming of the Amazon modeler is that it does not provide  an easy way to model a new document against the models built from the corpus.  This should be easy to do. In the analysis above we looked at how broad scientific domains are mapped over the detected category  bins.  One thing we also need to look at is how well the individual categories are at grouping similar abstracts.  This is equivalent to looking at the columns of the table in Figure 5 above.   If we take a look at topic 006 that is heavily associated with math we can print the titles and the ArXiv sub-categories they came from.   A sample is shown below.

‘Differential Calculus on Cayley Graphs [cs.DM]’,
‘Coherent rings, fp-injective modules, and dualizing complexes [math.CT]’,
‘Self-dual metrics with maximally superintegrable geodesic flows [gr-qc]’,
‘New atomic decompositons for Bergman spaces on the unit ball [math.CV]’,
‘Presenting Finite Posets [cs.LO]’,
‘The Whyburn property and the cardinality of topological spaces [math.GN]’,
‘Absolutely Self Pure Modules [math.RA]’,
‘Polynomials and harmonic functions on discrete groups [math.GR]’,
‘Free Resolutions of Some Schubert Singularities in the Lagrangian  Grassmannian [math.AG]’,
‘Connectedness properties of the set where the iterates of an entire  unction are unbounded [math.DS]’,
‘A Purely Algebraic Proof of the Fundamental Theorem of Algebra [math.HO]’,
‘A cell filtration of the restriction of a cell module [math.RT]’,
‘Higher dimensional Thompson groups have subgroups with infinitely many   relative ends [math.GR]’,
‘PI spaces with analytic dimension 1 and arbitrary topological dimension [math.MG]’,
‘Eigenvalues of Gram Matrices of a class of Diagram Algebras [math.RA]’

With the exception of the first, third and fifth documents they are all math and even those two documents look like math.   On the other hand looking at a sample from category 000 we see a true hodgepodge of topics.

‘A stochastic model of B cell affinity maturation and a network model of   immune memory [q-bio.MN]’,
‘Precise determination of micromotion for trapped-ion optical clocks [physics.atom-ph]’,
‘Quantum delocalization directs antenna absorption to photosynthetic   reaction centers [physics.bio-ph]’,
‘Fluorescence energy transfer enhancement in aluminum nanoapertures [physics.optics]’,
‘Direct Cortical Control of Primate Whole-Body Navigation in a Mobile   Robotic Wheelchair [q-bio.NC]’,
‘Condition for the burning of hadronic stars into quark stars [nucl-th]’,
‘Joint Interference Alignment and Bi-Directional Scheduling for MIMO   Two-Way Multi-Link Networks [cs.IT]’,
‘MCViNE — An object oriented Monte Carlo neutron ray tracing simulation   package [physics.comp-ph]’,
‘Coherent addressing of individual neutral atoms in a 3D optical lattice [quant-ph]’,
‘Theoretical analysis of degradation mechanisms in the formation of   morphogen gradients [physics.bio-ph]’,
‘Likely detection of water-rich asteroid debris in a metal-polluted white   dwarf [astro-ph.SR]’,
‘A Study of the Management of Electronic Medical Records in Fijian   Hospitals [cs.CY]’,
‘Self-assembling interactive modules: A research programme [cs.FL]’,
‘Proceedings Tenth International Workshop on Logical Frameworks and Meta   Languages: Theory and Practice [cs.LO]’,

Setting aside this topic bin 000, we certainly see strong coherence of the documents.

Moving Cloud AI to the Edge

Edge computing describes the movement of computation away from cloud data centers so that it can be closer to instruments, sensors and actuators where it will be run on “small” embedded computers or nearby “micro-datacenters”.  The primary reason to do this is to avoid the network latency in cases where responding to a local event is time critical.  This is clearly the case for robots such as autonomous vehicles, but it is also true of controlling many scientific or industrial apparatuses.  In other cases, privacy concerns can prohibit sending the data over an external network.

We have now entered the age where advances in machine learning has made it possible to infer much more knowledge from a collection of the sensors than was possible a decade ago.    The question we address here is how much deep computational analysis can be moved to the edge and how much of it must remain in the cloud where greater computational resources are available.

The cloud has been where the tech companies have stored and analyze data.  These same tech companies, in partnership with the academic research community, have used that data to drive a revolution in machine learning.  The result has been amazing advances in natural language translation, voice recognition, image analysis and smart digital assistants like Seri, Cortona and Alexa.   Our phones and smart speakers like Amazon Echo operate in close connection with the cloud.   This is clearly the case when the user’s query requires a back-end search engine or database, but it is also true of the speech understanding task.  In the case of Amazon’s Echo, the keyword “Alexa” starts a recording and the recorded message is sent to the Amazon cloud for speech recognition and semantic analysis.   Google cloud, AWS, Azure, Alibaba, Tencent,  Baidu and other public clouds all have on-line machine learning services that can be accessed via APIs from client devises.

While the cloud business is growing and maturing at an increasingly rapid rate, edge computing has emerged as a very hot topic.  There now are two annual research conferences on the subject: the IEEE Service Society International conference on Edge computing and the ACM IEEE Symposium on Edge computing.  Mahadev Satyanarayanan from CMU, in a keynote at the 2017 ACM IEEE Symposium and in the article “The Emergence of Edge Computing” IEEE Computer, Vol. 50, No. 1, January 2017, argues very strongly in favor of a concept called a cloudlet which is a server system very near or collocated with edge devices under its control.   He observes that applications like augmented reality require real-time data analysis and feedback to be usable.  For example, the Microsoft Hololens mixed reality system integrates a powerful 32bit Intel processor with a special graphics and sensor processor.   Charlie Catlett and Peter Beckman from Argonne National Lab have created a very powerful Edge computing platform called Waggle (as part of the Array of Things project) that consists of a custom system management board for keep-alive services and a powerful ODROID multicore processor and a package of instruments that measure Carbon Monoxide, Hydrogen Sulphide, Nitrogen Dioxide, Ozone, Sulfur Dioxide, Air Particles, Physical Shock/Vibration, Magnetic Field, Infrared Light, Ultraviolet Intensity, RMS Sound Level and a video camera.  For privacy reasons the Waggle vision processing must be done completely on the device so that no personal identifying information goes over the network.

Real time computer vision tasks are among the AI challenges that are frequently needed at the edge.  The specific tasks range in complexity from simple object tracking to face and object recognition.   In addition to Hololens and Waggle there are several other small platforms designed to support computer vision at the edge.   As shown in Figure 1, these include the humble RaspberryPi with camera,  the Google vision kit and the AWS DeepLens.

fourdevices

Figure 1.  From the left is a RaspberryPi with an attached camera, ANL Waggle array, the Google AIY vision kit and the AWS DeepLens.

The Pi system is, by far, the least capable with a quad core ARMv7 processor and 1 GB memory.   The Google vision kit has a Raspberry Pi Zero W (single core ARMv7 with 512MB memory) but the real power lies in the Google VisionBonnet which uses a version of the Movidius Myriad 2 vision processing chip which has 12 vector processing units and a dual core risc cpu.   The VisionBonnet runs TensorFlow from a collection of pretrained models.    DeepLens has a 4 megapixel camera, 8 GB memory, 16 GB storage and an intel Atom process and Gen9 graphics engine which supports model built with Amazon SageMaker that is pre-configured to run TensorFlow and Apache MXNet.

As we stated above many applications that run on the edge many must rely on the cloud if only for storing data to be analyzed off-line.   Others, such as many of our phone apps and smart speakers, use the cloud for backend computation and search.   It may be helpful to think of the computational capability of edge devices and the cloud as a single continuum of computational space and an application as an entity that has components distributed over both ends.   In fact, depending upon the circumstances parts of the computation may migrate from the cloud to the device or back to optimize performance.   As illustrated in Figure 2, AWS Greengrass accomplishes some of this by allow you to move Lambda “serverless” functions from the cloud to the device to form a network of long running functions that can interact with instruments and securely invoke AWS services.

greengrass

Figure 2.   AWS Greengrass allows us to push lambda functions from the cloud to the device and for these functions to communicate seamlessly with the cloud and other functions in other devices.  (Figure from https://aws.amazon.com/greengrass/ )

The Google vision kit is not available yet and DeepLens will ship later in the spring and we will review them when they arrive.   Here we will focus on a few simple experiments with the Raspberry Pi and return to these other devices in a later post.

Deep Learning Models and the Raspberry PI 3.

In a previous post we looked at several computer vision tasks that used the Pi in collaboration with cloud services.   These included simple object tracking and doing optical character recognition and search for information about book covers seen in an image.   In the following paragraphs we will focus on the more complex task of recognizing objects in images and we will try to understand the limitations and advantages of using the cloud as the backend computational resource.

As a benchmark for our experiments we use the Apache MXNet deep learning kit with a model based on the resnet 152-layer neural network that was trained on a  collection  of over a  10 million images and over 11 thousand labels.    We have packaged this MXNet with this model into a Docker container dbgannon/mxnet which we have used for these experiments.  (the details of the python code in the container are in the appendix to this blog.

Note: If you want to run this container and if you have dockerand Jupyter installed you can easily test the model with pictures of your own.  Just download the jupyter notebook send-to-mxnet-container.ipynb and follow the instructions there.

How fast can we do the image analysis (in image frames per second)?

Running the full resnet-152 model on an installed version  of MXNet  on more capable machines (Mac mini and the AWS Deeplearning AMI c5.4xlarge, no GPU)  yields an average performance of about 0.7 frame/sec.     Doing  the same experiment on the same machines, but using the docker container and a local version of the Jupyter notebook driver we see the performance degrade a bit to an average of about 0.69 frame/sec (on a benchmark set of images we described in the next paragraph).    With a GPU one should be able to go about 10 times faster.

For the timing tests we used a set of 20 images from the internet that we grabbed and reduced so they average about 25KB in size.  These are stored in the Edge device.  Loading one of these images takes about the same amount of time as grabbing a frame from the camera and reducing it to the same size.  Two of images from the benchmark set and the analysis output is shown in figure 3 below.

bugs

Figure 3. Two of the sample images together with the output analysis and call time.

How can we go faster on the Pi 3?   We are also able to install MXNet on the Pi 3, but it is a non-trivial task as you must build it from the source.   Deployment details are here, however, the resnet 152 model is too large for the 1MB memory of the Pi 3, so we need to find another approach.

The obvious answer is to use a much smaller model such as the Inception 21 layer network which has a model database of only 23MB (vs 310MB  for resnet 152), but it has only 1000 classes vs   the 11,000 of the full rennet 152. We installed Tensorflow on the Pi3. (there are excellent examples of using it for image analysis and recognition provided by Matthew Rubashkin of Silicon Valley Data Science.) We ran the Tensorflow Inception_2015_12_05 which fit in memory on the Pi.  Unfortunately, it was only able to reach 0.48 frames per second on the same image set described above.

To solve the, we need to go to the cloud.    In a manner similar to the Greengrass model,  we will have the Pi3 sample the camera and downsize the image and send it to the cloud for execution.   To test it we ran the MXNet container on a VM in AWS and pointed the Pi camera at various scenes.  The results are shown in Figure 4.

pipict1

pipict2

Figure 4.  The result for the toy dinosaur result is as it is logged into the AWS DynamoDB.   With the bottom two images show only the description string.

The output of the model gives us likelihood of various labels.  In a rather simple minded effort to be more conversational we translate the likelihood results as follows.  If a label X is  more than 75% likely  the container returns a value of “This certainly looks like a X”.   If the likely hood is less than less than 35%  it returns “I think this is an X, but I am not sure” (the code is below).    We look at the top 5 likely labels and they are listed in order.

The Pi device pushes jpeg images to AWS S3 as a blob.   It then pushes the metadata about the image (a blob name and time stamp) to the AWS Simple Queue Service.   We modified the MXNet container to wait for something to land in the queue. When this happens, it takes the image meta data and pulls the image from S3 and does the analysis and finally stores the result in an AWS DynamoDB table.

However we can only go as fast as we can push the images  and metadata to the cloud from the Pi device.   With repeated tries we can achieve 6 frames/sec.   To speed up the analysis to match this input stream we spun up a set of analyzers using the AWS Elastic Container Service (ECS).  The final configuration is shown in Figure 5.

mxnet-cluster

Figure 5.   The full Pi 3 to Cloud image recognition architecture. (The test dataset is shown in the tiny pictures in S3)

To conduct the experiments, we included a time stamp from the edge device with the image metadata. When the MXNet container puts the result in the DynamoDB table it includes another timestamp.  This allows us to compute the total time from image capture to result storage for each image in the stream. If the device sends the entire collection as fast as possible then the difference between the earliest recorded time stamp and the most recent gives us a good measure of how long it takes to complete the entire group.

While the Pi device was able to fill S3 and the queue at 6 frames a second having only one MXNet container instance yielded the result that the total throughput was only about 0.4 frames/sec.   The servers used to host the container are relatively small.   However, using the ECS it is trivial to boost the number of servers and instances.  Because of the size of the container instance is so large only one instance can fit on each of the 8 GB servers.   However as shown in Figure 6 we were able to match the device sending throughput with 16 servers/instances.  At this point messages in the queue were being consumed as fast as they were arriving. Using a more powerful device (a laptop with a core I7 processor) to send the images we were about to boost the input end up to just over 7  frames per second and that  was matched with 20 servers/instances.

framespersec

Figure 6.  Throughput in Frames/second measured from the Pi  device to the final results in the DynamoDB instance.   In the 20 instance case, a faster core I7 laptop was used to send the images.

Final Thoughts

This exercise does not fully explore the utility of AI method deployed at the edge or between the edge and the cloud.  Clearly this type of full object recognition at real-time frame rates is only possible if the edge device has sophisticated accelerator hardware.  On the other hand, there are many simple machine learning models that can be used for more limited applications.   Object motion tracking is one good example.  This can be done in real-time.    This is typically done by comparing a frame to a previous one and looking for the differences.   Suppose you need to invoke fire suppression when a fire is detected.   It would not be had to build a very simple network that can recognize fire but not simple movement of ordinary objects.   Such a network could be invoked whenever movement is detected and if it is fire the appropriate signal can be issued.

Face detection and recognition is possible with the right camera.   This was done  with the Microsoft Xbox-1 and it is now part of the Apple IPhone X.

There are, of course, limits to how much we want our devices to see and analyze what we are doing.   On the other hand it is clear that advances in automated scene analysis and “understanding” are moving very fast.   Driverless cars are here now and will be commonplace in a few years.   Relatively “smart” robots of various types are under development.   It is essential that we understand how the role of these machines in society can benefit the human condition along the lines of the open letter from many AI experts.

 

Notes about the MXNet container.

The code is based on a standard example of using MXNet to load a model and invoke it.   To initialize the model, the container first loads the model files into the root file system.  That part is not show here. The files are full-resnet-152-0000.parms (310MB), full-resnet-152-symbols.json (200KB) and full-synset.txt (300KB) . Once loaded into into memory  the full network is well over 2GB and the container requires over 4GB.

Following the load, the model is initialized.

import mxnet as mx
# 1) Load the pretrained model data
with open (' full-synset.txt ','r ') as f:
        synsets = [l.rstrip() for l in f]
sym, arg _params , aux_pa ram s = mx . model .load _checkpoint( 'full-resnet-152' ,0)
# 2) Build a model from the data
mod = mx.mod .Module (symbol =sym , context =mx. gpu ())
mod. bind ( for_training =False, data_shapes=[( 'data ',(1,3,224,224))])
mod. set_params ( arg_params , aux_params )

The function used for the prediction is very standard.  It takes three parameters: the image object, the model and synsnet (the picture labels).   The image is modified to fit the network and then fed to the forward end.  The output is a Numpy array which is sorted and the top five results are returned.

def predict(img, mod, synsets):
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)  
    img = img[np.newaxis, :] 
    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    result = []
    for i in a[0:5]:
          result.append( [ prob[i], synsets[i][synsets[i].find(' '):]])
    return result

The container runs as a webservice on port 8050 using the Python “Bottle” package.  When it receives a web POST message to “call_predict” it invokes the call_predict function below.  the image has been passed as a jpeg attachment with is extracted with the aid of the request package.   It is saved in a temporary file and then read by the OpenCV read function.  Unfortunately there was no way to avoid the save followed by read because of limitations to the API.  However we measured the cost of this step and it was less than 1% of the total time of the invocation.

The result of the predict function is a two dimensional array with each row consisting of a probability and the associated label.   The call returns the most likely labels as shown below.

@route('/call_predict', method='POST')
def call_predict():
    t0 = time.time()
    result = ''
    request.files.get('file').save('yyyy.jpg', 'wb')
    image = cv2.cvtColor(cv2.imread('yyyy.jpg'), cv2.COLOR_BGR2RGB)
    t1 = time.time()
    result = predict(image,  mod, synsets)
    t2 = time.time()
    answer = "i think this is a "+result[0][1]+" or it may be a "+result[1][1]
    if result[0][0] < 0.3:          answer  = answer+ ", but i am not sure about this."     if result[0][0] > 0.6:
         answer = "I see a "+result[0][1]+"."
    if result[0][0] > 0.75:
         answer = "This certainly looks like  a "+result[0][1]+"."
    answer = answer + " \n total-call-time="+str(t2-t0)
    return(answer)
run(host='0.0.0.0', port=8050)

The version of the MXNet container used in the ESC experiment replace the Bottle code and call_predict with loop that polls the message queue, pulls a blob from S3 and pushes the result to DynamoDB

Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud

 

In a previous post we described Algorithmia,  a cloud service for discovering, invoking and deploying algorithms.  In this short article we look at Algorithmia as a tool to deploy trained machine learning models.   We used a tool called Gensim to build a model of scientific documents and then create an Algorithmia service that uses the model to predict the topic categories of scientific article.

A Review of Word Vectors and Document Vectors.

This technology has been around for a while now, so this review is more of a history lesson and not a deep technical review.  However, we will give you links to some of the important papers on the subject.

If you want to do document analysis through machine learning, you need a way to represent words in a vector form.  Given a collection of documents you can extract all the words to create a vocabulary.  If the size of the vocabulary is 100,000 words, you can represent each word as a “one-shot” vector in which the i-th word in the vocabulary is a vector of zeros except for a 1 in the i-th position.   Then if your then each document in your collection can be represented as the sum of vectors corresponding the words in that document.   If you have M documents, then the collection is represented by the sparse matrix of size M x 100,000.  Using this “bag of words” representation, there are a variety of traditional techniques such as Latent Sematic Analysis that can be used to extract the similarities between documents.

About five years ago, a team from Google found a much better way to create vectors from words so that words that are used in similar semantic context are nearer to each other as vectors.   Described in the paper by Tomas Mikolov et. all.,  the method, often referred to as Word2Vec, can be considered a map m() of our 100,000 dimension space of word to a dense space of much smaller dimension, say 50, with some remarkable properties.   In particular, there is the now-famous analogy linearity relationships. For example “man is to king as woman is to queen” is expressible (approximately) as

m( king) – m(man) + m(woman)   ≈   m(queen)

There is an excellent set of technical explanations of why Word2Vec work on Quora and we won’t go into them here.  One of the best papers that address this issue is by Golberg and Levy.

Le and Mikolov have shown that the basic methods of Word2Vec generalized to paragraphs, so that we now have a map p() from a corpus of paragraphs to vectors. In other words, given a corpus of documents D of size N, then for any doc d in D, p(d) is a vector of some prespecified length that “encodes” d.  At the risk of greatly oversimplifying, the paragraph vector is a concatenation of a component that is specific to the paragraph’s ID with word vectors sampled from the paragraph.   (As with Word2Vec, there are actually two main versions of this model.  Refer to the Le and Mikolov paper for details.)  It turns out that the function p can be extended to arbitrary documents x so that p(x) is an “inferred” vector in the same space vector space.  We can then use p(x) to find the documents d such that p(d) is nearest to p(x).      If we know how to classify the nearby documents, we can make a guess at the classification of x.  That is what we will do below.

Using Doc2Vec to Build a Document Classifier

Next we will use a version of the Paragraph vectors from Gensim’s Doc2Vec model building tools and show how we can use it to build a simple document classifier.  Gensim is a product of  Radim Řehůřek’s RaRe Technologies.    An excellent tutorial for Gensim is this notebook from RaRe.  To initialize Gensim Doc2vec we do the following.

import gensim
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)

This creates a model that, when trained will have vectors of length 50.  The training will use 2 word minimum from each doc for each iteration and there will be 55 iterations.

Next we need to ready a document corpus.   What we will use is 7000 science journal article abstracts from the Cornell University archive ArXiv.org . We have recorded the titles, abstracts and the topic classifications assigned by the authors.   There are several dozen topic categories but we partition them into five major topics: physics, math, computer science, biology and finance.      We have randomly selected 5000 for the training set and we use the remainder plus another 500 from recently posted papers for testing.  We must first convert the text of the abstracts into the format needed by Doc2Vec.  The files are “sciml_train” and “sciml_test”.   The function below preprocesses each of the document abstracts to create the correct corpus.

def read_corpus(fname, tokens_only=False):
        with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
             for i, line in enumerate(f):
                   doc = gensim.utils.simple_preprocess(line)
                   if tokens_only:
                       yield doc
                   else:
                       # For training data, add tags
                       yield gensim.models.doc2vec.TaggedDocument(d, [i])
train_corpus = list(read_corpus("sciml_train"))
test_corpus = list(read_corpus("sciml_test", tokens_only=True))

We next build a vocabulary from the words in the training corpus.   This is a dictionary of all the words together with the counts of the word occurrences.  Once that is done we can train the model.

model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

The training takes about 1 minutes and a simple 4-core server. We can now save the model so that it can be restored for use later with the Python statement model.save(“gensim_model”). We will use this later when building the version we will install in Algorithmia.
The model object contains the 5000 vectors of length 50 that encode our documents. To build our simple classifier we will extract this into an array mar of size 5000 by 50 and normalize each vector to be of unit length. (The normalization will simplify our later computations.)

import Numpy as np
mar = np.zeros((model.docvecs.count, 50))
for i in range(m.count):
    x = np.linalg.norm(model.docvecs[i])
    mar[i] = model.docvecs[i]/x

An interesting thing to do with the mar matrix is to visualize it in 2-d using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The result is shown in the figure below. The points have been color coded based on topic: 1(dee purple) = “math”, 2(blue gray) = “Physics”, 3(blue green) = “bio”, 4(green) = “finance” and 5(yellow) = “compsci”.

tsne-word2vec2

There are two things to note here.   First, the collection is not well balanced in terms of numerical distribution.  About half the collect is physics and there are only a small number of bio and finance papers.   That is the nature of academic science:  lots of physicists publishing papers and not so many quantitative finance or quantitative bio papers in the open literature.   It is interesting to note that the Physics papers divide clearly into two or  three clouds.  (it turns out these separate clouds could be classed as “astrophysics”  and “other physics”.)  Computer science and math have a big overlap and bio has a strong overlap with cs because these are all “quantitative bio” papers.

The classification algorithm is very simple. Our model has a function infer_vector(doc) that will use stochastic methods to interpret the doc into the model vector space. Using that inferred vector we can compute the nearest k documents to it in the model space with the function below.

def find_best(k, abstract):
	preproc = gensim.utils.simple_preprocess(abstract)
	v = model.infer_vector(preproc)
       v0 = v/np.linalg.norm(v)
	norms = []
 	for i in range(5000):
     		norms.append([np.dot(v0,mar[i]), i])
       return norms[0:k]

The dot product of the two normalized vectors is the cosine distance. Because the infer_vector is stochastic in nature, our final version of the classifier calls the find_best ten times and computes an average ranking. (The details are in this notebook. and an Html version.) Selecting one of the more recent abstracts and subjecting it to the classifier gives the result pictured below.

sample-classifier1

The analysis gives the abstract a score of 80 for computer science and 20 for bio. Note that the title contains the detailed ArXiv category, so we see this is correct, but reading the article it could also be cross listed as bio.
On the other hand, there are many examples that easily confuse the system. For example, the one below is classified as quantitative biology in arXiv, but the system can’t decide if it is math, cs or physics.

gensim_result2.JPG

In general we can take the highest ranking score for each member of the test set and then compute a confusion matrix.  The result is shown below.  Each row of the table represents the percent of the best guesses from the system for the row label.

gensim_confusion

One interesting observation here is that in the cases where there is an error in the first guess, the most common mistake was to classify an abstract as mathematics.

Moving the model to Algorithmia

Moving the model to Algorithmia is surprisingly simple.  The first step is to create a data collection in the Algorithmia data cloud.   We created one called “gensim” and it contains the three important files: the gensim model,  topicdict, the dictionary that translates ArXiv topics to our major topics, and the ArXiv topics associated with each of the training documents.   The Algorithmia collection is shown below.  We also loaded the training document titles but they are not necessary.

gensim_algorithmia_data

The main difference between running a trained model in Algorithmia and that of a “normal” algorithm is the part where you load the model from the data container.   The skeleton of the python code now includes a function load_model()which you write and a line that invokes this function as shown below.  Now when your algorithm is loaded into the microservice it first calls the load_model()before invoking the apply(input) function.   For all subsequent invocations of you algorithm while it running in that microservice instance the model is already loaded. (The full source code is here. )

import Algorithmia
import gensim
From gensim.models.doc2vec import Doc2Vec
client = Algorithmia.client()
def load_model():
    file_path = 'data://dbgannon/gensim/gensim_model'
    file_path = client.file(file_path).getFile().name
    model = Doc2Vec.load(file_path)
    # similarly load train_sites and topicdict
    # and create mar by normalizing model data
    return model, mar, topicdict, train_sites

model, mar, topicdict, train_sites = load_model()
def find_best_topic(abstract):
    #body of find_best_topic
def apply(input):
    out = find_best_topic(input)
    return out

Deploying the algorithm follows the same procedure as before.   We add the new algorithm from the Algorithmia portal and clone it.   Assuming the SciDocClassifier.py contains our final version of the source,  we execute the following commands.

git add SciDocClassifier.py
git commit -m "second commit"
git push origin master

Returning to the Algorithmia portal, we can go to the project source editor. From there we need to add the code dependencies. In this case, we wanted exactly the same versions of gensim and Numpy we used in our development environment. As shown below that was easy to specify.

algorithmia-dependencies

The final version has been published as dbgannon/SciDocClassifer and is available for anyone to use.  Once again, our experience with using Algorithmia’s tools have been easy to use and fun to experiment with. There are many algorithms to try out and a starter free account is all you need.

Observations about Serverless Computing

With a few examples from AWS Lambda, Azure Functions and OpenWhisk.

Cloud computing is going through an interesting evolution.  It has gone from a platform for deploying virtual machines to planet-scale systems with extensive collections of data storage, analysis and machine learning services.   Most recently we have seen the emergence of “cloud native” computing, which in its most basic form involves a design pattern of microservices where big applications are decomposed into hundreds of basic stateless components that run on  clusters managed by tools like Kubernetes, Mesos and Swarm.

Serverless computing is the next step in this evolution.  It addresses the following challenges.  Suppose I have a small computation that I want to run against some database at the end of each month. Or suppose I want to have the equivalent of a computational daemon that wakes up and executes a specific task only when certain conditions arise.  For example, when an event arrives in a particular stream or when a file in a storage container has been modified or when a timer goes off.  How do I automate these tasks without paying for a continuously running server?   Unfortunately, the traditional cloud computing infrastructure model would require me to allocate computing resources such as virtual machines or a microservice cluster and my daemon would be a continuously running process.   While I can scale my cluster of VMs up and down, I can’t scale it to zero without my daemon becoming unresponsive.  I only want to pay for my computing WHEN my computation is running.

This is not a totally new idea.  Paying only for the compute that we use goes back to early timesharing and persists with compute-hour “allocations” on supercomputers today.  And there are cloud services such as Azure Data Lake Analytics, Amazon Kinesis or the AWS API gateway, that charge you only for the computation that you use or data that you move and they do not require you to deploy server infrastructure to use them.

However, there is something deeper going on here and it has to do with triggers and another computational paradigm called “Function-as-a-Service”(FaaS). Unlike the serverless examples above which depend upon me invoking a specific well-defined service, FaaS allows a cloud user to define their own function, and then “register” it with the cloud and specify the exact events that will cause it to wake up and execute. As mentioned above, these event triggers can be tied to changes in state of a storage account or database, events associated with queues or streams of data from IoT devices, web API invocations coming from mobile apps.  Triggers can even be defined by steps in the execution of a workflow.   And, of course, the user only pays when and while the function is executing.

faas-diagram

Figure 1.  Function, triggers and output concept. (from markoinsights.com)

There have been two conferences that have focused on the state of serverless computing. The paper “Status of Serverless Computing and Function-as-a-Service (FaaS) in Industry and Research” by Geoffrey Fox, Vatche Ishakian, Vinod Muthusamy and Aleksander Slominski provides an excellent overview of many of the ideas, concepts and questions that surround serverless computing that surfaced at these conferences. In that paper they refer to an IBM tutorial that defines serverless FaaS as
1. short-running, stateless computation
2. event-driven applications
3. scales up and down instantly and automatically
4. based on charge-by-use

Notice that there is a distinction between FaaS and serverless FaaS and it has to do with item 4 above.   A good example of this is Google’s App Engine, which was arguably the first FaaS available from a commercial cloud.   In its current form App Engine can run in one of two modes.  In its standard mode, your applications run in a sandbox and you are charged only when the app is running.  In the “flexible” mode you deploy a container  and then specify the compute infrastructure needed in terms of CPU power, memory, disk and you are charged by the hour.  You could say that App Engine in running Flexible mode is server-lite, but clearly not fully serverless, while standard mode is truly serverless.

What are the serverless FaaS choices?

There are a number of FaaS implementations.  Some of these are used for research while others are commercial products.  The Status report refers to many of these and the slides for the workshop are on-line.   A good example of the research work is OpenLambda from the University of Wisconsin and first introduced dat HotCloud ’16.  Based on this experience the Wisconsin team described Pipsqueak, an experiment to reduce the deployment latencies caused by Python library initializations.   Ryan Chard described Ripple which is an excellent example of distributing event trigger management from the source to the cloud. Ripple has been designed and use for several significant science applications including beamline science (ALS and APS).  Another related technology is if-this-then-that IFTTT that is a service for chaining together other service.

Two other open source projects raise an interesting question about what is behind the curtain of serverless. Funktion  and Fission are both implementations of FaaS on top of Kubernetes. As we discuss serverless computing we must remember that there is a “server” somewhere. The basic infrastructure for serverless computing needs to run somewhere as a persistent service and hence a microservice platform like Kubernetes is a reasonable choice. This relates to the economics of severless computing and we return to that at the end of this report.

The most commonly referenced open source FaaS service is Apache OpenWhisk which was developed by IBM and is available on their Bluemix cloud as a service. The other commercial services include Google Function, Microsoft Azure Functions and Amazon Lambda. At the end of this article we will show some very simple examples of using some of these systems.

When can FaaS replace a mesh of microservices for building an app?

The Status paper also makes several other important observations. For example, they note that while serverless is great for the triggered example described above, it is not good for long-running or statefull applications like databases, deep learning training, heavy stream analytics, Spark or Hadoop analysis and video streaming. In fact, many large-scale cloud-native applications that have thousands of concurrent users require continuous use of massive networks of microservices. This will will not be based on serverless FaaS. However, there may be many cases where a user-facing application running in the cloud could be easily implemented with serverless FaaS rather than as a big microserivce deployment. What we do not know is where the cross-over point from a serverless FaaS implementation of an app to a full kubernetes based massive microservice deployment lies. This relates to the economic of FaaS (discussed briefly at the end of this article). The Cloud Native Computing Foundation has a working group on serverless computing that is addressing this topic.

There is one interesting example of using serverless FaaS to do massively parallel computing and it called pywren. The lovely paper “Occupy the Cloud: Distributed Computing for the 99%”  by Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica and Benjamin Recht describes the concepts in pywren which allow it to scale computation to thousands of concurrent function invocations achieving 40teraflops of compute performance. Pywren uses AWS Lambda in a very clever way: it serializes the computational function which are then passed to a lambda function to execute. We will return to Pywren in another post.

Computing at the Edge?

Perhaps the hottest topic in the general area of cloud computing is when the computing spills out of the cloud to the edge of the network. There are many reasons for this but most boil down to latency and bandwidth. A good example is the use cases that motivate the Ripple system described above. In order to generate a stream of events from sensors at the edge of the network, one needs very light weight computing that can monitor them and generate the events. In many cases it is necessary to preprocess the data in order to send the message to the cloud where a function will be invoked to respond. In some cases, the response must return to the source sooner than a remote invocation can respond because of the latencies involved. The computing at the edge may need to execute the function there and the actual communication with the cloud may be just a final log event or a trigger for some follow-up action.

Another possibility is that the functions can migrate to the places where they are needed. When you deploy the computing at the edge you also deploy a list of functions that that must be invoked. When the edge process starts up it could cache some of the function it needs locally. We expect there will be many variations on these ideas in the future.

A Tiny FaaS Tutorial

By their very nature FaaS systems are very simple to use: you write short, stateless function and tie them to triggers. We will take a quick look at three of these: AWS Lambda, Microsoft Function and IBM Bluemix OpenWhisk.

AWS Lambda Functions

Amazon was the first to introduce serverless functions as a service and it is very well integrated into their ecosystem of other services. Called Lambda functions, they can be easily created in a number of standard programming languages. Lambda functions can be associated with a variety of trigger events including changes to the state of a storage account, web service invocations, stream events and even workflow events.

We will illustrate Lambda with a simple example of a function that responds to Kinesis Stream events and for each event it adds an item to a dynamoDB table. Here is the python code for the function that accomplishes this task.

from __future__ import print_function
#import json
import base64
import boto3
def lambda_handler(event, context):
    dyndb = boto3.resource('dynamodb', region_name='us-west-2')
    table = dyndb.Table("lambdaTable")
    
    for record in event['Records']:
       #Kinesis data is base64 encoded so decode here
       payload=base64.b64decode(record["kinesis"]["data"])
       x = eval(str(payload))
       metadata_item = {'PartitionKey': x['PartitionKey'], 
                        'RowKey': x['RowKey'], 'title': x['text']}
       table.put_item(Item=metadata_item)

There are several critical items that are not explicit here.   We need to invoke the AWS Identity and Access Management (IAM) system to delegate some pemissions to our function and we will need to grab copies of the Amazon Resource Names (ARNs) for this and other objects. First we create a IAM role that will allow access to Kineses streams and DynamoDB. Using the IAM portal we created a role called:  “lambda-kinesis-execution-role” and attached two policies ‘AmazonDynamoDBFullAccess” and “AWSLambdaKinesisExecutionRole”. We then made a copy of the Role ARN.

The next step is to install this function into the lambda system. To do that we put the function above into a file called “ProcessKinesisRecords.py” and then zipped it. We then uploaded the zipped file to S3 (in the us-west-2) in the bucket “dbglambda”. With that we can create our function with the boto3 call from our laptop as:

lambdaclient = boto3.client('lambda')

response = lambdaclient.create_function(
    FunctionName='lambdahandler',
    Runtime='python2.7',
    Role='arn:aws:iam::066xx534:role/lambda-kinesis-execution-role',
    Handler='ProcessKinesisRecords.lambda_handler',
    Code={
        'S3Bucket': 'dbglambda',
        'S3Key': 'ProcessKinesisRecords.zip',
    }
    )

The final item we need is the link between this function and the Kinesis Stream service.
To do that we went to the portal for Kinesis and created a stream called “mylambdastream” and grabbed its ARN. Creating the binding is accomplished with the following.

response = lambdaclient.create_event_source_mapping(
    EventSourceArn=
        'arn:aws:kinesis:us-west-2:06xxx450734:stream/mylambdastream',
    FunctionName='lambdahandler',
    BatchSize=123,
    StartingPosition='TRIM_HORIZON',
)

We can verify the properties of the function we have created by looking at the AWS Lambda portal pages. As shown below, we can verify that our lambdahandler function does indeed have our stream as its trigger.

lambda1

Figure 2. AWS Lambda Function dashboard showing our trigger.

Finally we can invoke it by pushing an event to the kinesis stream. This is shown below.

client = boto3.client('kinesis')
record = "{'PartitionKey':'part2', 'RowKey': '444', 'text':'good day'}"
resp = client.put_record(StreamName='mylambdastream',
    Data=bytearray(record),
    PartitionKey='a')

Checking the DynamoDB portal will verify that the function has picked the message from Kinesis and deposited it in the database. The full details are in the notebook “simple-kinesis-lambda-sender.ipynb”.

Azure Functions

The Azure function portal has a large number of basic templates we can use to build our function as shown below.

azure-funcs1

Figure 3.  Azure Function Template samples

We have selected one of the Python examples that creates a simple web service. Bringing this up on the portal we see the code below.

import os
import json

postreqdata = json.loads(open(os.environ['req']).read())
response = open(os.environ['res'], 'w')
response.write("hello world to "+postreqdata['name'])
response.close()

To get this running, we now must go to the “integrate” page to provide one more detail: we set the authorization level to “anonymous” as shown below.

azure-func2

Figure 4.  Setting the authorization level for the trigger to “anonymous”.

The client is also simple and shown below.

import os
import json
import requests
import json

data = {}
data['name'] = 'goober'
json_data = json.dumps(data)
r = requests.post("https://pylook.azurewebsites.net/api/pylook", 
                   data=json_data)
print(r.status_code, r.reason)
print r.text

Using Azure Functions to pull data from a queue.

We now give another version of the function that reads messages from a queue and puts them in a table. There is no Python template for this one yet, so we will use JavaScript. To use template this we must first create a new storage account or use an existing one. Go to the storage account page and you will seefunction5

Figure 5. Creating a new table and queue in a storage account.

We click on Table and create a new table and remember its name. Then we go to Queues and create a new one and remember its name. Looking in the storage explorer should show for these items. Clicking on the table name should give us a picture of an empty table.

Go back to the portal main page and click + and look for “Function App” and click create. There is a table to fill in like the one below.

function4

Figure 6.  Creating a new function.

We give it a name and allow it to create a new resource group. For the storage we want to use the dropdown and look for the storage account name. (It is important that the storage account is in the same location as the function. We click create and wait for the function to appear on the function portal page. You should see it in your resource groups. Follow that link and you will see that it is running.

Go to the functions tab and hit “+”. It will ask you to pick one of the templates. At the top where it says “language” select Javascript and pick the one that is called QueueTrigger. This will load a basic template for the function. Now edit the template so that it looks like the following example.

function2

 

The main difference between this and the template is that we have added an output table and instructions to push three items into the table. The function is assuming that the items in the queue are of the form

{‘PartitionKey’: ‘part1’, ‘RowKey’: ’73’, ‘content’: ‘some data ‘}

Next we need to tie the queue to our queue and the table to our table. So click on “Integrate” on the left. You need to fill in the form so that it ties it to your stuff as illustrated below.

function3

Figure 8.  The association between the storage account queue and table service with the variable in our function.  Here we have highlighted the “show value” to verify that it has the right storage account.

You should see your storage account in the dropdown menu. Select it. And they add the Table name. You need to do the same for the AzureQueueStorage. Once this is done and your function is saved and the system is running your function should be instantiated and invoked as soon as you send it queue items. For that we have a simple python script in a Jupyter notebook. You can get from  https://SciEngCloud.github.io/py-functions-queue-driver.ipynb.
You will need to fill in your account key for the storage account, but then you should be able to step through the rest. The notebook runs a few tests and then sends 20 items to the queue. Using the Azure storage explorer we see the results in the table as shown below.

function1

Figure 9.   The view of the table after running the Jupyter Notebook

OpenWhisk and IBM Bluemix

OpenWhisk is the open source serverless function system developed by IBM that is also supported as a commercial service on their Bluemix cloud.  Like the others it is reasonably easy to use from their command-line tools, but the Bluemix portal provides an even easier solution.   Creating a function is as easy as selecting the runtime and downloading a template.  Figure xx below illustrates a simple function derived from template for a Python3 function.

action

Figure 10.  An OpenWhisk function that decodes a dictionary input and returns a dictionary.

Notice that the parameter to the function is a python dictionary (all Openwhisk messages are actually Json objects, which for Python are rendered as dictionaries).  While this template can be run as a web service in its present form, it is also possible to connect it to an external trigger.  To illustrate this we connected it to a trigger  that monitors activity in a Github account.  Using the portal it was easy to create the trigger (called “mygithubtrigger” and bind it to push and delete actions on a repository called dbgannon/whisk.   All this required was an access token which was easy available by logging in to the GitHub portal.   In the case of GitHub triggers the event returns a massive Json object that is a complete description of the repository.   In the action “myaction” we go two levels down in the dictionary and extract the official repository description and make that the response of the action to the event.

When a trigger fires you need a rule which bind the firing to an action.   We bound it to the trivial “myaction” example above.  The view from the portal of the rule is below

ruleview

Figure 11.   The rule view of our association of the trigger to the function

We next added a new file to the repository.  This activated the trigger and the rule invoked the action.   The portal has a nice monitor facility and the image is below.

monitor
Figure 12.  This is the “Monitor” view after the trigger has fired.

Finally, drilling down on the “myaction” log event we see the description of the GitHub repository we created.

json-data
Figure 13.  The output of the “myaction” function after a file was added to the GitHub repository.

Finally

These examples above are all very trivial.   The next thing to explore is how functions can be composed into workflows.   Each of the three systems has its own way of doing that and if we have time later we will show some examples of this capability.   We have also not disussed performance or cost which is greatly dependent on the rate at which your triggers are firing and the amount of work in each function execution.

The economics of serverless computing are also very interesting.  As we pointed out  earlier, for a cloud provider to offer a FaaS serverless capability it must be supported by actual infrastructure.   How can the provider afford this if it is only charging by the second?  There are two possible answers.   First if your FaaS is serving very large numbers of function per second then it will certainly pay for itself.   But there is another consideration.   Many of the big cloud providers are running vast microservice frameworks that support most, if not all of their big internal and publiclly available applications.  Running  a FaaS system on top of that is as easy as running any other microservice based application.  One only needs to set the cost per second of function evaluation high enough to cover the cost of that share of the underlying platform the FaaS system is using.   For Amazon Lambda “The price depends on the amount of memory you allocate to your function. You are charged $0.00001667 for every GB-second used.”  This amounts to $0.06 per GB hour.  So for a function that take 8 GB of memory to execute that is $.48 per hour, which is 4.8 times greater than the cost of an 8GB m4.large ec2 instance.   Consequently a heavily used FaaS system is a financial win and a lightly used one may have little impact on your infrastructure.

Of course, single executions of a FaaS function are limited to five minutes, so you would need 12 5 minute concurrent executions to reach an hour of execution time. Furthermore AWS give you 1 million invocations (at 100ms) each month for free or 400,000 GB-seconds per month free.  That is 111 GB hours per month for free.   This is not a bad deal and it would seem to indicate that Lambda is not a big drain on their infrastructure yet.

Acknowledgement:  Thanks go to Ryan Chard for helpful comments and suggestions on a draft of this post.

Azure’s new CosmosDB Planet-Scale Database

The Microsoft Azure team has recently released CosmosDB, an new entry to the cloud data storage management marketplace. Cosmos is actually the name of a data storage system that has been used internal to Microsoft for many years. That original Cosmos has now morphed into Azure Data Lake Storage (ADLS) and Analytics. ADLS is focused on Hadoop/HDFS compatible scalable analytics and, at the present time, it is available in only two of the US data centers. CosmosDB is related to the original Cosmos in name only and it is available at all of the data centers.

There is an excellent technical description [https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db], so we will hit only a few highlights here.  CosmosDB has four main API components and a number of interesting properties. From the user’s perspective these components are

  1. DocumentDB; “a NoSQL document database service designed from the ground up to natively support JSON and JavaScript directly inside the database engine.”
  2. MongoDB; CosmosDB implements the MongoDB API so that users of that popular NoSQL database can easily move to CosmosDB.
  3. Azure Table Storage: a simple key-value table mechanism that has been part of Azure from it earliest days.
  4. Gremlin Graph Query API; a graph traversal language based on functional programming concepts.

There is a common “resource model” that unifies all of these capabilities. A database consists of users with varying permissions and containers. A container holds one of the three content types, document collection, tables or graphs. Special resources like stored procedures, triggers and user-defined-functions (UDFs) are also stored within the container. At the top level the user creates a cosmosDB database account and in doing so the user must pick one of these four APIs. In the examples we show here we focus on the DocumentDB API. In a later post we will look at Gremlin.

There are four important properties that every CosmosDB databased has.

  1. Global distribution: your database is replicated in any of 30+ different regions and you can pick these from a map on the Azure portal.
  2. Elastic scale-out: you can scale throughput of a container by programmatically provisioning throughput at a second or minute granularity. provision throughput (measured in using a currency unit called, Request Unit or RU).
  3. Guaranteed low latency: For a typical 1KB item, Azure Cosmos DB guarantees end-to-end latency of reads under 10ms and indexed writes under 15ms at the 99th percentile within the same Azure region.
  4. Five consistency models.
  5. A Comprehensive Service Level Agreement (SLA).

Global Distribution

CosmosDB has two forms of distribution. Each database is composed of one or more collections and every data collection is stored in a logical container that is distributed over one or more physical server partitions. And, of course, everything is replicated. Every item in a collection has a partition key and a unique ID. The partition key is hashed and the hash is used to make the assignment of that item to one of the physical servers associated with that container. This level of partitioning happens automatically when you create a collection and start filling it.

Global distribution is more interesting. When you first create a database it has an initial “write location” which refers to one of the 30 or so Azure regions around the globe. You can use the portal to say how your data is distributed to other regions. In the example we discuss at the end of this document our original region “North Central US”. We have used the “Replicate Data Globally” function, which gives us a map in the portal, where we selected three additional regions, “West Europe”, “South India” and “Brazil South” as shown in Figure 1 below.

geodistributed-db

Figure 1. Selecting three additional locations for a database using a map tool.

With a small (170MB) database of documents it took only a few minutes to replicate the data to these locations. The data collections in the replicate sites are considered “read locations”. Which means that a user in these locations will by default read data from the local replica. However if that remote user attempts a write the request is routed to the “write location”. To understand how long a remote reader takes to see an update from the remote write location we need to discuss the consistency models.

Elastic scale-out and The Cost of Computing

Every operation you do in CosmosDB requires bandwidth and computation. You can control the rate at which “energy” is consumed for your database with something called Request Units (RU) and RUs/sec and RUs/min. You specify the RUs/sec you are willing to “spend” on your database and CosmosDB will provision the resources to meet your throughput requirements.

Consistency Models

Consistency is one of the most difficult topics in distributed systems. In the case of distrubted or replicated databases, consistency the consistency model tell the user how changes in one copy of a database are reflected updates to other copies. By insisting that consistency be very strong, this may have an adverse impact on latencies and throughput. So the choice of consistence model can have a profound impact on application performance and cost.

One of the truly unique features of CosmosDB is that it gives the user a choice of five different consistency models. In order from the weakest to the strongest and in terms of RUs, the cheapest to most expensive they are:

  1. Eventual consistency. Eventual consistency is the weakest form of consistency wherein a client may get the values which are older than the ones it had seen before, over time. In the absence of any further writes, the replicas within the group will eventually converge.
  2. Consistent prefix. Consistent prefix level guarantees that reads never see out of order writes. If writes were performed in the order `A, B, C`, then a client sees either `A`, `A,B`, or `A,B,C`, but never out of order like `A,C` or `B,A,C`.
  3. Session consistency. When you connect to a cosmosDB database through its URL you are creating a session. Session consistency guarantees monotonic reads, monotonic writes, and read your own writes (RYW) guarantees for the duration of the session. In the Python API you create client object that encapsulates your connection to the database. The parameter “Consistency_level” has the default value “Session”. Reads in session consistency take a bit longer than consistent prefix which takes longer than eventual consistency.
  4. Bounded stateness. This model has two ways to insist on consistency. You can define a time interval such that beyond that interval of time from the present, the system will guarantee consistency. Alternatively you can specify a the upper bound on the number of reads that lag behind writes your reader can be. Beyond that point consistency is guaranteed. Of course the smaller you make the window, the more computing you may consume and delay you may encounter.
  5. Strong consistency. The most expensive and most complete. it is linearizable in that reads always returns the most recent writes. But it is limited to collections that are not geodistributed.

A word about the performance of these models. In the example at the end of this section we put the write region of the database in North America and replicated in three other locations with one in Western Europe. We then created a virtual machine in Western Europe and connected it to the database and verified by the IP address that it was connecting to the local replica. We then fired changes at the North America region and read with each of the consistency models. We were unable to detect any out of order reads and the responses were extremely fast. This is obviously not a full scientific study, but it was clear we had a system that performed very well. We also attempted a write to the local Western Europe “read location”. We were able to detect that the API call immediately dropped the read location connection and reconnected to the North American copy to do the write.

Comprehensive SLA

The service level agreement (SLA) for CosmosDB is actually a contract that Azure and you agree to when you create a database. It is indeed comprehensive. It covers guarantees for availability, throughput, consistency and latency giving you upper bounds for each measurable attribute of performance. As you are the one to specify how many RUs/sec you are willing to allocate, some of the temporal guarantees may be dependent upon this allocation not being exceeded. We will not go into it in detail here. Read it with your lawyer.

A look at Azure DocumentDB using Python and GDELT Data

To illustrate the basics of DocumentDB we will use a tiny part of the amazing GDELT event collection. This data collection is available on AWS S3 as well as Google’s BigQuery. The GDELT project (http://gdeltproject.org) is the brainchild of Kalev Leetaru from Georgetown University and it is an serious “Big Data” collection. GDELT’s collection is the result of mining of hundreds of thousands of broadcast print and online new sources from every corner of the world every day. What we will look at here is a microscopic window into the daily collection of news items. In fact, we will take a look at the news from from one day: June 30, 2017.

AWS keeps the data on S3 and downloading it is dead easy. The command

$ aws s3 cp s3://gdelt-open-data/events/20170630.export.csv . 

will download the 20MB June 30 2017 dataset as a CSV file. Each row of the file is a record of an “event” consisting of some 60 attributes that catalog a publication of a news item. Among the fields in the record are a unique identifier, a timestamp and the name and geolocation of two “actors” that are part of the event. In addition, there is a url a published news item about the event. The actors are derived from the even. Actor1 is called an initiator of the event and actor2 is a recipient or victim of the event. These actors are usually represented by the names of the cities or countries associated with the actual actors.

For example the story “Qatar’s defense minister to visit Turkey amid base controversy” (http://jordantimes.com/news/local/qatar’s-defence-minister-visit-turkey-amid-base-controversy) appeared in the Jordan Times. It describes an event where the Qatar defense minister visited Turkey as Turkey resists pressure from Saudi Arabia, the United Arab Emirates, Egypt and Bahrain to close bases after the five other nations pressed sanctions agains Qatar on June 5. This URL appears three time in the database for June 30. In one case, actor1 is Doha Qatar and actor2 is Ankora Turkey. In the second case actor1 is Riyadh, Saudi Arabia actor2 is Turkey. In the third case actor1 and actor2 are both Riyadh. Though the identity of the actor cities is based on an automated analysis of the story, their may be three different stories to analyze. However, only one URL has been selected to represent all three.

For our purposes we will describe this as a single event that links the three cities, Doha, Ankora and Riyadh. Many researchers have used this GDELT data to do deep political and social analysis of the world we live in, but that goal is way beyond what we intend here. What we will try to do in this example is to use CosmosDB to graphically illustrate different ways we can represent the events of June 30 and the cities they impact.

Creating the DocumentDB from the CSV file.

We will begin by creating a simple document database where each document is the JSON record for one line of the CSV file. Actually we will only include the URL, the two actor cities and their geolocations. To begin we must connect you client program to the CosmosDB system.

We use Python and we assume that the documented tools have been installed.

#run this if documentdb is not installed
#!pip install pydocumentdb
import csv
import pydocumentdb;
import pydocumentdb.document_client as document_client

We next create the database and the collection. But first we need an Azure CosmosDB account and is best done on the Azure portal. One that is there you can retrieve the account key so that we can create the collection.
Our account is called bookdockdb and our database is called ‘db3’ and the new collection will be called ‘gdelt’.

config = { 
    'ENDPOINT': 'https://bookdocdb.documents.azure.com',
    'MASTERKEY': 'your db key here',
    'DOCUMENTDB_DATABASE': 'db3',
    'DOCUMENTDB_COLLECTION': 'gdelt'
};
#we create the database, but if it is already there the "CreateDatabase will fail"
#In that case, we can query for it. 
#this is handy because we can have more than one collection in  a given database.
try:
    db = client.CreateDatabase({ 'id': config['DOCUMENTDB_DATABASE'] }, )
except:
    db_id = config['DOCUMENTDB_DATABASE']
    db_query = "select * from r where r.id = '{0}'".format(db_id)
    db = list(client.QueryDatabases(db_query))[0]

Next we will create the collection. Again, we may already have the collection there, so we can add new items to an existing collection but we need a different method to retrieve the collection handle.

options = {
    'offerEnableRUPerMinuteThroughput': True,
    'offerVersion': "V2",
    'offerThroughput': 400
}
try:
    collection = client.CreateCollection(db['_self'], 
         { 'id': config['DOCUMENTDB_COLLECTION'] }, options)
except:
    collection = next((coll for coll in client.ReadCollections(db['_self']) \
         if coll['id']==config['DOCUMENTDB_COLLECTION']))

The next step is to read the CSV file and add the new documents to the database collection.
We are going to do this one line at a time and keep only the time stamps, actor cities and geolocations (latitude and longitude) and the associated URL. We will also keep a separate list of the URLs for later use.

url_list = []
with open('20170630.export.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter='\t')
    for row in readCSV:
        url_list.append(row[57])
        document1 = client.CreateDocument(collection['_self'],
            { 
             'id': row[0],
             'time_stamp': row[1],
             'time_stamp2': row[4],
             'city0_lat': row[46],
             'city0_lon': row[47],
             'city0_name': row[43],
              'city1_lat': row[53],
              'city1_lon': row[54],
              'city1_name': row[50],
              'url': row[57] 
              }
            )

The next thing we will do is to look at the URLs that involve a large number of cities (because they are often more interesting). Because we have the geolocations of each city we can draw a map and link the cities associated with the URL.

To do this we need to remove all the duplicates in our URL list using a standard Python trick: convert the list to a set and then back to a list.

s = set(url_list)
url_list = list(s)

We next query the database collection to find the cities associated with each URL. The best way to do this in the cloud is a simple map-reduce operation where we map each record to a pair consisting of a URL and a list of cities. We then invoke a reduce-by-key function to reduce all pairs with the same key to a single pair with that key with all of the lists associated with that key concatenated.
If we were doing more than a single day’s worth of data, It would be worth bringing Spark into the picture and using that for the map-reduce step. Instead we will do it sequentially using a data base query for each URL. The following function accomplishes that task.

def get_cities_for_url(url):
    db_query = {'query':"select * from r where r.url = '{0}'".format(url) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 1000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    cities = []
    for x in results:
        if x['city1_name'] != '':
            cities.append(x['city1_name']+" lat="+x['city1_lat'] + ' lon='+x['city1_lon'])
        if x['city0_name'] != '':
            cities.append(x['city0_name']+" lat="+x['city0_lat'] + ' lon='+x['city0_lon'])
    city_set = set(cities)
    cities = list(city_set)
    return cities

The function returns a list of strings where each string is a city and its geolocation. We have used the matplotlib Basemap tools to render the most interesting sets of cities where we define “interesting” to be those URL sets with more than 5 cities. In this case we have only plotted a few hundred of the city clusters.

gdelt-geoimage

Figure 2.   Cities linked by shared stories.

While this is colorful, it is not very informative. We can tell there was some events in Australia that involved Europe and the US (this turned out to be a scandal involving an important person in the catholic church) and there was a number of stories that involve North Korea, the US and China. Something happened in the Northwest that involved cities in the east coast of the US.n general the information content is low.

To create a different view, let’s transform the collection to one that reflects the set of cities associated with each URL. To accomplish that we call the get_cities_for_url() function above and create a new document that only three attributes: an ID, the URL and a list of cities associated with the URL. The new collection is called “cities”, but it is really more like a dictionary that associates the URLs with a list of cities.  (The Jupyter Notebook used to build this collection and illustrate the example below is available here.)

One thing we can do is to search for special words or phrases in the URLs and list the cities that are associated. To do this we can make use of another feature of DocumentDB and create a user defined function to do the string search. This is extremely easy and can be done from the Azure Portal. The functions are all java script. The data explorer tab for the “cities” collection and select the “New User Defined Function” tab. In our case we create a function “findsubstring” that searches for a substring “sub” in a longer string “s”.

function  findsubstring(s, sub){
    t = s.indexOf(sub)
    return t
}

We can now call this function in a database query as follows.

def checkInDB(name):
    db_query = {'query':"select r.url, r.cities from r where udf.findsubstring(r.url,'{0}')>0".format(name) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 10000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    return results

Picking a name from the news, “Ivanka”, we can see what stories and cities mentioned “Ivanka” on that day. The call checkInDB(“Ivanka”) returns

[{'cities': ['Oregon, United States', 
'White House, District of Columbia, United States',
'Buhari, Kano, Nigeria',
'Canyonville Christian Academy, Oregon, United States ', 
'Chibok, Borno, Nigeria', 
'Virginia, United States'],
'url': 'http://www.ghanaweb.com/GhanaHomePage/world/Trump-and-Ivanka-host-two-Chibok-girls-at-the-White-House-553963'}]

which is a story about a reception for two girls that were rescued from the Boko Haram in Africa. The list of associated cities include Canyonville Christian Academy, a private boarding school that has offered to continue their education.

A More Interactive Abstract View of the Data.

Another interesting approach to looking at the data is to consider stories that link together smaller cities in some ways. We are going to look for “local” story connections rather than big news items. More specifically, we will look at cities that appear in at least 3 but no more than 5 stories in the days news. This leaves out big cities like washington DC and big global headlines. Next we will build a graph that we can use to explore the stories and cities that share these more regional stories. There are actually 245 cities that appear in our restricted set.

In order to make our graph we need a dictionary that maps cities into the list of stories that mention it. Call that dictionary “stories” Our visualization has nodes that are cities and we will connect them by an edge if they appear in a story together. Let “linkcities” be the list of 245 cities describe above. We can then compute the edges using as follows.

edges = []
for i in range(0, len(linkcities)):
    for j in range(0, len(linkcities)):
        if i < j:             
		iset = set(stories[linkcities[i]])             
		jset = set(stories[linkcities[j]])             
		kset = iset.intersection(jset)             
		if len(kset)> 0:
                	edges.append((i,j))

The complete code for this example is given in this Jupyter notebook. We use the Plotly package to build and interactive graph. The nodes are distributed approximately over the surface of a sphere in 3D. You can rotate the sphere and mouse over nodes Doing so, shows the URLs for the stories associated with that town. The sphere appears as the image below.

city-graph

Figure 3.  Interactive city graph

A better way to see this is to interact with it. You can do this by going to the bottom panel in this file [https://SciEngCloud.github.io/city-graph.html].

Conclusion

Our experience with CosmosDB has been limited to using the DocumentDB API, but we were impressed with the experience. In particular, the ease with which one can globally distribute the data was impressive as were the selection of consistency protocols. As we said, we were unable to force the database to show temporary inconsistency even with the weakest protocol choice using very distant copies of the database. This should not be considered a fault. Our tests were too easy.  The examples above do not begin to stretch all the capabilities of CosmosDB, but we hope they provide a reasonable introduction for the Python programmer.)

One final comment. It seems CosmosDB was built on top of the Azure microservice container system. It would be very interesting to see more details of the implementation.