Observations About Streaming Data Analytics for Science

I recently had the pleasure of attending two excellent workshops on the topic of streaming data analytics and science.  A goal of the workshops was to understand the state of the art of “big data” streaming applications in scientific research and, if possible, identify common themes and challenges.  Called Stream2015 and Stream2016, these meetings were organized by Geoffrey Fox, Lavanya Ramakrishnan and Shantenu Jha.   The talks came from an excellent collection of scientists from universities and the national labs, as well as professional software engineers who are building cloud-scale streaming data tools for the Internet industry.

First it is important to understand what we mean by streaming data analytics and why it has become so important.   Most scientific data analysis involves “data at rest”: data that was generated by a physical experiment or simulation and saved in files in some storage system.   That data is then analyzed, visualized and summarized by various researchers over a period of time.   The sizes of scientific data archives are growing and the number of disciplines creating new ones is expanding.    New organizations like the Research Data Alliance have been created to help coordinate the development and sharing of scientific data collections.   However, not all data is “at rest” in this sense.   Sometimes data takes the form of an unbounded stream of information: for example, the continuous stream of live instrument data from on-line sensors or other “internet of things” (IoT) devices.  Even computer system logs can produce large continuous streams.  Other examples include data from continuously running experiments or automated observatories such as radio telescopes, or the output of DNA sequencers.

In some cases, the volume and rate of generation are so large that we cannot keep the data at rest for very long.  The observed data from the Square Kilometer Array (SKA) will be so large that it is too expensive to contemplate keeping it, and therefore it must be immediately processed into a reduced stream.  An important aspect of this large-scale streaming scientific data analysis is computational steering: the need for a human or smart process to analyze the data stream for quality or relevance and then to make rapid adjustments to the source instruments or simulations. The report from the first Streams workshop describes many of these cases.  For example, autonomous vehicles processing radar data streams for oil and gas exploration, or modern avionics systems that have to recognize bad data in real-time.  Data coming from superconducting tokamak experiments must be managed and analyzed in real-time to adjust the control settings and prevent catastrophic events.

This article has two parts.   In this part we will look at the issue of streaming data in science and then present some of the lessons I gathered from the workshops.  The workshop organizers have not released their final report for Stream2016, so their conclusions may be vastly different from my own.   In the second part we take a deep dive into the cloud-centric data analytics tools to try to understand the landscape of ideas and approaches that have evolved in recent years in this community.

There are many factors that determine when a particular technology is appropriate for a particular problem.  Streaming data analytics is an interesting case that illustrates how diverse challenges and requirements have led software designers to build vastly different solutions.   For example, the software built to manage the vast Twitter data streams just can’t handle the analytic problems encountered when steering high-end electron microscopy experiments.  It is worth trying to understand why this is the case.

We can divide the spectrum of streaming data scenarios into three basic categories:

  1. The data streaming challenges that confront large enterprises when dealing with the data from millions of users of Internet-enabled devices.   These might be the “click-streams” from browsers to search engines, where it is critical to understand user sentiment or where to place advertisements based on previous user queries.  The stream may be the vast logs of the behavior of systems with tens of thousands of active machines that need to be constantly monitored, scaled and serviced.  In these cases, the individual events that make up the stream are often very small records of a few bytes in length.
  2. Large-scale environmental or urban sensor networks such as wide-area earthquake sensor networks, the NSF Ocean Observatories Initiative or urban sensor networks such as those proposed in Chicago’s Array of Things project.  These are very heterogeneous collections of data streams that may involve instruments with very different stream characteristics.   For example, some small sensors may generate a high rate of small messages while others may generate bursts of large, MByte-size messages, such as you would see from a UAV surfacing and uploading many records.  They may require intermediate analysis at various stages but final analysis “downstream”.   Another good example is the stream of data from a swarm of robots that must be analyzed to avoid collisions (see the paper by He, Kamburugamuve and Fox, which describes this real-time challenge in detail).
  3. The streams generated by very large experimental facilities like the Large Hadron Collider, the Square Kilometer Array, the Advanced Photon Source and massive supercomputer simulations.  These large-scale scientific experiments can be extremely complex and involve large numbers of instruments or HPC simulations, multiple data analysis steps and a distributed set of collaborators. Most of the data analysis in these experiments is not like the pure streaming data challenges we see in items 1 and 2.   The data streams are often extremely large file objects that must move through complex laboratory networks.   The orchestration of the streaming activity more accurately resembles workflow than data flow, and often that workflow must allow a human in the loop.

While it is tempting to think that one solution paradigm can cover all the bases, this may not be the case.  Case 1 has led to an explosion of creativity in the open source community and several very significant Apache projects.  These include Spark Streaming, which has been derived from the Spark parallel data analysis system;  Twitter’s Storm system, which has been redesigned by Twitter as Heron; Apache Flink from the German Stratosphere project; and Google’s Dataflow (also see this article), which is becoming Apache Beam and will run on top of Flink and Spark.  Other university projects include Neptune and the Granules project at Colorado State.   In addition to Google Cloud Dataflow, other cloud provider offerings include Amazon Kinesis,  Azure Stream Analytics and IBM Stream Analytics.   (Are you confused yet?   In the second part of this report we will describe many of these in much greater detail.)

It turns out that many of the tools described above for case 1 also apply to case 2 under certain conditions.  The challenges arise in two areas.  If the real-time demands of the application require very low latencies, such as is required for various UAV challenges, some cloud solutions can be lacking.  However, Kamburugamuve, Ekanayake, Pathirage and Fox demonstrate that Storm’s basic communication mechanisms can be vastly improved using collective communication that exploits shared memory and optimized routing to meet the demands of the robot swarm example mentioned above.  The second challenge arises when the size of the individual events in the stream is large (greater than a megabyte), as you may find with instruments that deal with image or sound objects; such streams may not work at all with many of the systems designed with case 1 in mind.  Algorithmic methods can be used to reduce the size so that approximate methods can identify events for deeper off-line analysis.  In many of these instrument streaming cases it is necessary to do more processing of the stream “near the edge”.   In other words, many small data sources can be “pre-analyzed” by processors very near the source.   For example, the Apache Quarks edge-analytics tools are designed to run in very small systems such as the Raspberry Pi.

Case 3 presents the greatest departure from the emerging open source tools.  The ATLAS experiment on the Large Hadron Collider (LHC) has a large Monte Carlo simulation component, and the team has converted the processing of the data into a relatively fine-grained event stream called the ATLAS Event Service.  A distributed workload manager, PanDA, manages a global queue of analysis tasks that can be executed on a variety of platforms including Amazon and, in a specialized form called Yoda, on HPC systems.

At the other end of the application spectrum, a massively parallel simulation model running on an exascale computer can generate vast amounts of data.   Every few simulated time steps the program may generate a very large (50GB or more) data structure distributed over a thousand parallel processing elements.  You can save a few of these to a file system, but it is now preferable to create a stream of these big objects and let another analysis system consume them directly.  The state-of-the-art HPC I/O library, called ADIOS, provides a very simple, standard-looking API to the application programmer, but the back-end of ADIOS can be adapted to a variety of storage or networking layers while taking full advantage of the parallel I/O capabilities of the host system.   One such back-end is facilitated by a networking layer, EVPath, that provides the flow control needed to handle such a massive stream.   Another back-end target for ADIOS is DataSpaces, a system for creating shared data structures between applications across distributed systems.  DataSpaces accomplishes this by mapping n-dimensional array objects to one dimension using a distributed hash table and Hilbert space-filling curves.   Together these provide a variety of streaming abstractions to allow data to move from one HPC application to a variety of HPC data analysis and visualization tools, as illustrated in Figure 1.

adios

Figure 1.  From “Stream Processing for Remote Collaborative Data Analysis” by Klasky, Chang, Choi, Churchill, Kurc, Parashar, Sim, Wolf and Wu.  ORNL, PPPL, Rutgers, GT, SBU, UTK, LBNL.  White paper Stream2016 workshop.  

At the Stream2016 workshop Kerstin Kleese Van Dam made the important observation that the workflow systems managing the stream analytics of time-critical experiments can be complex, and the success of the experiment depends upon reliable performance of the overall system. The use case she described is in operando catalysis experiments.   More specifically, this involves the steering of high-end electron microscopy experiments in which a beam of electrons is transmitted through an ultra-thin specimen, interacting with the specimen as it passes through. These experiments can generate atomic-resolution diffraction patterns, images and spectra under wide-ranging environmental conditions. In-situ observations with these instruments allow physical, chemical or biological processes and phenomena to be observed as they evolve.  These experiments generate from 10 GB to tens of TB of data (e.g., at BNL) at rates ranging from 100 images/sec for basic instruments to 1600 images/sec for state-of-the-art systems. To optimize the scientific outcome of such experiments it is essential to analyze and interpret the results as they are emerging.  It is essential that the workflow system reliably deliver optimal performance, especially in situations where time-critical decisions must be made or computing resources are limited.

The current systems in use include the Analysis in Motion framework developed by PNNL, but the challenge that is presented here is to enact the workflow in a way that yields reliable performance. The workflows are frequently composite applications built from loosely coupled parts, running on a loosely connected set of distributed and heterogeneous computational resources. Each workflow task may be designed for a different programming model and implemented in a different language, and most communicate via files sent over general purpose networks.  This research group currently has a DOE project to demonstrate “Integrated End-to-End Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD)”.

Concluding Observations. 

The streaming data landscape is very new and evolving fast.  I have come to the conclusion that of the three application domains described above (1: Internet Data Analysis, 2: Array of Things Instruments, 3: Big Science) only 1 and 2 are starting to see convergence of lines of thought and models of computing while 3 will always be driven by very different requirements and needs.   The bleeding edge of science does not have the deep pockets of Google or Amazon when it comes to IT resources.   Their budgets are dominated by the massive experimental facilities and supercomputers and hence the solutions must be custom.  And each experimental domain is unique enough that few common tools beyond MPI exist. On the other hand, one can argue that Twitter, Google and all of the various Apache projects discussed here are also custom built for the problems they each are trying to solve.   This is a world of “bespoke” software systems.

Algorithms and Analysis

An area where there will be great opportunity for sharing is in the algorithmic techniques that will be used to analyze the data.  The Stream2015 report observed that a variety of compelling research topics have emerged, including adaptive sampling, online clustering, sketching and approximation methods that trade space and time complexity for accuracy.   Sketching reduces an element of the stream to a basic form that allows easy generation of approximate answers to important queries.  There are many forms of sketching. Szalay described an elegant way to do principal component analysis (PCA) in a streaming context.  This provides a way to reduce the spectral complexity of a stream of big events.  Machine learning classifiers can be, and are, used as part of stream analytics across application domains as diverse as tweet analysis and medical imaging.   With the growing capabilities of deep learning systems, more data, images and sounds can be analyzed and recognized in near real-time.   Skype can do near-real-time natural language translation and face recognition from video streams.   Applying the same technology to sifting through streams of instrument data will lead to new tools to understand earthquakes, hurricanes and tornadoes. We anticipate a lot of great work emerging from this area.
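
As a concrete illustration of the sketching idea (my own toy Python example, not one of the specific techniques presented at the workshops), a count-min sketch answers approximate frequency queries over an unbounded stream using a small, fixed amount of memory:

import hashlib

class CountMinSketch:
    """Toy count-min sketch: approximate item counts in fixed space."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # one independent-ish hash per row, folded into the table width
        h = hashlib.md5(("%d:%s" % (row, item)).encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def count(self, item):
        # never under-counts; over-counts by a small bounded amount with high probability
        return min(self.table[row][self._hash(item, row)] for row in range(self.depth))

# Usage: feed a stream of event labels, then query approximate frequencies.
sketch = CountMinSketch()
for event in ["quake", "quake", "noise", "quake"]:
    sketch.add(event)
print(sketch.count("quake"))   # 3 (approximate, with one-sided error)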

Azure Container Services Are Now Live: An Initial Look

The Microsoft Azure container services are now live and, for the most part, they work very well.  There are actually two container services the Azure team is supporting.   One is Mesosphere DC/OS and the other is Docker Swarm.   I have been using various versions of Mesos and Mesosphere for a year now, but those deployments were somewhat ad hoc. Some previous postings are here and here, and this article provides some updates to both.   These services are now in “general availability”, which is Microsoft-speak for “it is now a product”.  There is a good start-up tutorial available here which will lead you through the setup phase.  In this post we will focus on some basic features of DC/OS and show a very simple example of how well it scales.   In a future post we will look at Swarm.

DC/OS

Following the introductory tutorial linked above, it was relatively easy to create a DC/OS cluster with 8 worker nodes (and one public node) and one master.    Using the instructions, we also created a secure tunnel to the master node, mapping port 80 there to localhost port 80.   The web link http://localhost on my Windows 10 box brought up the DC/OS web user interface.   What you see is the summary of all of the resources used, as shown in Figure 1 below.

dcos1

Figure 1.  DCOS web interface.

DCOS is the distributed cluster operating system and its job is to support deployed services.   The most valuable of these services is Marathon, which is a container orchestration service that will allow you to easily scale the number of instances of your containers and keep them running.   It can also be used to enforce special constraints.   For example, if you deploy a Docker container that needs to bind to a special port on the host, it will not schedule another conflicting instance on that same host.   And it has a very nice graphical user interface, shown in Figure 2, that can be accessed through the DCOS interface.

dcos2

Figure 2.   Marathon interface showing all running services.

As you can see above, I have an instance of Apache Spark, two instances of the streaming service Storm, one instance of the Zeppelin notebook and one instance of the simple web server Nginx all running.  Launching a new Docker container or service is very simple: fill in a web form.    However, there is also a command line tool that works very well on Linux and Windows.  For example, to get the information in Figure 2 above, the command line call is as follows.

dcos7
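
The screenshot above shows the call and its output.  The listing itself is Marathon’s standard app-list subcommand of the dcos CLI which, assuming the CLI has been attached to the cluster, is:

dcos marathon app list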

The same command line interface can be used to launch new container instances and we will illustrate that below.

DCOS also has views of the individual resources.   Figure 3 displays the current view of all of the individual nodes in the cluster, showing how many of the nodes are holding active containers or services.

dcos3

Figure 3.  DCOS display of worker node status and load

A Simple Example.

There are many prepackaged apps available for a one-click launch, such as those listed above.  I originally wanted to use Kafka in the demo, but there is still a bug with my deployment that does not allow me to access the Kafka gateway or the public node (10.0.0.5 in the Figure 3 list).   I will revise this report with an update as soon as I can solve that problem.

The example is a simple message filtering experiment.   Assume you have some source of independent tasks that must be analyzed as fast as possible and the results stored in a table or database.  Assume further that your task stream can be pre-filtered and sorted into buckets of similar tasks that can be analyzed by code that is best suited for the tasks in that bucket.  For example, some tasks contain images of landscapes and others contain images of animals, and you want to provide analyzers that are appropriate for each.   Or you are looking at logging data and your pre-filters detect several different types of anomalies and you want to group anomalies of similar type together.   We will use queues to hold the contents of each bucket.  The pre-filters push the data into the queues and workers pull the tasks off the queue, do the analysis and push the results into a table. The general picture is shown in Figure 4 below.

dcos4

Figure 4.   Sample “microservice” configuration for our experiment.

Depending upon the complexity of the analysis undertaken by the worker and the arrival rate of tasks into the queues we may need to increase the number of workers assigned to each bucket queue as shown in Figure 5.

dcos5

Figure 5.   Adding additional workers to manage extra work at each queue.

In this simple experiment we will look at how increasing the number of workers can improve the throughput of the system.   Now for the details of the set-up.   Instead of using Kafka, we will use another common message broker, RabbitMQ, running on another Linux box on Azure.   We use the Azure Table service to store the results.   Our worker service is a Docker container running a simple Python program that has two parts.

  1. When the worker starts up it does not know which queue to listen to.    So it looks in a separate queue called “roles” that contains the name of the queue needing an extra worker.
  2. Once it has the name of the queue to work on, it begins pulling data items from that queue, processing them and saving the results to the table.  A time stamp is added to each item as it goes into the table.

In a real application step 2 can include task specialization once the worker knows what queue it is working on.  For example, in our text classifier example we loaded specific machine learning tables and states when we knew what topics we were analyzing.
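
A minimal sketch of that worker loop is below.  It uses the pika RabbitMQ client, and the broker host name plus the analyze and save_to_table helpers are hypothetical placeholders (the real container code, linked below, is on GitHub).

import time
import pika

# Hypothetical broker address; the real worker reads its settings from the environment.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='brokerhost'))
channel = connection.channel()

# Step 1: learn which bucket queue this worker should serve.
my_queue = None
while my_queue is None:
    method, header, body = channel.basic_get(queue='roles', no_ack=True)
    if method is not None:
        my_queue = body.decode()
    else:
        time.sleep(1)

# Step 2: pull tasks from that queue, analyze them and save time-stamped results.
while True:
    method, header, body = channel.basic_get(queue=my_queue, no_ack=True)
    if method is None:
        time.sleep(1)                                  # queue is empty; wait and retry
        continue
    result = analyze(body)                             # placeholder for the task-specific analysis
    save_to_table(my_queue, result, time.time())       # placeholder for the Azure Table insert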

In this example, we are only interested in the basic scale-up performance improvement as we increase the number of workers assigned to each queue.    The Python code for the container is not pretty, but you  are welcome to read and use it.   It is on GitHub here.   To deploy the Docker container on DCOS one needs a deployment configuration json file.  This config.json is shown below.

{
   "container": {
      "type": "DOCKER",
      "docker": {
          "image": "escigrp/rabbitpullpush"
       }
    },
   "id": "worker",
   "instances": 1,
   "cpus": 0.2,
   "mem": 512,
}

Notice that this specifies that the container is in the Docker hub with the name escigrp/rabbitpullpush and that we wish to devote 0.2 cpus and 512 MB of memory to this resource.  And we want one instance.

The dcos command to launch this container in the cluster is

dcos marathon app add config.json

Our “worker” will immediately show up as a deployment on the DCOS Marathon web page.

We are going to measure the throughput of the system as the number of events per second it can process as we increase the number of workers per queue.   The way the experiment is done is as follows:

For N = 1 to 14:

  1. Preload each of the 4 queues (named “1”, “2”, “3”, “4”) with 500 messages and start up 4*N instances of the worker container with Marathon on DCOS.
  2. Load the “roles” queue with N instances of each queue name.   Each of the four queues will now have N devoted workers.  (A sketch of this preloading step appears after the list.)
  3. When all of the queues are empty, look in the table.   Subtract the earliest time stamp from the latest to get an approximation of the elapsed time.
  4. Use Marathon to shut down the workers and go to step 1.
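
Here is a rough sketch of the preloading in steps 1 and 2, again using the pika client.  The broker host, the message bodies and the value of N are placeholders; starting the 4*N worker containers is done separately through Marathon as described above.

import pika

N = 4                                     # workers per queue for this run
queues = ["1", "2", "3", "4"]
connection = pika.BlockingConnection(pika.ConnectionParameters(host='brokerhost'))
channel = connection.channel()
channel.queue_declare(queue='roles')
for q in queues:
    channel.queue_declare(queue=q)
    for i in range(500):                  # step 1: 500 tasks per bucket queue
        channel.basic_publish(exchange='', routing_key=q, body='task-%d' % i)
    for _ in range(N):                    # step 2: N role assignments per queue
        channel.basic_publish(exchange='', routing_key='roles', body=q)
connection.close()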

Recall that there are 8 dual-core nodes in the cluster.   Each instance of the worker container is allocated 0.2 of a core.   This means Marathon could possibly schedule 80 instances.   However, there are other processes running on the cluster, so a practical limit was 60.   In fact we tested up to 56 container instances (14*4).   The results are shown in Figure 6 below.

dcos6

Figure 6.   Events processed per second as the number of workers per queue grows from 1 to 14.

There are several surprises for me here.   First, the performance scales almost linearly as the number of container instances grows.   Because there are only 16 cores available I expected this to level off when N was near 8 (32 instances), but, with the exception of an anomaly around 13, it kept climbing to N = 14 (56 instances).    Second, the absolute performance is not very good.   Digging deeper into the code and conducting several additional experiments revealed that the bottleneck is the table insertion, due to an old and slow version of the Python library.  Without the table insertion a single worker container instance can pull events at a rate of about 20 events per second, so 56 instances would be over 1000 events/sec, which is well within the range of RabbitMQ.

Dynamic Scaling and Conclusion

A more interesting experiment would be to have the system described above dynamically scale the number of container instances as circumstances require.   For example, if one could monitor the depth of each queue, then when a queue starts to grow one could issue a command to increase the number of instances devoted to that queue, and when the queue is empty one could reduce the number of instances.    I am fairly certain there are a number of ways to do this, but one easy way is to use the “marathon update” command.   This command allows a “real-time” update to the json configuration.    Any field in the configuration can be modified.   For example, to update the configuration to 10 instances one can issue the command below.

dcos marathon app update worker env='{"instances":"10"}'

This change in status should trigger Marathon to make the necessary adjustments and change the number of instances to 10.   It would be relatively straightforward to write a program that would poll the event broker for status, check the current queue lengths and, depending on the conditions, issue the dcos command above.
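
A minimal sketch of such a polling loop is below.  It uses the pika client against a hypothetical broker host, applies an arbitrary policy of one worker per hundred queued messages, and issues the Marathon update through the dcos CLI as above (the exact update property syntax may vary between CLI versions).

import subprocess
import time
import pika

QUEUES = ["1", "2", "3", "4"]            # the bucket queues from the experiment
connection = pika.BlockingConnection(pika.ConnectionParameters(host='brokerhost'))
channel = connection.channel()

while True:
    backlog = 0
    for q in QUEUES:
        # a passive declare returns the current queue depth without changing the queue
        state = channel.queue_declare(queue=q, passive=True)
        backlog += state.method.message_count
    # arbitrary policy: one worker per 100 queued messages, at least one per queue
    target = max(len(QUEUES), backlog // 100)
    subprocess.call(["dcos", "marathon", "app", "update", "worker",
                     "instances=%d" % target])
    time.sleep(60)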

Final Thoughts.

It is great to see this container service based on Mesosphere’s DC/OS finally available in a reliable and highly usable form.   This is an excellent platform for managing large collections of Docker containers and orchestrating microservice deployments.   The performance of the system was excellent and the web user interface is well done.   The command line interface is solid and only gave me one problem:  installing the command line interface for Kafka caused problems on Windows and it did not follow the script here.  It seemed to be loading an old version that did not support Windows.   The other problem was that the DC/OS cluster I deployed on Azure had one public node, but the “public” IP address given for this node was not reachable.   (Any reader who knows how to address these problems please comment here and I will update this post.  As is often the case, there are easy solutions to problems that stump me.)

In a future post we will look at the Docker Swarm deployment that is also part of this new Azure release.

Fun with Recurrent Neural Nets: One More Dive into CNTK and TensorFlow

In a previous article I set about comparing Microsoft’s Computational Network Toolkit for deep neural nets to Google’s TensorFlow.  I concluded that piece with a deep dive into how recurrent neural nets (RNNs) are represented in each system.   I specifically went after the type of RNNs known by the strange name of Long Short-Term Memory (LSTM) networks.   I wanted to learn a bit more about how these systems worked.  I decided to treat them like laboratory specimens so that I could poke and prod them to see what I could learn and what I could get them to do.  This article is essentially my lab notebook.  Warning:  with the exception of a bit toward the end, this is not technically very deep.   In fact, I did not discover anything that has not been extensively reported on elsewhere.   But I learned a lot and had some fun.   Perhaps it will be of interest to students just starting to learn about this subject.   Before I get too far into this, I would like to mention that I recently discovered an excellent series of tutorials on RNNs by Denny Britz that is definitely worth reading.

CNTK’s LSTM and Hallucinating Bloomberg Financial News

One of the many good examples in CNTK is the language modeling exercise in Examples/Text/PennTreebank.   The documentation for this one is a bit sparse and the example is really just a demo of how easy it is to use their “Simple Network Builder” to define an LSTM network and train it with stochastic gradient descent on data from the Penn Treebank Project.   One command starts the learning:

cntk configFile=../Config/rnn.cntk

Doing so trains the network, tests it and saves the model.  However, to see the model data in an easily readable form you need a trivial addition to the config file: you need to add the following dumpnode command to put a dump file in a directory of your choosing.

dumpnode=[
    action = "dumpnode"
    modelPath = "$ModelDir$/rnn.dnn"
    outputFile = "$OutputDir$/modeltext/dump"
]

This creates a big text file with all the trained data.   To experiment with the trained model, I decided to load it into a python notebook and rebuild the LSTM network from the defining equations.  From the CNTK book those equations are

lstm_eqn
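
Since that image may be hard to read, here is my reconstruction of the LSTM equations that the code below implements (the peephole variant, with \odot denoting elementwise multiplication and \sigma the sigmoid):

i_t = \sigma(W_{XI}\,x_t + W_{HI}\,h_{t-1} + W_{CI}\odot c_{t-1} + b_I)
f_t = \sigma(W_{XF}\,x_t + W_{HF}\,h_{t-1} + W_{CF}\odot c_{t-1} + b_F)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{XC}\,x_t + W_{HC}\,h_{t-1} + b_C)
o_t = \sigma(W_{XO}\,x_t + W_{HO}\,h_{t-1} + W_{CO}\odot c_t + b_O)
h_t = o_t \odot \tanh(c_t)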

I was pleased to see that the dumped model text had the same W and b tensor names as in the equations, so my job was relatively easy.    I extracted each of the tensors and saved them into a file (I will make these available on GitHub).   The Python code for the LSTM based on the equations above is below.

def rnn(word, old_h, old_c):
    # look up the embedding vector for the input word
    Xvec = getvec(word, E)

    # input and forget gates (with peephole connections to the old cell state)
    i = Sigmoid(np.matmul(WXI, Xvec) +
                np.matmul(WHI, old_h) + WCI*old_c + bI)
    f = Sigmoid(np.matmul(WXF, Xvec) +
                np.matmul(WHF, old_h) + WCF*old_c + bF)

    # new cell state
    c = f*old_c + i*(np.tanh(np.matmul(WXC, Xvec) +
                             np.matmul(WHC, old_h) + bC))

    # output gate and new hidden state
    o = Sigmoid(np.matmul(WXO, Xvec) +
                np.matmul(WHO, old_h) + (WCO * c) + bO)

    h = o * np.tanh(c)

    # extract ordered list of five best possible next words
    q = h.copy()
    q.shape = (1, 200)
    output = np.matmul(q, W2)
    outlist = getwordsfromoutput(output)
    return h, c, outlist

As you can see, this is almost a literal translation of the equations.    The only difference is that this takes a text string for the input word, while the input to the equations is a vector encoding of the word.  The model generates the encoding matrix E, which has the nice property that the ith column of the matrix corresponds to the word in the ith position in the vocabulary list.  The function getvec(word, E) takes the embedding tensor E, looks up the position of the word in the vocabulary list and returns the column vector of E that corresponds to that word.   The output of one pass through the LSTM cell is the vector h.  This is a compact representation of the words likely to follow the input text to this point.  To convert this back into “vocabulary” space we multiply it by another trained matrix W2.  The size of our vocabulary is 10000 and the output vector is that length.  The ith element of output represents the relative likelihood that the ith word is the next word to follow the input so far.  Getwordsfromoutput simply returns the top 5 candidate words in order of likelihood.
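
Those two helper functions are not shown above; a minimal sketch consistent with this description (assuming vocab is the ordered vocabulary list recovered from the model dump) would be:

def getvec(word, E):
    # column i of the embedding matrix E is the embedding of the ith vocabulary word
    index = vocab.index(word)
    wordvec = E[:, index].copy()
    wordvec.shape = (E.shape[0], 1)        # return it as a column vector
    return wordvec

def getwordsfromoutput(output):
    # indices of the five largest scores in the 1 x 10000 output, most likely first
    best = np.argsort(output[0])[-5:][::-1]
    return [vocab[i] for i in best]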

Before going further, it is worth looking closer at the properties of the word embedding matrices E and W2.   There is a fascinating paper by Mikolov, Yih and Zweig entitled “Linguistic Regularities in Continuous Space Word Representations” in which they suggest that the embedding space for words has several interesting properties.   I decided to investigate that.   Their point is that words that are similar in a linguistic sense will be nearby in the embedding space.   For example, present tense verbs should be near other present tense verbs and singular nouns should be near each other, etc.   I decided to try that.  However, there are two embedding mappings:  one is based on the tensor E and the other on the W2 tensor.   E has dimension 150 by 10000 and W2 is 200 by 10000.  The difference in dimensionality is because of arbitrary decisions made in defining the hidden layers in the network.  But both represent word embeddings.  I experimented with both.  I wrote a function getnear(word, M) which takes a word and looks for the 5 most nearby words in the space, where M is the transpose of either E or W2. (I used cosine distance as the metric.) Verb tense locality and noun plurals worked best in the W2 space, as illustrated below.

rnn-embedding

These are only illustrations.  For a deeper statistical analysis look at the Mikolov paper.   A more interesting conjecture from their study was that there may be some linearity in these embeddings that might allow one to try simple analogies of the form “A is to B, as C is to __”.   Their idea is that if a, b and c are the vector embeddings of the words A, B and C, then the embedding of “__” may be computed as d = c + (b-a).  So I wrote a little function AistoBasCisto(A, B, C) that does this computation.   In the results I had to delete A, B and C from the candidate answers because they often came up as nearby.   In this case my results were less encouraging.  It worked better with the E space than with W2.   For example, for E we have

rnn-analogy1

And for the W2 space the results looked like

rnn-analogy2

As you can see the “run running walk __” example failed with E but was close, but still incorrect, with W2.

You may wonder why these particular words came up.  The data we used to train the system came from a small subset of the Penn TreeBank collection as provided in the CNTK package.   It is heavily dominated by financial news items.    This explains why the plural of person could be managers or customers.   A larger vocabulary and data collection would be needed to truly test the analogy by linearity conjecture.
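
For completeness, here is roughly how the two experiment functions can be written.  This is a sketch rather than the exact code from the repository: the embedding matrix (the transpose of E or W2) is passed in explicitly, cosine distance is the metric, and it reuses the vocab list assumed above.

def cosdist(a, b):
    # cosine distance between two embedding vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def getnear(word, M):
    # row i of M is the embedding of the ith vocabulary word
    v = M[vocab.index(word)]
    dists = [cosdist(v, M[i]) for i in range(len(vocab))]
    order = np.argsort(dists)
    return [vocab[i] for i in order[1:6]]          # skip the word itself

def AistoBasCisto(A, B, C, M):
    # "A is to B as C is to __":  d = c + (b - a), then report the words nearest to d
    a, b, c = (M[vocab.index(w)] for w in (A, B, C))
    d = c + (b - a)
    dists = [cosdist(d, M[i]) for i in range(len(vocab))]
    candidates = [vocab[i] for i in np.argsort(dists)]
    return [w for w in candidates if w not in (A, B, C)][:5]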

Now on to hallucinating the financial news.

Now to test the LSTM as a truly recurrent network.   We provide the network with a starting word and let it suggest the next word.  Then we repeat this process, constructing a “sentence”.  In the code below we randomly pick one of the top three words suggested by the network as the next word.

c = np.zeros(shape = (200, 1))
h = np.zeros(shape = (200, 1))
output = np.zeros(shape = (10000, 1))
word = 'my'
sentence= word 
for _ in range(40):
    h, c, outlist = rnn(word, h, c)
    word = outlist[randint(0,3)]
    sentence = sentence + " " +word
print sentence+"."

In this case we start with the word “my” and let it generate a 40 word sentence.  The output is

my new rules which would create an interest position here unless there should prove   signs of such things too quickly although the market could be done better toward paying further volatility where it would pay cash around again if everybody can.

This is a great example of hallucinating financial news. Let’s try it again starting with the word “president”.

president michael de brown wrote himself against democratic union law which represents an emergency relief agreement during a new york state district or early tuesday before july after a federal government agency created early losses without mr. krenz or perhaps.

Now with the word “the”.

the company reported third-quarter results reflecting a number compared between N barrels including pretax operating loss from a month following fiscal month ending july earlier compared slightly higher while six-month cds increased sharply tuesday after an after-tax loss reflecting a strong.

The “sentences” end rather abruptly because of the 40-word limit I set.  If you let it go it will run on until the state vector for the sentence seems to break down.     Try this yourself.  To make it easy to play with this example, I have put the code in GitHub.  The trained model text files are on OneDrive as a zipped file of about 50MB.

There are many more excellent and fun examples.  Andrej Karpathy has a great blog article showing how RNNs can mimic Shakespeare, or LaTeX science articles, and much more.

TensorFlow’s seq2seq French Lesson.

One of the most interesting examples in the TensorFlow tutorials is an English to French translator.  As with the CNTK example it was trivial to start the translator learning by following the instructions in the tutorial.   After letting this run for about a week, I wanted to see how well it would do.     As with the CNTK example, I created a Jupyter IPython notebook and loaded the trained model.   I will explain how that was done in more detail below but, for now, I will show how we can invoke it to test its translation ability.    This particular trained model was not very big and used a relatively small data set, so I didn’t expect much.    In fact, as you will see, to a French speaker it is a disaster.   On the other hand, it learned more French in a week of training than I did in three semesters of French in college.   (For full disclosure, this was my weakest subject in college and my grade was a hard-fought “C” each semester.)

The code below demonstrates how the model is invoked.   First you have to tokenize the input sentence.  The algorithm uses a system of buckets of fixed sizes to make the training more efficient.  You next find the smallest bucket that can contain your sentence and convert this to the input vector list needed by the model.   The step function takes a Tensorflow session, the input vector list and a null list of decoder inputs (to be explained later) and generates a list of vectors as outputs.  Each vector represents the likelihood that individual vocabulary words are the correct word at that point in the translated sentence.   We pick the most likely and print the sentence.

sentence = " I am not the president of France. "

token_ids = data_utils.sentence_to_token_ids(sentence, en_vocab)
      # Which bucket does it belong to?
bucket_id = min([b for b in xrange(len(_buckets))
                 if _buckets[b][0] > len(token_ids)])
      # Get a 1-element batch to feed the sentence to the model.
encoder_inputs, decoder_inputs, target_weights = \
    model.get_batch({bucket_id: [(token_ids, [])]}, bucket_id)

_, _, output_logits = model.step(sess, encoder_inputs, decoder_inputs, 
                                 target_weights, bucket_id, True)
		  
outputs = [int(np.argmax(logit, axis=1)) for logit in output_logits]
print(" ".join([rev_fr_vocab[output] for output in outputs]))

Je ne suis pas le président de la France .

This example is not too bad.  However, if I ask

“In which city does the president of France live?”

I get

“Dans quelle ville le président de la France ?”.

This is not exactly correct.    If I feed this into Google translate and ask what this means in English I get “In which city the President of France?”.   If I give it this one,

What is the name of a good restaurant?

The system responds with

Quel est le nom d’une bonne bonne bonne ?”

Which translates back to “What is the name of a good good good?”.  Probably not very helpful on the streets of Paris.   It turns out restaurant is not in the tiny training vocabulary used here.   Finally, given this sentence

” The article stated that the President of the United States is here today. “

The translator returned

Le paragraphe a indiqué que le président des États-Unis est aujourd ‘ hui aujourd ‘ hui .”

The end of this reply is “is today today”.    As I said, this is still much better than I could do with my college French.   However, as you can see from the previous two examples, our little translator runs out of gas at the end of sentences and tends to repeat itself.   You should try this yourself.   I have put the notebook file in GitHub, or you can execute these directly from the TensorFlow Python code.   All you need to do is train the model with TensorFlow and run the notebook with the path to the model output directory.

While loading and using the trained model was easy and fun, understanding the seq2seq model used in this example takes a bit of work.   So this part of this article will get a bit more technical.

The TensorFlow translate program is based on a sequence-to-sequence model constructed from more primitive recurrent neural nets.   By sequence-to-sequence we mean a network that takes a sequence as input and produces a sequence as output.   It consists of two parts: an “encoder RNN” and a “decoder RNN” as shown in Figure 1 below.

seq2seq

Figure 1.   A sequence-to-sequence RNN English to French translator with the encoder and decoder unrolled to show the flow of messages.

In this figure the RNNs are “unrolled” to show the flow of messages.  The state vector at the end of the encoder is a vector embedding of the input sentence.   This state vector is used to start the decoder, along with a “GO” token.  The diagram shows the network after it has been trained.   During training the inputs to the decoder are the French versions of the English sentences.   I won’t talk about the training here because it is enough of a challenge just to understand how the trained network works.  Before I go any further I want to point you to some important papers.  Sutskever, Vinyals and Le published an early important paper on sequence-to-sequence models that is worth reading.

To understand how the network is built we need to dig into the code a bit. The building blocks are a set of classes of base type RNNCell with specializations

  1. BasicRNNCell
  2. GRUCell
  3. BasicLSTMCell
  4. LSTMCell
  5. OutputProjectionWrapper
  6. InputProjectionWrapper
  7. EmbeddingWrapper
  8. MultiRNNCell

The ones we will see used here are GRUCell, MultiRNNCell and EmbeddingWrapper.   We discussed LSTMCell in our previous article, but we need to look at GRUCell here because that is the one used in the example.   The GRUCell is a “Gated Recurrent Unit” invented by Cho et al.  in “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”.  The “gated” phrase comes from the way the output is defined as coming mostly from the previous state or from a combination of the previous state with the new input.   The diagram below tries to explain this a bit better.

gru-pic

Figure 2.   GRU wiring diagram

It also helps to see it in terms of the defining equations.

gru-eq1
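
In case that image is hard to read, my reconstruction of the GRU equations, matching the GRUCell code shown further below (with \odot denoting elementwise multiplication), is:

u_t = \sigma(W_u\,x_t + U_u\,h_{t-1} + b_u)
r_t = \sigma(W_r\,x_t + U_r\,h_{t-1} + b_r)
c_t = \tanh(W_c\,x_t + U_c\,(r_t \odot h_{t-1}) + b_c)
h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t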

The quantity ut is a gate vector.  Recall that the sigmoid function switches sharply between one and zero.  So when ut is one, the new h is just a copy of the old h and we are ignoring the input x; when ut is zero, the new h is based on the candidate value ct.   The gate rt determines how much of the old state goes into defining the value of ct.   To understand how this is encoded in TensorFlow you need to understand the function

linear(args, output_size, bias, bias_start=0.0, scope=None)

where args is a list of tensors, each of size batch x n.   Linear computes sum_i(args[i] * W[i]) + bias, where W is a list of matrix variables of size n x outputsize and bias is a variable of size outputsize.   In the equations above we have represented the linear algebra as a matrix times a column vector.   TensorFlow uses the transpose notation:   a row vector on the left times the transpose of the matrix.   So in linear the args are a list of row vectors.   Where are the matrix W and offset b?  They are fetched from memory based on the current variable scope, because W and b are variable tensors that are learned values.   If you look at the first two equations above, you will see they are almost identical.   In fact, we can write them as

gru-eq2

If you transpose the last one from column form into row form you can now compute both with one invocation of the linear function.   The code for the GRUCell is below.   As you can see, they have encoded one pass through the GRU cell with only two matrix-vector multiplies.   You can also see the way the variable scope is used to pick out the W’s for the gates and the W for the state/output.  Another point to remember is that an invocation of the __call__ operator does not cause the tensor to execute the operation; rather it builds the graph.

class GRUCell(RNNCell):
  def __init__(self, num_units):
    self._num_units = num_units
   ... stuff deleted ....
  def __call__(self, inputs, state, scope=None):
    with vs.variable_scope(scope or type(self).__name__):  
      with vs.variable_scope("Gates"):  # Reset gate and update gate.
        # We start with bias of 1.0 to not reset and not udpate.
        r, u = array_ops.split(1, 2, linear([inputs, state],  
                               2 * self._num_units, True, 1.0))
        r, u = sigmoid(r), sigmoid(u)
      with vs.variable_scope("Candidate"):
        c = tanh(linear([inputs, r * state], self._num_units, True))
      new_h = u * state + (1 - u) * c
    return new_h, new_h

The top level class we invoke for building our model is Seq2SeqModel.    When we create an instance of this class it sets in motion a set of flow-graph building steps.  I am going to skip over a lot of stuff and try to give you the big picture.  The first graph building step in the initialization of an instance of this object is

# Create the internal multi-layer cell for our RNN.
    single_cell = rnn_cell.GRUCell(size)
     …
    if num_layers > 1:
      cell = rnn_cell.MultiRNNCell([single_cell] * num_layers)

As you can see, we are creating a GRU cell graph generator instance, making a list of num_layers copies of this object and passing that to the constructor for MultiRNNCell.   In our case, num_layers has been set to 2.   MultiRNNCell is pretty easy to understand.   It builds a graph consisting of a stack of (in this case) GRU cells where the output state vector of each level is fed to the input of the level above it.  The output of this new compound cell is the state of the top sub-cell, and its output state is the concatenation of the output states of all the sub-cells.

The next part is not so easy to follow.    We will take our MultiRNNCell graph builder and use it to create an encoder and a special decoder.    But first we must make a short digression.

Paying Attention

There is a problem that is encountered in the sequence-to-sequence model.   The encoder encodes the entire sentence into a state vector which is used by the decoder as its input.    That state vector is an abstract representation of our entire sentence as a single point in a very high dimensional space.    The decoder has been trained to use that point as a starting point to unroll a translated version of the sentence.   I find the fact that it works at all to be rather remarkable.   It is as if the decoder takes the English state vector and transforms it into a similar point in “French” space.

Unfortunately, the longer the input sentence, the more difficult it is to decode it.   How much information can we pack into one point?   The problem is that at each decoding step we need a little bit more information than is provided by the state vector as it passes through the decoder loop.   The idea used here is to help the decoder by providing it a bit of focus derived from the input sequence at each stage of the decoder loop.  This is generally referred to as “attention”, as in “at this step of decoding please pay attention to what the encoder was doing here”.   Bahdanau, Cho and Bengio had an early paper about this that used a bidirectional pass over the input sequence. As they put it, they wanted to “automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word”.  (Denny Britz has a lovely blog article about attention that describes several fascinating applications.  It is well worth reading.) The mechanism for attention used in the TensorFlow example is based on a paper by Vinyals et al., and we will follow that one here.   The key idea is that rather than take the single final state vector from the encoder, we collect the state vectors at each stage of the encoder.   Following Vinyals, let the encoder state vectors for each input word be

atten1

And let the decoder state vectors be

atten2

Then for each decoder time step t compute

atten3

Where the Ws are learned matrices and v is a learned vector.    Then as the input to the t+1 state vector of the decoder we use the concatenation

atten4
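
In symbols, my reconstruction of the formulas in the images above, following the Vinyals et al. paper (the W' matrices and the vector v are learned), is:

u^t_i = v^{\top}\tanh(W'_1 h_i + W'_2 d_t)
a^t_i = \mathrm{softmax}(u^t_i)
d'_t = \sum_i a^t_i h_i

and the input used for decoder step t+1 is the concatenation [d_t, d'_t].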

The idea is that this new state vector at time t+1 puts much more focus on the corresponding words in the encoder string. This all happens in a function called seq2seq.attention_decoder that is called in another constructor function, seq2seq.embedding_attention_seq2seq, that wraps an embedding around a graph generated by our MultiRNNCell graph builder to generate the final decoder graph.   These graphs are all stitched together in the Seq2SeqModel constructor.  It is fair to say that there are many levels of abstraction here that are used to build the decoder and link it to the encoder.  I am leaving out many details that are critical for the training, such as the part that implements the bucket handler.   The final graph, in its most abstract form, is pictured below in Figure 3.

seq2seq_final

Figure 3.  The Translate.py sequence-to-sequence translator is based on a two-level GRU cell encoder and an attention-augmented two-level GRU cell decoder.  The input English is entered in reverse order as an optimization.

Final Thoughts

As I have said above, I have not included all the details of how the seq2seq translator is put together, but I tried to include the highlights that I found most interesting.   I encourage you to dive into the code and discover the rest.   You will likely find some errors in what I described above.   If so, please let me know.

There are really a lot of exciting results that have come out in the last few years relating to RNNs.   For example, Lei Ba, Mnih and Kavukcuoglu demonstrated that RNNs with attention can be applied to interesting image analysis challenges, such as reading the house number from a street scene.   In their excellent paper “Teaching Machines to Read and Comprehend”, Hermann et al. demonstrate the use of an attentive RNN built to answer simple questions about text.   I personally don’t think any RNN can pass a Turing test yet, so it ain’t A.I.  But these little statistical machines are certainly wonderful mimics and they can speak better French than I can.

TensorFlow Meets Microsoft’s CNTK

Update Nov 10, 2016.   Microsoft now has  a new release of CNTK.  We have a post now that provides a quick look at this new version.  Go read that one instead of this one.

Update Oct 25, 2016.    This post describes the early version of CNTK.  Microsoft just released a very nice new version called the Cognitive Toolkit.   I would not base your impression of CNTK on the following post.   I’ll update this as soon as I have time.

———————–

CNTK is Microsoft’s Computational Network Toolkit for building deep neural networks and it is now available as open source on GitHub.   Because I recently wrote about TensorFlow I thought it would be interesting to study the similarities and differences between these two systems.   After all, CNTK seems to be the reigning champ of many of the image recognition challenges.   To be complete I should also look at Theano, Torch and Caffe.   These three are also extremely impressive frameworks.   While this study will focus on CNTK and TensorFlow I will try to return to the others in the future.   Kenneth Tran has a very nice top-level (but admittedly subjective) analysis of all five deep learning toolkits here.  This will not be a tutorial about CNTK or TensorFlow.  Rather my goal is to give a high level feel for how they compare from the programmer’s perspective.  This is not a performance analysis, but rather a programming model analysis.  There is a lot of code here, so if you don’t like reading code, skip to the conclusions.

CNTK has a highly optimized runtime system for training and testing neural networks that are constructed as abstract computational graphs.   In that sense, CNTK is very much like TensorFlow.   However, there are some fundamental differences.   To illustrate these features and differences I will take two standard examples that are included with both systems and work through the approach taken by each system.   The first example is a not-too-deep convolutional neural net solution to the standard MNIST handwritten digit recognition example.  I will conclude with some comments about how they differ in their approach in the case of recurrent neural networks.

Both TensorFlow and CNTK are basically script-driven.   By this I mean that the construction of the neural network flow graph is described in a script and the training is done using some very clever automated processes.   In the case of TensorFlow the script is embedded in the Python language and Python operators can be used to control the flow of execution of the computational graph.   CNTK does not currently have a Python or C++ binding (though one is promised), so the control flow of the execution of the training and testing is highly choreographed.    As I will show, this is not as much of a limitation as it sounds.   There are actually two scripts associated with a CNTK network:  a configuration file that controls the training and test parameters and a network definition language file for constructing the network.

I’ll start with the description of the neural network flow graph because that is where the similarity to TensorFlow is the greatest.   There are two ways to define the network in CNTK.  One approach is to use the “Simple Network Builder” that will allow you to create some simple standard networks by specifying only a few parameter settings.    The other is to use their Network Definition Language (NDL).   The example here (taken directly from their download package in Github) uses NDL.    Below is a slightly abbreviated version of the Convolution.ndl file. (I have used commas to put multiple lines on one line to fit the page better.)

CNTK network graphs have a set of special nodes.  These are FeatureNodes and LabelNodes that describe the inputs and training labels,  CriterionNodes and EvalNodes that are used for training and result evaluation, and OutputNodes that represent the outputs of the network.   I will describe these below as we encounter them.   At the top of the file we have a set of macros that are used to load the data (features) and labels.   As can be seen below, we read images of the MNIST digits as features, which are now arrays of floating point numbers that we have scaled by a small scalar constant.   The resulting array “featScaled” will be used as input to the network.

load = ndlMnistMacros

# the actual NDL that defines the network
run = DNN

ndlMnistMacros = [
    imageW = 28, imageH = 28
    labelDim = 10

    features = ImageInput(imageW, imageH, 1)
    featScale = Const(0.00390625)
    featScaled = Scale(featScale, features)
    labels = Input(labelDim)
]

DNN=[
    # conv1
    kW1 = 5, kH1 = 5
    cMap1 = 16
    hStride1 = 1, vStride1 = 1
    conv1_act = ConvReLULayer(featScaled,cMap1,25,kW1,kH1,hStride1,vStride1,10, 1)

    # pool1
    pool1W = 2, pool1H = 2
    pool1hStride = 2, pool1vStride = 2
    pool1 = MaxPooling(conv1_act, pool1W, pool1H, pool1hStride, pool1vStride)

    # conv2
    kW2 = 5, kH2 = 5
    cMap2 = 32
    hStride2 = 1, vStride2 = 1
    conv2_act = ConvReLULayer(pool1,cMap2,400,kW2, kH2, hStride2, vStride2,10, 1)

    # pool2
    pool2W = 2, pool2H = 2
    pool2hStride = 2,  pool2vStride = 2
    pool2 = MaxPooling(conv2_act, pool2W, pool2H, pool2hStride, pool2vStride)

    h1Dim = 128
    h1 = DNNSigmoidLayer(512, h1Dim, pool2, 1)
    ol = DNNLayer(h1Dim, labelDim, h1, 1)

    ce = CrossEntropyWithSoftmax(labels, ol)
    err = ErrorPrediction(labels, ol)

    # Special Nodes
    FeatureNodes = (features)
    LabelNodes = (labels)
    CriterionNodes = (ce)
    EvalNodes = (err)
    OutputNodes = (ol)
]

The network is defined in the block DNN.   The network consists of two convolutional-maxpooling layers followed by an all-to-all standard network with one hidden layer of 128 nodes.

In convolutional layer one we have 5×5 convolutional kernels and we specify 16 of these (cMap1) for the parameter space.   The operator ConvReLULayer is actually a shorthand for another subnetwork defined in a macro file.

Algebraically we would like to represent the parameters of the convolution as a matrix W and a scale vector B so that if the input is X, the output of our network layer is of the form output = f(op(W, X) + B).   In this case the operator op is convolution and f is the standard relu function relu(x)=max(x,0).

The NDL code for the ConvReLULayer is given by

ConvReLULayer(inp, outMap, inWCount, kW, kH, hStride, vStride, wScale, bValue) = 
[
    convW = Parameter(outMap, inWCount, init="uniform", initValueScale=wScale)
    convB = Parameter(outMap, 1,        init="fixedValue", value=bValue)
    conv = Convolution(convW, inp, kW, kH, outMap, hStride,vStride,
                zeroPadding=false)
    convPlusB = Plus(conv, convB);
    act = RectifiedLinear(convPlusB);
]

The W matrix and B vector are defined as Parameters and they will be the entities that are given an initial value and then modified during training to define the final model.   In this case convW is a matrix with 16 rows and 25 columns and convB is a vector of length 16.   Convolution is a built-in function that has been set to not use zero padding.  This means that convolution over the 28×28 image will be centered on the 24 by 24 interior region and the result will be 16 variations of a 24×24 output pseudo-image.

We next apply Maxpooling based on 2×2 regions and the result is now 12×12 by 16.

convo-nn-cntk

For the second convolutional layer we up the number of convolutional filters from 16 to 32.  This time we have 16 channels of input, so the size of the W matrix is 32 rows by 25×16 = 400 columns and the B vector for this layer is 32 long.   The convolution is now over the interior of the 12×12 frames, so it is of size 8×8 and we have 32 copies.    The second maxpooling step takes us to 32 frames of 4×4, for a result of size 32*16 = 512.

The final layers take the 512-element maxpooling output through a hidden layer of 128 nodes to a final 10-node output, defined by the two operators below.

DNNSigmoidLayer(inDim, outDim, x, parmScale) = [
    W = Parameter(outDim, inDim, init="uniform", initValueScale=parmScale)
    b = Parameter(outDim, 1,     init="uniform", initValueScale=parmScale)
    t = Times(W, x)
    z = Plus(t, b)
    y = Sigmoid(z)
]

DNNLayer(inDim, outDim, x, parmScale) = [
    W = Parameter(outDim, inDim, init="uniform", initValueScale=parmScale)
    b = Parameter(outDim, 1,     init="uniform", initValueScale=parmScale)
    t = Times(W, x)
    z = Plus(t, b)
]

As you can see these are defined by the standard linear algebra operators as W*x+b.

The final part of the graph definition is the cross entropy and error nodes followed by a binding of these to the special node names.

We will define the training process soon, but first it is fun to compare this to the construction of a very similar network in TensorFlow.   We described this in a previous post but here it is again.   Notice that we have the same set of entities as we did with CNTK, except they are called variables here and parameters in CNTK.   The dimensions are also slightly different.  The convolutional filters are 5×5 in both cases, but we have 16 copies in the first stage and 32 in the second in CNTK, and 32 in stage one and 64 in stage two in the TensorFlow example.

def weight_variable(shape, names):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial, name=names)

def bias_variable(shape, names):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial, name=names)

x = tf.placeholder(tf.float32, [None, 784], name="x")
keep_prob = tf.placeholder(tf.float32)   # dropout keep probability, used in the dropout step below

sess = tf.InteractiveSession()

W_conv1 = weight_variable([5, 5, 1, 32], "wconv")
b_conv1 = bias_variable([32], "bconv")
W_conv2 = weight_variable([5, 5, 32, 64], "wconv2")
b_conv2 = bias_variable([64], "bconv2")
W_fc1 = weight_variable([7 * 7 * 64, 1024], "wfc1")
b_fc1 = bias_variable([1024], "bfcl")
W_fc2 = weight_variable([1024, 10], "wfc2")
b_fc2 = bias_variable([10], "bfc2")

The network construction is also almost identical.

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

#first convolutional layer
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
#second convolutional layer
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
#final layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

The only differences are that the convolutional operators here are defined with padding, so the output of the first convolutional operator has dimensions of 28 by 28, followed by a pooling reduction to 14 by 14.  The second convolutional operator and max pooling reduce this to 7×7, so the input to the final layer is 7x7x64 = 3136 with 1024 hidden nodes (with a relu instead of a sigmoid function).   (For training purposes the last stage uses a probabilistic dropout function that randomly sets values to zero.   If keep_prob = 1, this is a no-op.)

convolutional

Network Training

The way network training is specified in CNTK differs substantially from the TensorFlow approach.  The training and testing are specified in a file called convolution.config.  Both CNTK and TensorFlow use a symbolic analysis of the flow graph to compute the gradient of the network for use in gradient descent training algorithms.  The CNTK team has a very nice "book" that describes a great deal about how the gradients are computed.   Currently CNTK only supports one learning method: mini-batch stochastic gradient descent, but they promise to add more in the future.  He, Zhang, Ren and Sun have a lovely paper that describes how they train extremely deep (up to 1000 layers) networks using a nested residual reduction method reminiscent of algebraic multi-grid, so it will be interesting to see if that method makes its way into CNTK.  An abbreviated version of the config file is shown below.

command = train:test
modelPath = "$ModelDir$/02_Convolution"
ndlMacros = "$ConfigDir$/Macros.ndl"

train = [
    action = "train"
    NDLNetworkBuilder = [
        networkDescription = "$ConfigDir$/02_Convolution.ndl"
    ]

    SGD = [
        epochSize = 60000
        minibatchSize = 32
        learningRatesPerMB = 0.5
        momentumPerMB = 0*10:0.7
        maxEpochs = 15
    ]

    reader = [
        readerType = "UCIFastReader"
        file = "$DataDir$/Train-28x28.txt"

        features = [
            dim = 784
            start = 1
        ]

        labels = [
		    # details deleted
        ]
    ]
]
test = [
   ….
]

The command line indicates the sequence to follow:  train then test.   Various file paths are resolved and then the train block specifies the network to be trained and the parameters for the stochastic gradient descent (SGD).  A reader block specifies the way the "features" and "labels" from the network NDL file are read.   A test block is also included to define the parameters of the test.

Running this on a 16-core (non-GPU) Linux VM took 62.95 real-time minutes for the train and test, with 999.01 minutes of user time and 4 minutes of system time.    The user time indicates that all 16 cores were kept very busy (999/63 = 15.85).   Of course this means little, as CNTK is designed for parallelism and massive GPU support is its true design point.

The training used by TensorFlow is specified much more explicitly in the Python control flow.   However, the algorithm is also a gradient based method, called Adam, introduced by Kingma and Ba.    TensorFlow has a number of gradient based optimizers in the library, but I did not try any of the others.

As can be seen below, the cross_entropy is defined in the standard way and fed to the optimizer to produce a “train_step” object.

# (mnist below is the dataset object from the standard TensorFlow MNIST
#  tutorial, loaded earlier with input_data.read_data_sets.)
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())
for i in range(20000):
  batch = mnist.train.next_batch(50)
  train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

Then for 20000 iterations the python program grabs a batch of 50 and runs the train_step with 50% random dropout.  The test step is to evaluate the accuracy subgraph on the entire test set.

Aside from the magic of the automatic differentiation and the construction of the Adam optimizer trainer, this is all very straightforward.   I also ran this on the same 16 core server with the same data as was used for the CNTK case.   Much to my surprise, the real time was almost exactly the same as CNTK: 62.02 minutes, with 160.45 minutes of user time, so much less parallelism was exploited.    I don't believe these numbers mean much.   Both CNTK and TensorFlow are designed for large scale GPU execution and they are not running exactly the same training algorithm.

Recurrent Neural Nets with CNTK and TensorFlow

Recurrent Neural Networks (RNNs) are widely used in language modeling, such as predicting the next word you are going to type when texting, or in automatic translation systems.   (See Andrej Karpathy's blog for some great examples.) It is really a lovely idea. The input to the system is a word (or set of words) along with the state of the system based on the words seen so far, and the output is a predicted word list and a new state of the system, as shown in Figure 1.

rnn_basic

Figure 1.

There are, of course, many variations of the basic RNN.   One of the most popular is the Long Short-Term Memory (LSTM) version that is defined by the equations

lstm_eqn

Figure 2. LSTM Equations (taken from the CNTK book)

where σ is the sigmoid function.
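Written out, the equations are (my transcription of the standard peephole-LSTM formulation, which is what the NDL code later in this section computes; the CNTK book's notation differs slightly in detail):

i_t = σ(W_xi·x_t + W_hi·h_{t−1} + W_ci ⊙ c_{t−1} + b_i)
f_t = σ(W_xf·x_t + W_hf·h_{t−1} + W_cf ⊙ c_{t−1} + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_xc·x_t + W_hc·h_{t−1} + b_c)
o_t = σ(W_xo·x_t + W_ho·h_{t−1} + W_co ⊙ c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes element-wise multiplication.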

If you want to read a great blog article about LSTMs and how they work, I recommend this one by Christopher Olah.   In fact, he has a diagram that makes it a bit easier to see the flow of the equations above.  I had to modify it a tiny bit to fit the CNTK version of the equations and the result is shown in Figure 3.

lstm_fig

Figure 3.  Adapted from Christopher Olah’s excellent article.

The notation in the picture uses sigmoid and tanh boxes and concatenated variables to represent this expression.

σ( W·[h_{t−1}, x_t] + b )

As can be seen, this is the form of the equations in figure 2 where the Ws and the bs are the learned weights.

CNTK version

Below is the network definition language specification for the LSTM graph.   There are two things to notice here.    The first is the way the recurrence is handled directly in the network using a delay operator called "PastValue" that takes a variable, its dimension and a time delay value and returns a buffered copy of that value.   The second thing to see is the way the W matrix is handled and how it differs from the concatenated operator described above and in Figure 3.   Here they "stack" all the Ws that belong to x and all the Ws that belong to h, along with a stack of b values.   They then compute one W*x and one W*h, add them, and then add b.   They then use a row slice operator to pull the result apart for use in the separate sigmoid functions.   Also note that they use the fact that the Ws for c are all diagonal matrices.

LSTMPComponent(inputDim, outputDim, cellDim, inputx, cellDimX2, cellDimX3, cellDimX4) = [
        wx = Parameter(cellDimX4, inputDim,  init="uniform", initValueScale=1);
        b = Parameter(cellDimX4,  1,         init="fixedValue", value=0.0);
        Wh = Parameter(cellDimX4, outputDim, init="uniform", initValueScale=1);

        Wci = Parameter(cellDim, init="uniform", initValueScale=1);
        Wcf = Parameter(cellDim, init="uniform", initValueScale=1);
        Wco = Parameter(cellDim, init="uniform", initValueScale=1);

        dh = PastValue(outputDim, output, timeStep=1);
        dc = PastValue(cellDim, ct, timeStep=1);

        wxx = Times(wx, inputx);
        wxxpb = Plus(wxx, b);
        
        whh = Times(Wh, dh);

        wxxpbpwhh = Plus(wxxpb,whh)
        
        G1 = RowSlice(0, cellDim, wxxpbpwhh)
        G2 = RowSlice(cellDim, cellDim, wxxpbpwhh)
        G3 = RowSlice(cellDimX2, cellDim, wxxpbpwhh);
        G4 = RowSlice(cellDimX3, cellDim, wxxpbpwhh);

        Wcidc = DiagTimes(Wci, dc);
        it = Sigmoid (Plus ( G1, Wcidc));

        bit = ElementTimes(it, Tanh( G2 ));

        Wcfdc = DiagTimes(Wcf, dc);
        ft = Sigmoid( Plus (G3, Wcfdc));

        bft = ElementTimes(ft, dc);

        ct = Plus(bft, bit);

        Wcoct = DiagTimes(Wco, ct);
        ot = Sigmoid( Plus( G4, Wcoct));

        mt = ElementTimes(ot, Tanh(ct));

        Wmr = Parameter(outputDim, cellDim, init="uniform", initValueScale=1);
        output = Times(Wmr, mt); 
    ]

The TensorFlow version

The TensorFlow version of the LSTM recurrent neural network is very different from the CNTK version.   While they both execute the same underlying set of equations, the TensorFlow representation makes heavy use of the Python control flow.    The conceptual model is simple.  We create an LSTM cell and define a "state" which is input to the cell and also an output.   In pseudo code:

cell = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])

for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = cell(current_batch_of_words, state)

This is a nice pseudo code version of Figure 1, taken from the tutorial.   The devil is in the very subtle details.   Remember that most of the time Python code in TensorFlow is about building the flow graph, so we have to work a bit harder to build the graph with the cycle that we need to train and execute.

It turns out that the greatest challenge is defining how we can create and reuse the weight matrices and bias vectors inside a graph with a cycle.   CNTK uses the operator "PastValue" to create the needed cycle in the graph.  TensorFlow uses the literal recurrence above and a very clever variable save and recall mechanism to accomplish the same thing.   The moral equivalent of "PastValue" in TensorFlow is a function called tf.get_variable("name", size, initializer=None) whose behavior depends upon a flag called "reuse" associated with the current variable scope.  If reuse==False and no variable already exists by that name in this scope, then get_variable returns a new variable with that name and uses the initializer to initialize it; otherwise it returns an error.  If reuse==True, then get_variable returns the previously existing variable by that name; if no such variable exists, it returns an error.
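A tiny illustration of this behavior (a sketch with made-up scope and variable names, using the variable_scope API of that era):

with tf.variable_scope("layer"):
    W = tf.get_variable("W", [4, 4],
                        initializer=tf.constant_initializer(0.0))  # created here (reuse is False)
with tf.variable_scope("layer", reuse=True):
    W2 = tf.get_variable("W", [4, 4])    # returns the existing "layer/W" variable
print(W.name)                            # "layer/W:0"
print(W2.name)                           # also "layer/W:0" -- the same underlying variable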

To illustrate how this is used, below is a simplified version of one of the functions in TensorFlow used to create the argument of the sigmoid functions in the equations above.  It is just a version of W*x+b where x is a list [a, b, c, …].

def linear(args, output_size, scope=None):
  # Linear map: sum_i(args[i] * W[i]), where W[i] is a variable.
  with vs.variable_scope(scope):
    matrix = vs.get_variable("Matrix", [total_arg_size, output_size])
    res = math_ops.matmul(array_ops.concat(1, args), matrix)
    bias_term = vs.get_variable(
        "Bias", [output_size],
        initializer=init_ops.constant_initializer(1.))
  return res + bias_term

Now to define the BasicLSTMCell we can write it roughly as follows.   (To see the complete versions of these functions look at rnn_cell.py in the TensorFlow Github repository.)

class BasicLSTMCell(RNNCell):
  def __call__(self, inputs, state, scope=None):
    with vs.variable_scope(scope):
      c, h = array_ops.split(1, 2, state)
      concat = linear([inputs, h], 4 * self._num_units)
      i, j, f, o = array_ops.split(1, 4, concat)
      new_c = c * sigmoid(f) + sigmoid(i) * tanh(j)
      new_h = tanh(new_c) * sigmoid(o)
    return new_h, array_ops.concat(1, [new_c, new_h])

As you can see, this is a fairly accurate rendition of the diagram in Figure 3.  You will notice that the operator split above is the counterpart to the RowSlice operation in the CNTK version.

We can now create an instance of a recurrent neural network that can be used for training, and, using the same variable scope, we can create another one for testing that shares the same W and b variables.  The way this is done is shown in ptb_word_lm.py in the TensorFlow tutorials for recurrent neural nets.  There are two additional points worth observing.  (I should say they were critical for me to understand this example.)   The tutorial creates a class PTBModel that can be used to build the networks for training and test.

class PTBModel:
  def __init__(self, is_training, num_steps):
    self._input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
    self._targets = tf.placeholder(tf.int32, [batch_size, num_steps])
    cell = rnn_cell.BasicLSTMCell(size, forget_bias=0.0)
    outputs = []
    states = []
    state = self._initial_state
    with tf.variable_scope("RNN"):
      for time_step in range(num_steps):
        if time_step > 0: 
            tf.get_variable_scope().reuse_variables()
        (cell_output, state) = cell(inputs[:, time_step, :], state)
        outputs.append(cell_output)
        states.append(state)
        … many details omitted …

Where this is used is in the main program, where we create a training instance and a test instance (actually there is a third instance which I am skipping to keep this as simple as possible).

with tf.variable_scope("model", reuse=None, initializer=initializer):
  m = PTBModel(is_training=True, num_steps=20)
with tf.variable_scope("model", reuse=True, initializer=initializer):
  mtest = PTBModel(is_training=False, num_steps=1)

What is happening here is that the instance m is created with 20 steps and with no reuse initially.  As you can see from the __init__ code above, that will cause the loop to unroll 20 copies of the cell in the graph, and after the first iteration the reuse flag is set to True, so all instances will share the same W and b.   The training works on this unrolled version.   The second version mtest has reuse = True and only one instance of the cell in the graph, but its variable scope is the same as m, so it shares the same trained variables as m.

Once trained, we can invoke the network with a kernel like the following.

cost, state = sess.run([mtest.cost, mtest.final_state],
                                 {mtest.input_data: x,
                                  mtest.targets: y,
                                  mtest.initial_state: state})

Here x and y are the inputs. This is far from the complete picture of the tutorial example. For example, I have not gone into the training at all, and the full example uses a stacked LSTM cell and a dropout wrapper. My hope is that the detail I have focused on here will help the reader understand the basic structure of the code.

Final Observations

I promised a programming model comparison of the two systems.    Here are some top level thoughts.

  1. TensorFlow and CNTK are very similar for the simple convolutional neural network example.  However, I found the TensorFlow version easier to experiment with because it is driven by Python. I was able to load it as an IPython notebook and try different things.  With CNTK one needs to completely understand how to express things with the configuration file, and I found that difficult.   With TensorFlow I was able to write a simple k-means clustering algorithm (see my previous post on TensorFlow).   I was unable to do this with CNTK, and that may be due to my cluelessness rather than a limit of CNTK.  (If somebody knows how to do it, I would appreciate a tip.)
  2. In the case of the LSTM recurrent neural network, I found the CNTK version to be completely transparent.   In the case of TensorFlow, I found the top level idea very elegant, but I also found it very difficult to understand all the details because of the clever use of variable scoping and variable sharing.   I had to dig very deep to understand how it worked, and it is not clear that I have it all yet!   I did find one trivial bug in the TensorFlow version that was easy to fix, and I am not convinced that the variable scoping and reuse flags, which are there to solve an encapsulation problem, are the best solution.  But the good thing about TensorFlow is that I can easily experiment with alternatives.
  3. I must also say that the CNTK book and the TensorFlow tutorials are both excellent introductions to the high level concepts.  I am sure more detailed, deep-dive books will come out soon.

I am also convinced that as both systems mature they will improve and become easier to program.   I did not discuss performance, but CNTK is the current champ in terms of speed on some difficult challenges.  But with the rapid evolution of these systems I expect to see the competition heat up.

Raspberry meets Project Oxford: IoT & Judging a Book by Its Cover.

When I was (very) young I enjoyed building electronic devices from kits and from wiring diagrams I found in exciting journals like "Popular Electronics".  Radio Shack was my friend and Heathkits were a joy when I could afford them.   Today I marvel at the devices kids can tinker with.   They can build Arduino powered Lego robots that are lightyears beyond my childhood Lincoln Logs and Erector Sets.   And they have access to another relatively recent, world changing technology:  they have the ability to create software that brings their creations to life.  They can add sensors (accelerometers, GPS, altimeters, thermometers, gyroscopes, vibration, humidity and more) to their projects as well as actuators and motors of various types.  And these creations can live on the Internet.   Devices like the Particle (formerly Spark) Photon and Electron IoT boards include WiFi or cellular links so instruments can send data to the cloud.  They are programmed with the same C-like language used by the Arduino.   Photon devices are extremely small and very inexpensive so they can be deployed in very large numbers.

Indeed, these devices are the atomic particles of the Internet of Things (IoT).  They are being turned into modestly built home safety sensors as well as sophisticated instruments that control highly sensitive scientific experiments.   They will be the tools city planners will use to understand our complex urban environments.   For example, the Array of Things project at the University of Chicago and Argonne Labs is deploying an amazing small sensor pack that will sit on light poles throughout the city.   (Check out this video.)  Charlie Catlett and Pete Beckman are leading the AoT project.  In discussions with Charlie he told me that he has built a full array of sensors in his home to monitor various devices.   He also pointed me to the MathWorks ThingSpeak product, which provides a lovely Matlab-based cloud platform for integrating and analyzing the streams of data from your instruments.

In a previous post I described ways in which event data could be streamed to the cloud, and I will return to that topic again in the future.   In this article I wanted to look at what could be done with a small device with a video camera when combined with analysis in the cloud.  I chose the Raspberry Pi 2 and the cloud services from the new Microsoft Research Project Oxford.

Phone Apps and Project Oxford

While the instruments described above will eventually dominate the Internet of Things, the billions of smart phones out there today give us a pretty good idea of the challenge of managing data from sensors.   Almost every truly useful "app" on a smart phone interacts with services in the cloud in order to function properly.    And each of these apps does part of its computation locally and part remotely.   Cortana and Siri can do the speech-to-text translation by invoking a speech model on the phone, but answering the query involves another set of models running in the cloud.    For any interesting computational task, including the tasks assigned to the Array of Things sensors, some computation must be done locally and some must be remote.   In some cases, the local computation is needed to reduce the amount of time spent communicating with the back end services, and in other cases, for example for privacy reasons, not all locally collected information should be transmitted to the cloud.  (What happens to my phone in Vegas should stay in Vegas.)  Indeed, according to Catlett and Beckman, protecting privacy is a prime directive of the Array of Things project.

Companies like Amazon, Google, IBM, Microsoft and Salesforce have been working hard to roll out data analytics services hosted on their cloud platforms.   More recently these and other companies have been providing machine learning services in the cloud.   I have already talked about AzureML, a tool that allows you to build very powerful analysis tools, but for people who want to build smart apps there are now some very specialized services that do not require them to be ML domain experts.  A good example is IBM's Watson Services, which is a result of IBM's work on bringing its great Jeopardy-playing capability to a much broader class of applications.  Watson has a great collection of language and speech analysis and translation services.   In addition, it has an impressive collection of image analysis tools.

Project Oxford is another interesting collection of cloud services that cover various topics in computer vision, speech and language.  In the area of speech, they provide the APIs needed for speech to text and text to speech.  We are all familiar with the great strides taken in speech recognition with Siri, Cortana, Echo and others.   With these APIs one can build apps for IOS, Windows and Android that use speech recognition or speech generation.   The language tools include spell checkers, a nifty language understanding tool that you can train to recognize intent and action from utterances such as commands like “set an alarm for 1:00 pm.”  The computer vision capabilities include scene analysis, face recognition, and optical character recognition (OCR).  These are the services I will explore below.

Raspberry Pi2 and the OpenCV computer vision package.

The Raspberry Pi 2 is a very simple, credit card sized, $35 single board computer with a very versatile collection of interfaces. It has a Broadcom VideoCore GPU, a quad-core ARMv7 processor and 1 GB of memory.  For a few extra dollars you can attach a five-megapixel camera.  (In other words, it is almost as powerful as a cellphone.) The Pi 2 can run Windows 10 IoT Core or a distribution of Linux.  I installed Linux because it was dead easy.   To experiment with the camera, I needed the OpenCV computer vision tools.  Fortunately, Adrian Rosebrock has documented the complete installation process in remarkable detail.  (His book also provides many useful coding examples that I used to build my experiments.)

Object Tracking

One of the most obvious sensing challenges one can tackle with a small internet-connected device with a camera is object tracking.   With OpenCV there are some very sophisticated ways to do this, but for some tasks it is trivial.   Outside my window there is a small harbor, and I decided to track the movements of the boats.   This was easy because they are all pleasure boats, so the vast majority of them are white.  By filtering the image to bring out white objects on a black background, the OpenCV functions "findContours" and "minAreaRect" require only a few lines of code to draw bounding boxes around suspected boats. (Full source code for the examples here is in Github.) With the Pi device and camera sitting near the window, I was able to capture the scene below in Figure 1.  Note that it took some scene-specific editing.  Specifically, it was necessary to ignore contours that were in the sky.   As can be seen, some objects that were not boats were also selected.    The next step is to filter based on the motion of the rectangles; those are the rectangles worth tracking. Unfortunately, it being winter, these pleasure craft haven't moved for a while, so I don't have a video to show you.

boats2-capture

Figure 1.   Using OpenCV to find boats in a harbor
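The kernel of the detection loop looks roughly like the following (a sketch of the approach, not the exact code in the repo; the threshold and area values are guesses):

import cv2
import numpy as np

image = cv2.imread("harbor.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Keep only the bright, near-white pixels (the boat hulls).
_, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)

# findContours returns 2 or 3 values depending on the OpenCV version.
cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

for c in cnts:
    if cv2.contourArea(c) < 100:                 # ignore tiny specks
        continue
    box = cv2.boxPoints(cv2.minAreaRect(c))      # cv2.cv.BoxPoints on OpenCV 2.4
    cv2.drawContours(image, [np.int0(box)], 0, (0, 0, 255), 2)

cv2.imwrite("boats.jpg", image)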

Face Recognition

By face recognition we mean identifying those elements in a picture that are human faces. This is a far lower bar than face identification (who is that person?), but it is still an interesting problem and potentially very useful.  For example, this could be used to approximate the number of people in a scene if you know what fraction of them are facing the camera.   Such an approximation may be of value if you only wanted to know the density of pedestrian traffic.  You could avoid privacy concerns because you would not need to save the captured images to non-volatile storage. It is possible to use OpenCV alone to recognize faces in a picture and never send the image outside the device.

If you do want to know more information about a group of people, Project Oxford can do much better.  Given an image of a group it can identify the faces looking in the general direction of the camera, and for each face it can give a number of interesting features such as gender, estimated age and even a "smile" index.   This information may be of interest to a business, such as a movie theater owner who wants to know the gender makeup and average age of the patrons.  It could even tell, by counting smiles, if they enjoyed the movie.    I used Project Oxford to do an analysis of two photos.   The first is our class of amazing students and staff from the 2014 MSR summer school in Moscow.   The vision service also returned a rectangle enclosing each face.   I used OpenCV to draw the rectangle: a pink border for males and green for females if they were smiling; if they were not smiling, they got a black rectangle.  As can be seen, the reporting was very accurate for this staged group photo.

yandex-face-out

Figure 2.   Project Oxford Face Recognition of MSR Summer School 2014 group photo taken at Yandex headquarters Moscow.  Pink = male, green = female, black = no smile

The system also allowed us to compute the estimated average age: approximately 21 for females and 24 for males.  (Some "senior" professors in the photo skewed the average for males.)   Figure 3 below shows the result for a stock street scene.  Very few smiling faces, but that is not a surprise.   It also illustrates that the system is not very good at faces in partial or full profile.

sidewalk-crowd-face-out

Figure 3.   Many faces missing in this crowd, but one is smiling.
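The call to the face service is just a REST request from the Pi.  Below is roughly what my script did (a sketch from memory: the endpoint, parameter names and JSON fields follow the Project Oxford documentation of that time and may have changed since; MY_KEY stands for your own subscription key):

import requests

url = "https://api.projectoxford.ai/face/v1.0/detect"
headers = {"Ocp-Apim-Subscription-Key": MY_KEY,
           "Content-Type": "application/octet-stream"}
params = {"returnFaceAttributes": "age,gender,smile"}

with open("group.jpg", "rb") as f:
    faces = requests.post(url, params=params, headers=headers, data=f.read()).json()

for face in faces:
    a = face["faceAttributes"]
    print(a["gender"], a["age"], "smiling" if a["smile"] > 0.5 else "not smiling")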

Text Recognition

Optical Character Recognition (OCR) has been used in document scanners for a long time, but it is much harder to read text when it is embedded in a photo with other objects around it.    For example, programming the Pi to call the Oxford OCR service and pointing the camera at the image in Figure 4 produced the output in the superimposed rectangle (superimposition done with PowerPoint, not OpenCV).

oxford-text-out-final

Figure 4.   Oxford OCR test from Pi with command line output in black rectangle.

As can be seen, this was an easy case and it made only one error.     However, it is not hard to push the OCR system beyond its limits based on distance, lighting and the angle of the image.

To demonstrate the power of doing some computation on the local device and some of it remotely in the cloud, we can take a scene that is too hard for Oxford to understand directly and do some preprocessing with OpenCV.   In Figure 5 we have taken a scene with the same piece of paper with text and placed it far from the camera and at a funny angle.   This yielded no result with Oxford OCR directly.   But in this case we used the same filtering technique used in the boat tracking experiment and located the rectangle containing the paper.  Using another OpenCV transformation we transformed that rectangle into a 500 by 300 image with the rotation removed (as shown in the insert).   We sent that transformed image to Oxford and got a partial result (shown in the black rectangle).

oxford-text-out-with-opencv

Figure 5.  Transformed image and output showing the coordinates of the bounding rectangle and output from Oxford OCR.   The green outline in the picture was inserted by OpenCV drawing functions using the bounding rectangle.
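The rectification step uses OpenCV's perspective transform.  A sketch (assuming corners holds the four corners of the paper's bounding rectangle found by the contour step, ordered top-left, top-right, bottom-right, bottom-left, and image is the captured frame):

import cv2
import numpy as np

# Map the located paper rectangle onto a flat 500 x 300 image.
dst = np.array([[0, 0], [499, 0], [499, 299], [0, 299]], dtype="float32")
M = cv2.getPerspectiveTransform(np.array(corners, dtype="float32"), dst)
warped = cv2.warpPerspective(image, M, (500, 300))
cv2.imwrite("rectified.jpg", warped)   # this image is what gets sent to Oxford OCR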

This experiment is obviously contrived.  How often do you want to read text at a distance printed on a white rectangle?  Let’s look at one final example below.

Judging a Book by Its Cover

An amusing app for the Oxford OCR service is finding information about a book from an image of its cover.   Of course, this is not a new idea.  Amazon has an app called "Flow" from their A9 Innovations subsidiary.  More on that later.    What I did was to integrate Oxford OCR with a call to Bing.    It works as follows.  The Pi is shown the image of a book, and the OCR app looks for text on the image like the title or author.   That returned string is then sent to Bing via a web service call.   The top five results are put into an HTML document along with the image and served up by a tiny webserver on the Pi.   The first results are shown below.

bing-reader-out1

Figure 6.   Output of the book cover reader app webserver on the Pi.  The first result is correct.
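In outline, the app is only a few lines long.  The sketch below uses two hypothetical helpers, ocr_text() and bing_top_results(), standing in for the Project Oxford OCR call and the Bing web-search call described above:

query = ocr_text("cover.jpg")                 # text the OCR service found on the cover
hits = bing_top_results(query, count=5)       # list of (title, url) pairs from Bing

page = "<html><body><img src='cover.jpg'/><ol>"
for title, url in hits:
    page += "<li><a href='%s'>%s</a></li>" % (url, title)
page += "</ol></body></html>"

open("index.html", "w").write(page)           # served by the tiny webserver on the Pi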

The interesting thing about this is that Bing (and I am sure Google as well) is powerful enough to take partial results from the OCR read and get the right book on the top hit.  The correct text is

"The Fabric of Reality.   A leading scientist interweaves evolution, theoretical Physics and computer science to offer a new understanding of reality."  (font changes are reproduced as shown on the book cover)

What the OCR system was able to see was "THE FABRIC I scientist evolution, physics. Computer science a new understanding."   The author's name is occluded by a keyboard cable.

Unfortunately, the reliability at a distance was not great.  I next decided to try to enhance the image using the technique used in the “text on white rectangle” as illustrated in Figure 5.    Of course this needed a white book, so it is not really practical.  However, the results did improve the accuracy (for white books).   An example that the OCR-only version failed to recognize but that worked using the OpenCV enhanced version is shown in Figure 7.

black_swawn

Figure 7.   Using OpenCV with Oxford and Bing.

As can be seen the OCR saw “the black sivan nassim aicholas tal”, which is not very accurate. Even with the lower right corner of the book out of the image I would have expected it to do a bit better.   But Bing was easily able to figure it out.

I downloaded the Flow app and installed it on a little Android tablet.   It works very well up close but it could not handle the distance shots illustrated above.   Of course this is not a proper scientific comparison.    Furthermore, I have only illustrated a few trivial features of OpenCV and I have not touched the state-of-the-art object recognition work.  Current object recognition systems based on deep learning are amazingly accurate.  For example, MSR's Project Adam was used to identify thousands of different types of objects in images (including hundreds of breeds of dogs).   I expect that Amazon has the images of tens of thousands of book covers.   A great way to build the app above would be to train a deep network to recognize those covers in an image.  I suspect that Flow may have done something like this.

Final Thoughts

The examples above are all trivial in scope, but they are intended to illustrate what one can do with a very simple little device, a few weeks of programming and access to tools like Project Oxford.   We started with a look at IoT technologies and moved on to simple "apps" that use computer vision.   This leads us to think about the topic of augmented reality, where our devices are able to tell us about everything and everyone around us.   The "everyone" part of this is where we have the most difficulty.   I once worked on a project called the "intelligent memory assistant".   Inspired by a comment from Dave Patterson, the app would use a camera and the cloud to fill in the gaps in your memory, so that when you met a person but could not remember their name or where they were from, it would whisper into your ear and tell you "This is X and you knew him in 1990 …. ".  It is now possible to build this, but the privacy issues raise too many problems.   This is sometimes referred to as the "creepy factor". For example, people don't always want to be "tagged" in a photo on Facebook.   On the other hand, uses like identifying lost children or amnesiacs are not bad. And the non-personal components of augmented reality are coming fast and will be here to stay.

PS: Code for these examples is now in this GitHub repo.

An Encounter with Google’s TensorFlow

NOTE: This is a revised version of this blog that reflects much better ways to do some of the tensor algebra in the first example below.

Google has recently released some very interesting new tools to the open source community.   First came Kubernetes, their container microservice framework, and that was followed by two new programming systems based on dataflow concepts.   Dataflow is a very old idea that first appeared in the computer architecture community in the 1970s and 80s.   Dataflow was created as an alternative to the classical von Neumann computer design.  It was hoped that it would have a performance advantage because it would exploit much greater levels of parallelism than thought possible with classical computers[1].  In dataflow systems computation can be visualized as a directed graph where the vertices of the graph are operators and data "flows" through the system along the edges of the graph.  As soon as data is available on all the input edges of an operator node, the operation is carried out and new data is put on the output edges of the node.  While only a few actual dataflow computers were built, the concept has been fundamental to distributed and parallel computing for a long time.  It shows up in applications like complex event processing, stream analytics and systems like the Microsoft AzureML programming model I described earlier.

Google’s newly released Cloud Dataflow is a programming system for scalable stream and batch computing.  I will return to Cloud Dataflow in another post, but for now, I will focus on the other dataflow system they released.  Called TensorFlow, it is a programming system designed to help researchers build deep neural networks[2] and it appears to be unrelated to Cloud Dataflow.  All the documentation and downloadable code for TensorFlow are on-line.  TensorFlow is also similar to Microsoft Research’s Computational Network Toolkit (CNTK) in several ways that I will describe later.

TensorFlow is designed to allow the programmer to easily "script" a dataflow computation where the basic units of computing are very large multi-dimensional arrays.   The scripts are written in Python or C++, and TensorFlow works very well with IPython/Jupyter notebooks.   In the following pages I will give a very light introduction to TensorFlow programming and illustrate it by building a bare-bones k-means clustering algorithm.   I will also briefly describe one of their examples of a convolutional neural network.

TensorFlow can be installed and run on your laptop, but as we shall show below, it really shines on bigger, more powerful hardware.  An interesting thing happened in the recent evolution of deep neural networks.   The most impressive early work on really large deep neural networks was done on large cloud-scale clusters.  However, the type of parallelism needed for deep network computation is really very large array math, which is better suited to execution on a GPU, or a bunch of GPUs on a massive memory machine, or a cluster of massive memory machines with a bunch of GPUs each.  For example, the Microsoft CNTK has achieved some remarkable results on 8 GPU systems and it will soon be available on the Azure GPU Lab. (I also suspect that supercomputers such as the SDSC Comet with large memory, multi-GPU, multi-core nodes would be ideal.)

TensorFlow: a shallow introduction.

There is a really nice white paper by the folks at Google Research with far more details about the TensorFlow system architecture than I give here.  What follows is a shallow introduction and an experiment.

There are two main concepts in TensorFlow.   The first is the idea of computing on objects that are very large multi-dimensional arrays called tensors.   The second is the fact that the computations you build with tensors are compiled into graphs that are executed in a "dataflow" style.   We need to unpack both of these concepts.

Tensors

Let's start with tensors.  First, these are not your great-great grandfather's tensors.   Those tensors were introduced by Carl Friedrich Gauss for differential geometry where they were needed to provide metrics that could be used to describe things like the curvature of surfaces.  Einstein "popularized" tensors in his general theory of relativity.  Those tensors have a very formal algebra of covariant and contravariant forms. Fortunately, we don't have to go there to understand the use of tensors in TensorFlow's machine learning applications.   In fact, if you understand Numpy arrays you are off to a good start, and Numpy arrays are just really efficient multidimensional arrays.   TensorFlow can be programmed in Python or C++, and I will use the Python binding in the discussion below.

In TensorFlow tensors are created and stored in container objects that are one of three types: variables, placeholders and constants.   Let’s start with constants.  As the name implies constant tensors are initialized once and never changed.   Here are two different ways to create constant arrays that we will use in an example below.  One is a tensor of size Nx2 filled with zeros and the other is a tensor of the same shape filled with the value 1.

import numpy as np
import tensorflow as tf
N = 10000
X = np.zeros(shape=(N,2))
Xv = tf.constant(X, name="points")
dones = tf.fill([N,2], np.float64(1.))

We have used a Numpy array of values to create the TensorFlow constant. Notice that we have given our tensor a “name”. Naming tensors is optional but it comes in very handy when debugging.
Variables are holders of multidimensional arrays that persist across sessions and may be modified and even saved to disk. The concept of a “session” is important in TensorFlow. You can think of it as a context where actual TensorFlow calculations take place. The first calculation involving a variable is to load it with initial values. For example, let’s create a 2 by 3 tensor that, when initialized contains the constant 1.0 in every element. Then let’s convert that back to a Numpy array and print it. To do that we need a session and we will call the variable initializer in that session. When working in the IPython notebook it is easiest to use an “InteractiveSession” so that we can easily edit and redo computations.

sess = tf.InteractiveSession()
myarray = tf.Variable(tf.constant(1.0, shape = [2,3]))
init =tf.initialize_all_variables()
sess.run(init)
mynumpy =myarray.eval()
print(mynumpy)
[[ 1.  1.  1.]
 [ 1.  1.  1.]]

As shown above, the standard way to initialize variables is to initialize them all at once. The process of converting a tensor back to a Numpy array requires evaluating the tensor with the “eval” method. As we shall see this is an important operator that we will describe in more detail below.

The final tensor container is a "placeholder".   Creating a placeholder object only requires specifying a type and some information about its shape.   We don't initialize a placeholder like a variable because its initial values will be provided later in an eval()-like operation.  Here is a placeholder we will use later.

x = tf.placeholder(tf.float32, [None, 784], name="x")

Notice that in this case the placeholder x is two dimensional but the first dimension is left unbound.  This allows us to supply it with a value that is any number of rows of vectors of length 784.   As we shall see, this turns out to be very handy for training neural nets.
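For example (a small sketch using the InteractiveSession and the placeholder defined above), the value is supplied through a feed_dict when the expression is evaluated:

rowsums = tf.reduce_sum(x, 1)                  # symbolic: one sum per supplied row
batch = np.random.rand(5, 784).astype(np.float32)
print(rowsums.eval(feed_dict={x: batch}))      # five numbers, one per row of the batch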

Dataflow

The real magic of TensorFlow is in the dataflow execution of computations.   A critical idea behind TensorFlow is to keep the slow world of Python away from the fast world of parallel tensor algebra as much as possible.  Python is used only to describe the dataflow graph of the computation.   TensorFlow has a large library of very useful tensor operators.   Let's describe a computation we will use in the k-means example.   Suppose we have 8 points in the plane in a vector called Xv and 4 special points in an array called kPoints.   I would like to label each of the 8 points with the index of the special point nearest to it.   This should give me 4 clusters of points.   I now want to find the centroid of each of these clusters.  Assume we have a tensor called "blocked" where row i gives the distance from each of our 8 points to the ith special point.  For example, if "blocked" is as shown below then what we are looking for is the index of the smallest element in each column.

tensor-table

To find the centroids I use this min index vector to select the elements of Xv in each cluster. We just add them up and divide by the number of points in that cluster.    The TensorFlow code to do this is shown below.  TensorFlow has a large and handy library of tensor operators, many of which are similar to Numpy counterparts.   For example, argmin computes the index of the smallest element in an array given the dimension over which to search.   Unsorted_segment_sum will compute a sum using another vector to define the segments.   The min index vector nicely partitions the Xv vector into the appropriate "segments" for the unsorted_segment_sum to work.  The same unsorted_segment_sum operator can be used to count the number of elements in each cluster by adding up 1s.

mins = tf.argmin(blocked, 0, name="mins")
sums = tf.unsorted_segment_sum(Xv, mins, 4)
totals = tf.unsorted_segment_sum(dones, mins, 4, name="sums")
centroids = tf.div(sums, totals, name = "newcents")
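As a quick aside, here is a tiny, stand-alone illustration (made-up values, using the InteractiveSession created earlier) of what unsorted_segment_sum computes: each point's row is added into the segment named by its entry in the index vector.

pts = tf.constant([[1., 1.], [4., 0.], [2., 2.]])
idx = tf.constant([1, 0, 1])                    # cluster index for each point
print(tf.unsorted_segment_sum(pts, idx, 2).eval())
# [[ 4.  0.]   <- sum of the points assigned to cluster 0
#  [ 3.  3.]]  <- sum of the points assigned to cluster 1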

The key idea is this.  When the four-line kernel above is executed it builds a data flow graph with three inputs (blocked, Xv, dones) and outputs a tensor called centroids as shown below[3].[4]

The computation does not start until we attempt to evaluate the result with a call to centroids.eval(). This sort of lazy evaluation means that we can string together as many tensor operators as needed that will all be executed outside the Python REPL.

tensor-graph1

Figure 1.  Computational flow graph

TensorFlow has a very sophisticated implementation.  The computing environment upon which it is running is a collection of computational devices.  These devices may be cpu cores, GPUs or other processing engines.   The TensorFlow compilation and runtime system maps subgraphs of the flow graph to these devices and manages the communication of tensor values between the devices.  The communication may be through shared memory buffers or send-receive message pairs inserted in the graph at strategically located points.   Just as the dataflow computer architects learned, dataflow by itself is not always efficient.  Sometimes control flow is needed to simplify execution.  The TensorFlow system also inserts needed control flow edges as part of its optimizations.   As was noted above, this dataflow style of computation is also at the heart of CNTK.   Another very important property of these dataflow graphs is that it is relatively easy to automatically compute derivatives by using the chain-rule working backwards through the graph.   This makes it possible to automate the construction of gradients that are important for finding optimal parameters for learning algorithms.   I won't get into this here but it is very important.

A TensorFlow K-means clustering algorithm

With the kernel above we now have the basis for a simple k-means clustering algorithm shown in the code below.  Our initial centroid array is a 4 by 2 Numpy array we shall call kPoints.   What is missing is the construction of the distance array blocked.   Xv holds all the N points as a constant tensor.   My original version of this program used Python code to construct blocked.  But there is an excellent improvement on the computation of "mins" published on GitHub by Shawn Simister.  Shawn's version is better documented and about 20% faster when N is in the millions.  Simister's computation does not need my blocked array; instead it expands the Xv and centroids vectors and uses a reduction to get the distances.  Very nice.  (This is the revision to the blog post that I referred to above.)

N = 10000000
k = 4
#X is a numpy array initialized with N (x,y) points in the plane
Xv = tf.constant(X, name="points")
kPoints = [[2., 1.], [0., 2.], [-1., 0], [0, -1.]]
dones = tf.fill([N,2], np.float64(1.))

centroids = tf.Variable(kPoints, name="centroids")
oldcents = tf.Variable(kPoints)
initvals = tf.constant(kPoints, tf.float64)

for i in range(20):
    oldcents = centroids
    #This is the Simister mins computation
    expanded_vectors = tf.expand_dims(Xv, 0)
    expanded_centroids = tf.expand_dims(centroids, 1)
    distances = tf.reduce_sum( tf.square(
               tf.sub(expanded_vectors, expanded_centroids)), 2)
    mins = tf.argmin(distances, 0)
    #compute the new centroids as the mean of the points nearest
    sums = tf.unsorted_segment_sum(Xv, mins, k)
    totals = tf.unsorted_segment_sum(dones, mins, k, name="sums")
    centroids = tf.div(sums, totals, name = "newcents")
    #compute the distance the centroids have moved sense the last iteration
    dist = centroids-oldcents
    sqdist = tf.reduce_mean(dist*dist, name="accuracy")
    print(np.sqrt(sqdist.eval()))
    kPoints = centroids.eval()

However, this version is still very inefficient.  Notice that we construct a new execution graph for each iteration.   A better solution is to pull the graph construction out of the loop, construct it once, and reuse it.   This is nicely illustrated in Simister's version of k-means on his Github site.
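A minimal sketch of what "build the graph once" looks like (my own restructuring, not Simister's code): the centroids become a placeholder that is fed with the previous iteration's result.

# Build the flow graph once, outside the loop.
cents_in = tf.placeholder(tf.float64, [k, 2])
expanded_vectors = tf.expand_dims(Xv, 0)
expanded_centroids = tf.expand_dims(cents_in, 1)
distances = tf.reduce_sum(tf.square(tf.sub(expanded_vectors, expanded_centroids)), 2)
mins = tf.argmin(distances, 0)
sums = tf.unsorted_segment_sum(Xv, mins, k)
totals = tf.unsorted_segment_sum(dones, mins, k)
new_cents = tf.div(sums, totals)

# Iterate by re-running the same graph with the updated centroids.
cents = np.array(kPoints)
for i in range(20):
    cents = sess.run(new_cents, feed_dict={cents_in: cents})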

It is worth asking how fast TensorFlow is compared with a standard Numpy version of the same algorithm. Unfortunately, I do not have a big machine with fancy GPUs, but I do have a virtual machine in the cloud with 16 processors. I wrote a version using Numpy and executed them both from an IPython notebook. The speed-up results are in Figure 2. Comparing these we see that simple Python Numpy is faster than TensorFlow for values of N less than about 20,000. But for very large N we see that TensorFlow can make excellent use of the extra cores available and exploits the parallelism in the tensor operators very well.

kmeans-speed

Figure 2. Speed-up of TensorFlow on 16 cores over a simple Numpy implementation. The horizontal axis is log10(N).

I have placed the source for this notebook in Github. Also an improved version based on Simister’s formulation can be found there.

TensorFlow also has a very interesting web tool called TensorBoard that lets you look at various aspects of the logs of your execution. One of the more interesting TensorBoard displays is the dataflow graph for your computation. Figure 3 below shows the display generated for the k-means computational graph.

tensorboard

Figure 3.   TensorBoard graph display of the k-means algorithm.

As previously mentioned, this is far more complex than my diagram in Figure 1.   Without magnification it is almost impossible to read.  Figure 4 contains a close-up of the top of the graph where the variable newcents is created.   Because this is an iteration, the dataflow actually contains multiple instances of the computation, and clicking on the bubble for the variable shows the details seen in the close-up.

tensorboard3

Figure 4.   A close-up of part of the graph showing expanded view of the multiple instances of the computation of the variable newcents.

A brief look at a neural network example

The k-means example above is really just an exercise in tensor gymnastics. TensorFlow is really about building and training neural networks.   The TensorFlow documents have a number of great examples, but to give you the basic idea I'll give you a brief look at their convolutional deep net for the MNIST handwritten digit example.   I'll try to keep this short because I will not be adding much new here. The example is the well-known "hello world" of machine learning image recognition.   The setup is as follows.   You have thousands of 28 by 28 black and white images of handwritten digits, and the job of the system you build is to identify each.   The TensorFlow example uses a two-layer convolutional neural net followed by a large, densely connected layer and a readout layer.

If you are not familiar with convolutional neural nets, here is the basic idea.   Images are strings of bits but they also have a lot of local 2d structure such as edges or holes or other patterns.  What we are going to do is look at 5×5 windows to try to “find” these patterns.  To do this we will train the system to build a 5×5 template W array (and a scalar offset b) that will reduce each 5×5 window to a point in a new array conv by the formula

conv[p, q] = Σ_{i=0..4} Σ_{j=0..4} W[i, j] · image[p + i − 2, q + j − 2] + b

(the image is padded near the boundary points in the formula above so none of the indices are out of bounds)

We next modify the conv array by applying the "ReLU" function max(0,x) to each x in the conv array so it has no negative values.   The final step is to do "max pooling".  This step simply computes the maximum value in a 2×2 block and assigns it to a smaller 14×14 array.   The most interesting part of the convolutional network is the fact that we do not use one 5×5 template but 32 of them in parallel producing 32 14×14 result "images" as illustrated in Figure 5 below.

convolutional

Figure 5.   The first layer convolutional network.

When the network is fully trained each of the 32 5×5 templates in W is somehow different and each selects for a different set of features in the original image. One can think of the resulting stack of 32 14×14 arrays (called h_pool1) as a type of transform of the original image much as a Fourier transform can separate a signal in space and time and transform it into frequency space. This is not what is going on here but I find the analogy helpful.
We next apply a second convolutional layer to the h_pool1 tensor, but this time we apply 64 sets of 5×5 filters to each of the 32 h_pool1 layers (adding up the results) to give us 64 new 14×14 arrays, which we reduce with max pooling to 64 7×7 arrays called h_pool2.

Rather than provide the whole code, which is in the TensorFlow tutorial, I will skip the training step and show you how you can load the variables from a previous training session and use them to make predictions from the test set.   (The code below is a modified version of the Google code found at GitHub and subject to their Apache License.)

Let’s start by creating some placeholders and variables.  We start with a few functions to initialize weights.   Next we create a placeholder for our image variable x which is assumed to be a list of 28×28=784 floating point vectors.  As described above, we don’t know how long this list is in advance. We also define all the weights and biases described above.

def weight_variable(shape, names):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial, name=names)

def bias_variable(shape, names):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial, name=names)

x = tf.placeholder(tf.float32, [None, 784], name="x")

sess = tf.InteractiveSession()

W_conv1 = weight_variable([5, 5, 1, 32], "wconv")
b_conv1 = bias_variable([32], "bconv")
W_conv2 = weight_variable([5, 5, 32, 64], "wconv2")
b_conv2 = bias_variable([64], "bconv2")
W_fc1 = weight_variable([7 * 7 * 64, 1024], "wfc1")
b_fc1 = bias_variable([1024], "bfcl")
W_fc2 = weight_variable([1024, 10], "wfc2")
b_fc2 = bias_variable([10], "bfc2")

Next we will do the initialization by loading all the weight and bias variable that were saved in the training step. We have saved these values using TensorFlow’s save state method previously in a temp file.

saver = tf.train.Saver()
init =tf.initialize_all_variables()
sess.run(init)
saver.restore(sess, "/tmp/model.ckpt")

We can now construct our deep neural network with a few lines of code. We start with two functions to give us a bit of shorthand to define the 2D convolution and max_pooling operators. We first reshape the image vectors into the image array shape that is needed. The construction of the flow graph is now straightforward. The final result is the tensor we have called y_conv.

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

#first convolutional layer
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)
#second convolutional layer
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
#final layer
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
y_conv=tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2)

Notice that we have not evaluated the flow graph even though we have the values for all of the weight and bias variables already loaded. We need a value for our placeholder x. Suppose tim is an image from the test set. To see if our trained network can recognize it all we need to do is supply it to the network and evaluate. To supply the value to the network we use the eval() function with a special “feed_dict” dictionary argument. This just lists the name of the placeholder and the value you wish to give it.

tim.shape = ((1,28*28))
y = y_conv.eval(feed_dict={x: tim})
label = np.argmax(y)
print(label)

You can read the TensorFlow tutorial to learn about the network training step.   They use a special function that does the stochastic backpropagation that makes it all look very simple.

How does this thing work?

Developing an intuition for how the convolutional neural network actually works is a real challenge.  The fact that it works at all is amazing.  The convolutional steps provide averaging that helps create a bit of location and scale invariance, but there is more going on here.   Note that given a 28×28 image, the output of the second convolutional step is a stack of 64 7×7 arrays that represent feature activations generated by the weight templates.   It is tempting to look at these as images to see if one can detect anything.  Obviously the last fully connected layer can do a great job with them, but it is also easy to see what they look like.   If we apply h_pool2.eval(feed_dict={x: image}) we can view the result as a set of 64 images.  Figure 6 does just that.  I picked two random 9s, two 0s, two 7s and three 6s.   Each column in the figure depicts the first 9 of the 64 images generated by one input digit.    If you stare at this long enough you can see some patterns there (such as the diagonals in the 7s and the cup shape of the 4s), but very little else.   On the other hand, taken together, each (complete) column is a sufficient signature for the last layer of the network to identify the digit with over 99% accuracy.   That is impressive.

[image: conv_digits.JPG]

Figure 6.  Images of h_pool2 given various images of handwritten digits.

I have put the code to generate these images on GitHub, along with the code to train the network and save the learned weights.  In both cases these are IPython notebooks.
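
If you just want the flavor of it, here is a minimal sketch of the visualization idea, assuming tim is a single test image as above and that matplotlib is available; the notebook on GitHub does the full version.

import matplotlib.pyplot as plt

acts = h_pool2.eval(feed_dict={x: tim})      # result has shape (1, 7, 7, 64)
for i in range(9):                           # show the first 9 of the 64 feature maps
    plt.subplot(3, 3, i + 1)
    plt.imshow(acts[0, :, :, i], cmap='gray')
    plt.axis('off')
plt.show()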

A much better way to represent the data learned in the convolutional networks is to “de-convolve” the activations back to the pixel level.  In Visualizing and Understanding Convolutional Networks, Matthew Zeiler and Rob Fergus do this, and the results are fascinating.

Final Notes

I intended this note to be a gentle introduction to the programming style of TensorFlow and not an introduction to deep neural networks, but I confess I got carried away towards the end.   I will follow up this post with another that looks at recurrent networks and text processing, provided I can think of something original to say that is not already in their tutorial.

I should also note that the construction of the deep network is, in some ways, similar to Theano, the other great Python package for building deep networks.  I have not looked at how Theano scales or how well it can be used for tasks beyond deep neural networks, so don’t take this blog as an endorsement of TensorFlow over Theano or Torch.

On a final note I must say that there is another interesting approach to the image recognition problem.  Antonio Criminisi has led a team that has developed Deep Neural Decision Forests, which combine ideas from decision forests and CNNs.

[1] This was before the explosion of massively parallel systems based on microprocessors changed the high-end computing world.

[2] If you are new to neural networks and deep learning, you may want to look at Michael Nielsen’s excellent new on-line book. There are many books that introduce deep neural networks, but Nielsen’s mathematical treatment is very clear and approachable.

[3] This is not a picture of the actual graph that is generated.   We will use another tool to visualize that later in this note.

[4] The observant reader will note that we have a possible divide by zero here if one of the special points is a serious outlier.   As I said, this is a bare-bones implementation.

Streaming Events to AzureML Through Azure Stream Analytics


UPDATE!  Microsoft has just recently released a much better way to integrate Azure Machine Learning with Azure Stream Analytics.  You can now call an AzureML service directly from the SQL query in the stream analytics system.  This means you don’t need the message bus and microservice layer I describe below.   Check out the blog by Sudhesh Suresh and the excellent tutorial from Jeff Stokes.   I will do a performance analysis to compare this method to the one below, but I will wait until the new feature comes out of “preview” mode :).   I will also use a better method to push events to the eventhub.



Doing sophisticated data analytics on streaming data has become a really interesting and hot topic.   There is a huge explosion of research around this topic and there are a lot of new software tools to support it.  Amazon has its new Kinesis system,  Google has moved from MapReduce to Cloud DataFlow, IBM has Streaming Analytics on their Bluemix platform and Microsoft has released Azure Stream Analytics.  Dozens of other companies that use streaming data analytics to make their business work have contributed to the growing collection of open-source tools.   For example, LinkedIn has contributed the original versions of Apache Kafka and Apache Samza, and Yahoo contributed Storm.  And there are some amazing start-ups that are building and supporting important stream analytics tools, such as Flink from Data-Artisans. Related university research includes systems like Neptune from Colorado State.  I will post more about all of these tools later, but in this article I will describe my experience with the Microsoft Event Hub, Stream Analytics and AzureML.   What I will show below is the following:

  1. It is relatively easy to build a streaming analytics pipeline with these tools.
  2. It works, but the performance behavior of the system is a bit uneven, with long initial latencies and a serious bottleneck that is probably the result of the small size of my experiment.
  3. AzureML services scale nicely.
  4. Finally, there are some interesting ways in which blocking events together can greatly improve performance.

In a previous post I described how to use AzureML to create a version of a science document analyzer that I originally put together using Python Scikit-Learn, Docker and Mesosphere.  That little project is described in another post.   The streaming challenge here is very simple.   We have data that comes from RSS feeds concerning the publication of scientific papers.  The machine learning part is to use the abstract of a paper to automatically classify the paper into scientific categories.  At the top level these categories are “Physics”, “math”, “compsci”, “biology” and “finance”.    Big streaming data challenges that you find in industry and science involve the analysis of events that may be large and arrive at the rate of millions per second.   While there are a lot of scientists writing lots of papers, they don’t write papers that fast.   The science RSS feeds I pull from generate about 100 articles per day.  That is a mere trickle.  So to really push throughput experiments I grabbed several thousand of these records and wrote a simple server that would push them as fast as possible to the analysis service.
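
To make the data source concrete, here is a minimal sketch, not the harvesting code used for the experiments (that is on GitHub), of pulling abstracts from ArXiv RSS feeds with the feedparser package; the feed URLs and the mapping to class labels are illustrative assumptions.

import feedparser

# Map top-level class labels to a couple of ArXiv RSS feeds (illustrative URLs).
feeds = {'Physics': 'http://export.arxiv.org/rss/physics',
         'compsci': 'http://export.arxiv.org/rss/cs'}

records = []
for label, url in feeds.items():
    for entry in feedparser.parse(url).entries:
        records.append({'class': label,
                        'title': entry.title,
                        'doc': entry.summary})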

In this new version described here I want to see what sort of performance I could get from AzureML and Azure Stream Analytics.   To accomplish this goal I set up the pipeline shown in Figure 1.

[image: system-diagram-eventhub]

Figure 1.   Stream event pipeline

As shown in the diagram, events (document abstracts) are initially pushed into the Event Hub, which is configured as the source for the stream analytics engine.  The analytics engine does some initial processing of the data and then pushes the events to a message queue in the Azure Message Bus.  The events are then pulled from the queue by a collection of simple microservices, which call the AzureML classifier and then push the result of the call into the Azure Table service.

In the following paragraphs I will describe how each stage of this pipeline was configured and I will conclude with some rather surprising throughput results.

Pushing events into the Event Hub.

To get the events into the Event Hub we use a modified version of the solution posted by Olaf Loogman.  Loogman’s post provides an excellent discussion of how to get an event stream from the Arduino Yun to the Azure Event Hub using a bit of C and Python code.  Unfortunately the Python code does not work with the most current release of the Python Azure SDK[1], but it works fine with the old one. In our case all we needed to do was modify Loogman’s code to read our ArXiv RSS feed data and convert it into simple JSON objects of the form

{ "doc": "the body of the paper abstract",
  "class": "one of the classes: Physics, math, compsci, bio, finance",
  "title": "the title of the paper"
}

The Python JSON encoder is very picky so it is necessary to clean a lot of special characters out of the abstracts and titles.  Once that is done it is easy to push the stream of documents to the Event Hub[2].
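
A minimal sketch of that cleaning and encoding step is shown below; the send_to_eventhub call is a stand-in for the (modified) Loogman publisher code, not a real SDK function, and the cleaning strategy is just one illustrative choice.

import json
import string

PRINTABLE = set(string.printable)

def clean(text):
    # drop the characters that upset the JSON encoder and downstream services
    return ''.join(c for c in text if c in PRINTABLE)

def publish(record, send_to_eventhub):
    # record carries the 'doc', 'class' and 'title' fields described above;
    # send_to_eventhub is whatever publisher function the Loogman-style code provides
    event = {'doc': clean(record['doc']),
             'class': record['class'],
             'title': clean(record['title'])}
    send_to_eventhub(json.dumps(event))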

According to Microsoft, the Event Hub is capable of consuming many millions of events per second.  My simple Python publisher was only able to push about 4 events per second to the hub, but by running four publisher instances I was up to 16 events per second.   It was clear that the Event Hub would scale to whatever rate I needed.

The Stream Analytics Engine

The Azure Stream Analytics engine, a productized version of the Trill research project, is a very powerful engine for doing on-line stream processing.   There are three things you must do to configure the analytics engine for your application.    First, you must tell it where to get its input.   There are three basic choices:  the Event Hub,  blob storage and the IoT Hub.   I configured it to pull from my Event Hub channel.   The second task is to tell the engine where to put the output.  Here you have more choices: the output can go to a message queue, a message topic, blob or table storage, or a SQL database.  I directed it to a message queue.  The final thing you must do is define the query processing step you wish the engine to perform.   For the performance tests described here I use the most basic query, which simply passes the data from the event hub directly to the message queue.   The T-SQL code for this is

SELECT 
     *
INTO
     eventtoqueue
FROM
     streampuller

where eventtoqueue is the alias for my message bus queue endpoint and streampuller is the alias for my event hub input endpoint.  The original JSON object that went into the event hub emerges with additional top-level fields, which include the time stamps of the event entry and exit, as shown below.

{ "doc": "body of the abstract",
  "class": "Physics",
  "title": "title of the paper",
  "EventProcessedUtcTime": "2015-12-05T20:27:07.9101726Z",
  "PartitionId": 10,
  "EventEnqueuedUtcTime": "2015-12-05T20:27:07.7660000Z"
}

The real power of the Analytics Engine is not being used in our example but we could have done some interesting things with it.   For example, if we wanted to focus our analysis only on the Biology documents in the stream, that would only require a trivial change to the query.

SELECT
     *
INTO
     eventtoqueue
FROM
     streampuller
WHERE
     class = 'bio'

Or we could create a tumbling window and group selected events together for processing as a block.  (I will return to this later.) To learn more about the Azure Stream Analytics engine and its T-SQL dialect, a good tutorial is the real-time fraud detection example.

Pulling events from the queue and invoking the AzureML service

To get the document events from the message queue so that we can push them to the AzureML document classifier, we need to write a small microservice.  The code is relatively straightforward, but a bit messy: it extracts the JSON object from the text received from the message queue and converts it to the list form required by the AzureML service.   The response is then encoded in a form that can be stored in the table.    All of this must be wrapped in the appropriate level of error handling.   The main loop is shown below and the full code is provided on GitHub.

def processevents(table_service, hostname, bus_service, url, api_key):
    while True:
        try:
            msg = bus_service.receive_queue_message('tasks', peek_lock=False)
            t = msg.body
            #the message will be text containing a string with a json object
            #if no json object it is an error.  Look for the start of the object
            start = t.find("{")
            if start > 0:
                t = t[start:]
                jt = json.loads(t)
                title = jt["title"].encode("ascii", "ignore")
                doc = jt["doc"].encode("ascii", "ignore")
                tclass = jt["class"].encode("ascii", "ignore")
                evtime = jt["EventEnqueuedUtcTime"].encode("ascii", "ignore")
                datalist = [tclass, doc, title]
                #send the datalist object to the AzureML classifier
                try:
                    x = sendrequest(datalist, url, api_key)
                    #returned value is the best guess and
                    #2nd best guess for the class
                    best = x[0][1]
                    second = x[0][2]
                    #save the result in an Azure Table using the hash
                    #of the title as the rowkey
                    #and the hostname of this container as the table partition
                    rk = hash(title)
                    timstamp = str(int(time.time()) % 10000000)
                    item = {'PartitionKey': hostname, 'RowKey': str(rk),
                            'class': tclass, 'bestguess': best, 'secondguess': second,
                            'title': title, 'timestamp': timstamp, 'enqueued': evtime}
                    table_service.insert_entity('scimlevents', item)
                except:
                    print "failed azureml call or table service error"
            else:
                print "invalid message from the bus"
        except:
            print "message queue error"
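
The sendrequest function called above is in the GitHub code.  Purely as an illustration, a sketch of what such a call can look like against the classic AzureML request-response API is shown here; the column names and the response path ("output1", "value", "Values") are assumptions that would have to match the actual schema of the deployed service.

import json
import urllib2

def sendrequest(datalist, url, api_key):
    # datalist is [tclass, doc, title]; wrap it in the AzureML request-response
    # JSON envelope.  The column names here are illustrative assumptions.
    data = {"Inputs": {"input1": {"ColumnNames": ["class", "doc", "title"],
                                  "Values": [datalist]}},
            "GlobalParameters": {}}
    headers = {'Content-Type': 'application/json',
               'Authorization': 'Bearer ' + api_key}
    req = urllib2.Request(url, json.dumps(data), headers)
    response = urllib2.urlopen(req)
    result = json.loads(response.read())
    # pull the list of output rows out of the response envelope
    return result["Results"]["output1"]["value"]["Values"]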

Performance results

Now that we have this pipeline the most interesting experiment is to see how well it will perform when we push a big stream of messages to it.

To push the throughput of the entire pipeline as high as possible we wanted as many instances of the microservice event puller as possible.   So I deployed a Mesos Cluster with 8 dual core worker nodes and one manager node using the template provided  here.   To maximize parallelism for the AzureML service, three additional endpoints were created.   The microservices were assigned one of the four AzureML service endpoints as evenly as possible.

We will measure the throughput in seconds per event or, more accurately, the average interval between event arrivals over a sliding window of 20 events.     The chart below shows the seconds per event as the sliding window proceeds from the start of the stream to the end.   The results were a bit surprising.   The blue line represents one instance of the microservice and one input stream.   As you can see, there is a very slow startup and the performance levels out at about 1 second per event.   The orange line is the performance with 12 microservice instances and four concurrent input streams.  Again the startup was slow, but the performance leveled off at about 4 events per second.

[image: stream2]

Figure 2.   Performance in Seconds/Event with a sliding window of 20 events.   Event number on the x-axis.
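
As an aside, the quantity plotted in Figure 2 is easy to compute from a list of event completion timestamps.  This is a minimal sketch of that bookkeeping, not the measurement code used for the experiments.

def sliding_window_intervals(arrival_times, window=20):
    # arrival_times: event completion timestamps (in seconds), in processing order.
    # Returns, for each position, the average interval between arrivals over the
    # previous `window` events: the "seconds per event" plotted in Figure 2.
    rates = []
    for i in range(window, len(arrival_times)):
        span = arrival_times[i] - arrival_times[i - window]
        rates.append(span / float(window))
    return rates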

This raises two questions.

  1. Why the slow startup?
  2. Why the low performance in the 12-instance case? Where is the bottleneck in the pipeline? Is it the Azure Stream Analytics service, the message bus or the AzureML service?

Regarding the slow startup, I believe that the performance of the Azure services is scaled on demand.  In other words, as a stream arrives at the event hub, resources are allocated to try to match the processing rate to the event arrival rate.   For the first events the latency of the system is very large (it can be tens of seconds), but as the pool of unprocessed events grows in size, the latency between event processing drops very fast.   However, as the pool of unprocessed events shrinks, resources are withdrawn and the latency goes back up again.  (We pushed all the events into the event hub in the first minute of processing, so after that point the number in the hub declines.  You can see the effect of this in the tail of the orange curve in Figure 2.)  Of course, belief is not science; it is conjecture.

Now to find the bottleneck that is limiting performance.   It is clear that some resources are not scaling to a level that will allow us to achieve more than 4 events/second with the current configuration.   So I tried two experiments.

  1. Run the data stream through the event hub, message bus to the microservices but do not call the AzureML service.   Compare that to the version where the AzureML service is called.
  2. Run the data stream directly into the message bus and do not call the AzureML service.  Compare that one to the others.

The table below in Figure 3 refers to experiment 1.   As you can see, the cost of the AzureML service call is lost in the noise.   For experiment 2 we see that the performance of the message bus alone was slightly worse than that of the message bus plus the Event Hub and stream analytics.  However, I expect these slight differences are not statistically significant.

[image: stream3]

Figure 3.  Average events per second, comparing: one microservice instance and one input stream; 8 microservices and 4 input streams; 8 microservices and 4 input streams with no call to AzureML; and 8 microservices with 4 input streams, bypassing the Event Hub and stream analytics.

The bottom line is that the message bus is the bottleneck in the performance of our pipeline.  If we want to ask how well the AzureML service scales independent of the message bus, we can replace the bus with the RabbitMQ AMQP server we have used in previous experiments.     In this case we see a very different result: our AzureML service demonstrates nearly linear speed-up in terms of events per second.    The test used an increasing number of microservices (1, 2, 4, 8, 12, 16 and 20) and four endpoints to the AzureML service.   As can be seen in Figure 4, the performance improvement lessens when the number of microservice instances goes beyond 16.  This is not surprising, as there are only 8 servers running all 20 instances and there will be contention over shared resources such as the network interface.

[image: stream4]

Figure 4.   AzureML events per second using multiple microservice instances pulling the event stream from a RabbitMQ server instance running on a Linux server on Azure.
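
Swapping RabbitMQ in for the Azure message bus only changes the receive side of the microservice.  A minimal sketch of that receive loop, assuming a pika 0.x-era client (newer pika releases rename no_ack to auto_ack) and a queue named 'tasks' on a hypothetical broker host, might look like this.

import json
import pika

# Connect to the RabbitMQ broker; the host name is an assumption for this sketch.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='rabbit-host'))
channel = connection.channel()
channel.queue_declare(queue='tasks')

while True:
    # Poll the queue; basic_get returns (None, None, None) when the queue is empty.
    method, properties, body = channel.basic_get(queue='tasks', no_ack=True)
    if body is None:
        continue
    jt = json.loads(body)
    datalist = [jt["class"], jt["doc"], jt["title"]]
    # ... call sendrequest(datalist, url, api_key) and store the result as before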

Takeaway

It was relatively easy to build a complete event analytics pipeline using the Azure Event Hub and Azure Stream Analytics, together with a set of simple microservices that pull events from the message bus and invoke an AzureML service.   Tests of the scalability of the pipeline demonstrated that performance is limited by the performance of the Azure message bus.    The AzureML service scaled very well when tested separately.

The Azure services seem to deliver performance based on demand: the greater the number of unprocessed events in the Event Hub, the greater the amount of resources that are provisioned to handle the load.  The result is a slow start but a steady improvement in response until a steady state is reached.   When the unprocessed event pool starts to diminish, the resources seem to be withdrawn and the processing latency grows.      It is possible that the limits I experienced arose because the volume of the event stream I generated was not great enough to warrant the resources needed to push the performance beyond 4 events per second.   It will require more experimentation to test that hypothesis.

It should also be noted that there are many other ways to scale these systems: for example, more message bus queues and multiple instances of the AzureML service.    The Event Hub is designed to ingest millions of events per second and the stream analytics engine should be able to keep up with that.  A good practice may be to use the Stream Analytics engine to filter the events, or to group the events in a window into a single block, which is easier to handle on the service side.   The AzureML service has a block input mode that is capable of processing over 200 of our classification events per second, so it should be possible to scale the performance of the pipeline up a great deal.


[1] The Microsoft Azure Python folks seem to have changed the namespaces around for the Azure SDK.   I am sure it is fairly easy to sort this out, but I just used the old version.

[2] The source code is “run_sciml_to_event_hub.py” in GitHub for those interested in the little details.