Category Archives: Uncategorized

Interacting with a Running FuncX Instance

This short report is an addendum to the report “A Look at Parsl and FuncX: Two Excellent Parallel Scripting Tools for Clouds and Supercomputers.”  At the conclusion of that report it was stated that one of the missing pieces of our analysis was a description of how distributed FuncX function instances can communicate with other objects to handle remote interactions or to process streaming data.  Here is another way to state the problem.   FuncX gives you a way to package and launch a function on a remote resource, but you have no direct way to interact with that executing function until it returns or otherwise terminates.  We present three different scenarios that show how to do this.

Consider the following scenario.   You have a function that loads a deep learning vision model and you want it to run on a remote CUDA-capable device and then use the camera on that device to capture a picture and return the analysis and the image to you.   Even if the model has been previously cached on the remote device, loading it, moving the model to the CUDA engine and then analyzing the image can take over 10 seconds. However, once it has been run through the CUDA pipeline, subsequent images can be processed 100 times faster.    How can we interact with the function execution to tell it “Take another picture now” (taking advantage of 100 fold speed up) and return the result of the inference and the image without having to terminate the function’s execution?    

Figure 1. Interacting with a remote execution.

In Figure 1 above we have a small desktop client application that communicates with our function execution.  There is a button at the top to send a message to the waiting function to activate the camera and do the inference to recognize the image.   The first image was a picture of a chair and it took 21.0 seconds.  The next was a picture of a desk.  Third image was of a book filled office which was labeled “library”.   These last two images only took about 0.16 seconds.   The full transaction recorded on the client was

The original function was launched from a Jupyter notebook on the same desktop that is running the client app.  When the client app is closed it sends  a “quit”  message to the function which causes it to terminate normally and return to the jupyter notebook.   The key to making this work is the communication between the client app and the executing function.  The communication mechanism used here is asynchronous queueing as illustrated below.

Figure 2.   Client and running instance communicating asynchronously through queues

It is important to note that this is not a ‘request-response’ scenario.  It is fully asynchronous.  Both the client and the function instance run loops that monitor their input queue.  The client sends either a “quit” action or a “take picture” action depending on the user input. A separate thread continuously monitors the input stream of messages from the function instance.   Messages coming from the function instance are either informational, such as “model loaded” or “model cuda-ized” meaning that the model has been moved to the Nvidia Cuda cores.   The client displays these in the text box.  The other message that the instance sends are the inference classifications such as “library” followed by the time it took for the inference.   When it sees a “class” message it uses secure copy (scp) to pull the image and display it.

We implemented this scenario with two different queue systems: RabbitMQ with Pika and AWS Simple Queue Service.   RabbitMQ is installed as a separate service on a host visible to the network that has the client and the Jetson device.  Pika is the AMQP protocol python implementation that allows communication with the RabbitMQ service.   Message consumers based on Pika use a continuation-passing style in which the consumer provides a callback function that will be invoked when the queue has a message to deliver. Our FuncX function that is run on the Jetson device takes the form below.

When invoked by FuncX on the remote device it first establishes a connection and channel back to the RabbitMQ service.  It then goes about loading and initializing the Resnet18 computer vision model and moving it to the Cuda Nvidia cores.  It next registers the callback function and issues a “start_consuming” action.   At this point the function will wait for messages on the “command” queue.  When it receives a “quit” command it stops consuming and returns to the continuation point which is the return from the function back to the FuncX calling Jupyter notebook. 

The solution using AWS Simple Queue Service is conceptually easier to grasp.  SQS allows us to read a message from a queue if there is a message there.   If not, the read operation waits for a prescribed interval of time and if nothing arrives in the queue, it will return.  The main body of the function is as follows,

The function goes into a loop that start with a receive-message on its input queue.  It asks for 1 message and the maximum wait time is 5 seconds.  If a message arrives in that interval it is either “quit” or “take picture”.   If it is “quit” it send a signal back to the client program signaling it to quit and it then return from the FuncX invocation.   

The source code for both solutions is in the dbgannon/pars-funcx github repository as funcx-interactive-camera-final.ipynb (and html).   The desktop client program are also there.   You need to supply an AWS account information to run the aws example.  To run the rabbitmq version you need to have an instance of rabbitmq running.  It doesn’t cost you anything to download and run it, but it is not a fun installation. 

Dealing with Streams

If we want to monitor a data stream from an instrument on the remote resource it is easy enough to write a function that will go into a continuous loop gathering that data, doing needed analysis and sending results to some remote database.   The problem is that we may need to stop the function, perhaps to replace it with a better version, but the FuncX client does not give us a way to do so.   There are several solutions to sending graceful termination signals to such a function and we discuss one below.   

Another challenge is designing a protocol that allows to or more remotely executing functions to exchange messages reliably without entering deadlock states. 

The scenarios we are interested in are

  1. A function generates a stream of messages that can be sent to zero or more listeners.   For example, the sending function may be drawing samples from an instrument.  If the sender is sending message it should not wait on a “blocking send” for a listener to show up because the instrument may be generating interesting values that will be missed.   Consequently, it may be best to just push the messages to some “device” that allows listeners to pick up the most recent values.   We will look at a sample where multiple senders are sending messages to 0 or more listeners and we will use a publish-subscribe model.  Listeners select a “channel” to subscribe to and they will receive only the messages that are sent on that channel.  You may have multiple listeners on a channel and listeners may come and go as needed.   Senders can send messages on any channel and the send operations are non-blocking.   This example uses ZMQ service for the messaging.  The routing device is a small 10-line program that runs on a server exposed to the network.
  • In the case above we use a routing device to channel messages from senders to listeners and if there is no listener on a channel, we just drop messages that come in on that channel.  In the second case we want to avoid dropping messages.  To do this we encapsulate collections of function invocations together with a queue service into a single “component”.  Messages sent to the component queue are processed in a first-in-first-out manner by one of the function instances.  In our example we consider the case of two “components”: a front-end that receives messages from our control program and a backend that receives messages from the front-end.  The front-end function processes the message and then save the result in an external table or it forwards the message to a back-end component that has multiple function instances to process the request.   This design allows messages that require more cpu intensive processing to be handled by a pool of back-end workers as shown below.

Pub-Sub Routing with ZMQ in FuncX

The first case above is illustrated by a  demo in notebook funcx-zmq-final.ipynb that shows how four funcx instances an communicate streams through a zmq pub-sub routing filter. The routing filter is a simple program that runs on a server with a “public” IP. In this case it is a small NVIDIA jetson device and the access is via the local area network at address and it is listening on port 5559 for the listeners to subscribe and on port 5560 for the publishers.  This is a standard ZMQ example and the core of the program is shown below.

context = zmq.Context(1)
# Socket facing clients
frontend = context.socket(zmq.SUB)
frontend.setsockopt(zmq.SUBSCRIBE, "")

  # Socket facing services
backend = context.socket(zmq.PUB)
zmq.device(zmq.FORWARDER, frontend, backend)

In the demo we have two listeners. One listener subscribes to messages on channel 5 and the other on channel 3. We have two “sender” functions that send a stream of messages to the router. One “Sally” runs on a small server and the other “Fred” is invoked from the notebook. The only difference is that Sally sends a message every 0.2 seconds and Fred sends messages twice as often. Both alternate messages between channels 3 and 5.  The code for Fred is below.

In this case it sends only 22 messages of the form


For x in the range 0 to 21 alternating between channels 3 and 5.   It then sends a “Stop” message on both channels.

The stop message causes all listeners on channels 3 and 5 to terminate.  The listeners are also quite simple.

The listener subscribes on the supplied topic channel and processes messages until the “Stop” message is received.  It is also easy to stop a specific listener if we provide the listener with a unique id so that it can stop when it sees a Stop message with that ID attached.  This is important for the case when a listener needs to be replaced by an upgraded instance without missing messages.  One starts the new version and then stop the old version with its specific kill signal.   The Jupyter notebook illustrates this by showing how the listeners receive the messages from the senders in interleaved order. 

Reliable Messaging Using a Component with Queue Model

To implement the solution in case 2 above we need a queue management system with the following properties

  1. It must allow FIFO queues to be easily created and destroyed.
  2. To ensure the solution remains deadlock free and termination guarantees it must be possible for a process to read from the head of the queue and, if there is nothing there the reader is released in a bounded amount of time. 
  3. The queues should be able to hold an unbounded number of messages.

Of course, 3 is impossible, so we satisfy ourselves with queues built into substantial storage backends.  The easiest way to do this is to use Azure storage queues or the AWS simple queue service SQS.  SQS is simple to use and it is very inexpensive.  (For the work on this project my expenses were far less than $1.) 

For the demo we have two components:

  1. A Front-end component that receives messages and processes them in FIFO order. If the message is “forward” it passes the message to a Back-end component. Otherwise if the message is not “Stop”, it processes message and stores the result in a table. The table we use is in Azure Storage Service because it is cheap and reliable.
  2. The Back-end component consists of one or more instances of a backend processor functions which pull messages from the input queue for that component. We can control throughput of the back-end component by increasing or decreasing the number of functions servicing the queue. When the back-end processing functions complete and execution they store the result in the queue.

The function we incorporate into the front end is as follows.

In this demo every message is a simple python dictionary with a field called “action” which tell the function what action to take.   

The details of the auxiliary function are in the full code in the notebook aws-sqs-example in the repository.   The back end function is similar

The following simple wrapper creates an instance of the components.   The parameters are:

  1. the base name of the component (like “Front” or “Back”)
  2. the name of the input queue for this component.
  3. the output name which could be another queue name or the table name.
  4. repl_factor: the number of function instances of the type of func_id to create.
  5. the end_point that will be the host service for the function instances.
  6. func_id the uuid of the function resulting from funcx registration as above.

This launcher creates repl_factor copies of our function instance.  Running this on Kubernetes launches one pod per instance with each listening to the input queue.   The return value is the list of funcx return values for each instance. 

Final thoughts.

The solutions above are somewhat ad hoc and the programming patterns are not new.  A possible improvement to FuncX would be to make a dedicated message queue service available.  This could be an extension of existing Globus functionality already being used in FuncX.       

A Look at Parsl and FuncX: Two Excellent Parallel Scripting Tools for Clouds and Supercomputers.

In 2019, Yadu N Babuji, Anna  Woodard, Zhuozhao  Li,  Daniel S. Katz, Ben  Clifford, Rohan  Kumar, Lukasz  Lacinski, Ryan Chard, Justin Michael Joseph Wozniak, Michael  Wilde and Kyle  Chard published a paper in HPDC ’19 entitled Parsl: Pervasive Parallel Programming in Python.  I have been looking forward to finding the time to dig into it and give it a try.  The time did arrive and, as I started to dig, I discovered some of this same group, along with Tyler Skluzacek, Anna Woodard, Ben Blaiszik and Ian Foster published  funcX: A Federated Function Serving Fabric for Science in HPDC ’20.  In the following paragraphs we look at both and show examples of FuncX running on Kubernetes on Azure and on a tiny Nvidia Jetson Nano.

An Overview of Parsl

Parsl is best thought of as a tool for constructing and managing distributed parallel scientific workflows. For example, suppose you have to do data processing on a collection of files in a remote Kubernetes cluster at the same time manage a large simulation that will run  in parallel on a supercomputer and finally channel the results to a visualization system.  As sketched in the diagram in Figure 1, you want the main thread of the workflow to be managed from a python Jupyter notebook session. Parsl can do this.

Figure 1.  Hypothetical parallel distributed workflow involving remote resources managed from a Jupyter session on a laptop.

The list of actual examples of scientific applications studied using Parsl is impressive and it is documented in their case studies page.   They include examples from chemistry, physics, cosmology, biology and materials science. 

Programming Parsl is based on the concept of futures.   This is an old idea in which function invocations returns immediately with an object that represents the “future value” that the function will compute.  The calling thread can go about other work while the function computation takes place in another thread of execution.   The calling thread can later wait for the function to complete and retrieve the result.  To illustrate this here is an example of a function that computes Pi

The decoration @python_app indicates that this function will return a future.    We can check to see if the computation is complete by calling done() on the future object.   When done() returns true we can get the result value with the result() function.

Parsl will allow functions returning futures to be composed into graphs that can be scheduled and executed in “dataflow” style.   For example if we have to additional functions F(x,y) and G(a,b) that return futures then the graph in Figure 2

Figure 2.   Parsl dataflow-style scheduling

will be scheduled and executed so that F and G are not invoked until the values of its arguments are available. 

Parsl also has a data provider class that facilitates access to remote files.  Parsl has a way to handle futures for file objects called dataFutures which is a mechanism to guarantee synchronization around file reads and writes from remote threads.

The true strength of Parsl is in how it separates the execution and runtime from the high-level language Python script.   Parsl is unique among parallel programming systems in that it allows a Parsl program to run and SCALE across laptops, shared memory multicore servers,  small HPC Clusters,  Kubernetes in the cloud, and supercomputers.   It accomplishes this by allowing to programmer to specify an Executor class to manage concurrency in the running of the application.   There are currently four executors: ThreadPool, HighTroughPut, WorkQueue, and ExtremeScale.   Each executor must have a Provider that provides the mechanism to connect to the local resource for launching tasks.   There is a simple LocalProvider for shared memory multiprocessors and a provider for Kubernetes in the cloud.   In addition, there are special providers for a host of supercomputers including

  • Argonne’s Theta and Cooley
  • ORNL’s Summit
  • Open Science Grid Condor Clusters
  • University of Chicago Midway Cluster
  • TACC’s Frontera
  • NERSC’s Cori
  • SDSC’s Comet
  • NCSA’s Blue Waters
  • NSCC’s Aspire 1

To illustrate executors,  we have allocated an ubuntu “data science” vm on Azure that is an 8 core (4 real cores) server.   We will run 100 instances of the pi program from above, but we will do this we different levels of concurrency. We would like to do this with maximum throughput so we will use the “HighThroughputExecutor” configured as

We will first run pi(10**6)  sequentially 100 times.   Next, we launch two instances of pi repeated 50 times.   Doing 4 instances concurrently for 25 repetitions is next.  Repeating this process for 5, 10 and 100 concurrent instances gives us the following

Compared to Dask

In many ways Parsl is like Python Dask (which we wrote about in a 2018 blog article.)   Dask is heavily integrated into the python stack.   Numpy and Pandas smoothly interoperate with Dask.  For the embarrassingly parallel bag-of-tasks applications Dask has a feature called dask.bag (db below).  We can compare Dask on the same 8 core Azure ubuntu machine that we used above.   We create a list of 100 copies of the value 10**6 and create a bag sequence from this.  We partition this bag into “nparts” partitions and invoke it in parallel as follows.

Running this with the same set of partitions as above we get the following.

Note that the best performance is achieved when there is one execution of pi(10*6) per partition. The graph below illustrates the relative performance of Parsl and Dask.

Figure 3.  Dask (orange) vs Parsl (blue) execution time for block sizes 1, 2, 4, 5,  10, 20,  100.

Without further tuning of the executor and provider for running Parsl on your multicore laptop I believe Dask is the best performer there.  (The Jupyter notebook that contains these tests is in the repo dbgannon/parsl-funcx (  But this is certainly not the whole story.  You can use your laptop to debug a Parsl script and then run it with a different executor on a massive cluster or supercomputer to achieve remarkable results.   In the Parsl HPDC ’19 paper, the authors provide ample evidence that Parsl outperforms every other alternative except perhaps a custom MPI program. 

FuncX – a function as a service fabric.

Serverless computing is one of the standards computing paradigms that cloud computing supports.  The AWS Lambda service was the first to introduce the idea of providing a way to have your function deployed and invoked without requiring you to deploy  VMs or other infrastructure.    In addition to Lambda, Azure has a serverless offering called Azure functions and Google has Cloud Functions and IBM supports Apache OpenWhisk.  One of the major shortcomings of these serverless FaaS products is that they are designed to support relatively light weight computations (small memory requirement and short-lived execution). 

FuncX is a FaaS fabric that is designed for science.  Technically, FuncX is not a serverless platform.   FuncX requires that the user acquire or deploy the physical resources needed.   To submit a function to the resource you need an instance of the FuncX client and an endpoint string which is the key to the host resource.   To create an instance of the FuncX client  and bind a function to it, you simply do the following.

Once we have registered the function with the FuncX client and when we have the FuncX endpoint we can run the function.

As shown above the run method returns a uuid for the result which can be passed to the client object to obtain the execution status and eventually the result.   What is not obvious from this example is where the host endpoint comes from and how does the client convey the task to the host for execution.

The FuncX paper describes the architecture in detail, so we will not dwell on it here.  In simple terms, requests from the client (fxc above) to run a function are sent to an AWS-based web service which puts the request into a task queue in Redis for the specified endpoint. A “forwarder process” in AWS monitors the redis queue then sends the request to the remote endpoint agent (the funcx-endpoint process running on the endpoint host). The endpoint process, called the funcx-manager distributes tasks to processes called funcx-workers that carries out the execution and returns the result.    The fact that this is all managed through a cloud-based web service is very impressive.  It is very fast and reliable.

FuncX on Nvidia Jetson Nano

The Nvidia Jetson Nano is a tiny, but powerful computer for embedded applications and AI. It has a Quad-core ARM Cortex-A57 MPCore processor, 128 NVIDIA CUDA® cores and 4GB memory.  Installing a FuncX end point on the Nano is easy.  Just follow the steps in the docs.  At the end of the installation, you will have the endpoint uuid. An important application of FuncX is to use it to probe an edge device and read instruments or use special features that the device provides.  Because the Nano has an Nvidia GPU (and my laptop doesn’t) why not ship computation to the nano?  Here is a trivial example.   KMeans clustering is a standard ML algorithm that clusters points in space into a given number of nearby sets.  KMeans also involves lots of vector operations so it is suitable for execution in a GPU.  We have a program that contains four functions:

  • random_init  which initializes a random partition of the points in set by giving each point a code.
  • update_centers calculates the “center of gravity” of each set of points.
  • Compute_codes uses the current set of center points to reclassify each point according to the new centers.
  • Cluster is the main function that uses iteration to over the set of centers and codes

For FuncX we will execute Cluster remotely from our laptop.   It is shown below.

There are two important points to notice here.  First is that we must import all the needed libraries inside the scope of the function because they may not be available in the worker execution environment on the remote host.  Second, we note that we need to import the three additional functions from which is stored as the file “/home/jetbot/”  on the remote machine.  The inputs to the cluster function consist of the array of data, the number partitions and a Boolean flag which signals to use the GPU or CPU to do the computation. 

Running the function with 400000 random points in 2-d space looking for 7 partitions goes as follows.

With results shown below.

Figure 4.  Jetbot device on the left and kmeans plot on the right.

Running it again with the GPU gives a similar result but the execution time is 3.6 seconds.  A scatter plot of the clustering is above.  The GPU yields a speed-up of 7.8 over the CPU. (This is low because the kmeans algorithms are mostly vector operations and not matrix-matrix operations.  Algorithms that have that property will yield speedup in the hundreds.) A significant shortcoming of the approach taken above is that we passed the input array of points to the function.  This limits the size we can handle to about 1 million single precision floats.  A more appropriate solution is to pass the location of the data to the remote computer and have it fetch the data.  An even better solution is to use the properties of this edge device to gather the data from its on-board instruments for processing. 

The Jenson board has a camera, so as an experiment, we decided to write a function which will capture an image from the camera and return it to display.  Unfortunately, this camera requires a complex initialization and only one process can initialize and own the camera at a time.   But FuncX allows multiple instances of a function to execute concurrently, so we need some way to mediate access to the camera. The solution was to create a simple miro-web service that runs continuously.  It initializes the camera and has a single method “hello”.   Invoking that method causes the service to grab an image and store it in a file.  The path to the file is returned to the FuncX function that can grab the file and return it to the caller. In this case the function is now very simple. The camera service is waiting on a port on the local host where the function is executing.  The local file is read and the image is returned. 

The result is the view from the camera of part of the office where it sits.

FuncX on Kubernetes

In the examples below we will demonstrate the use of FuncX with a small Kubernetes cluster.  More specifically we will deploy a deep learning (BERT-based) document classifier as a function and do  some simple performance analysis.  The appendix at the end of this document give the instructions for installing Kubernetes on the Docker distribution on your laptop and also on Azure.  The typical application of Kubernetes involves running Docker-style containers on each node of the cluster. We are going to create a special container that contains our BERT classifier model that will be loaded as the worker.   Our goal is to run as many instances of the classifier in parallel as possible. 

The BERT classifier

The classifier that we are going to use as our test case is described in a previous post.  The classifier takes as input a text abstract of a science paper posted on ArXiv and classifies it as being either Math, Physics, Computer Science, Biology or Finance.   The classifier was trained on a subset of the document abstracts.  In this case we start with a pretrained BERT model, so for classification we only have to fine-tune an extra layer.

Figure 5.  BERT modified with classifier layer.

Using the simpletransformers library, the training was done with two line of code:

The model is trained on a pandas dataframe of 4500 (abstract, classification) pairs.  We will test it on 2600 additional abstracts. 

The model along with the python code that will load data and do the classification was saved in a directory “classifier”.  To build the Docker container we will run on the cluster we need to start with a container with a good Python 3 implementation.  Next we have to install the torch and simpletransformers library and our classifier directory are loaded and we copy the program to the root level.  Next we need to install the funcx-endpoint code.  The complete Docker file is shown below.

If you want to experiment with this container it is in the Docker repo as dbgannon/classify.

The program, has two methods for doing inference:

  • classifys( list-of-abtract-texts ) which takes a list of 1 or more texts of the paper abstracts and returns a  predicted classification.
  • classify(list-of-abstract-IDs) with take a list of abstract IDs and returns a predicted classification and the actual classification which has been looked up by the abstract ID.

The function to send a list of abstract strings to the classifier is

If we let “s” be the string

We show that effective theories of matter that classically violate the null energy condition cannot be minimally coupled to Einstein gravity without being inconsistent with both string theory and black hole thermodynamics. We argue however that they could still be either non-minimally coupled or coupled to higher-curvature theories of gravity.’

and run this on the classifier through FuncX we get

which returns 2, the correct classification (Physics).  (The true and false values are results from the pending queries.)   To do the performance evaluation we will use the other function which allows us to send a list of document ids.  This makes us able to send longer lists (because FuncX has a limit on the size of messages).  The following is an example.

The reply gives the predicted classification and the actual classification.   Notice they agree except in position 6 which corresponds to document 890:

'For a paradigmatic model of chemotaxis, we analyze the effect how a nonzero affinity driving receptors out of equilibrium affects sensitivity. This affinity arises whenever changes in receptor activity involve ATP hydrolysis. The sensitivity integrated over a ligand concentration range is shown to be enhanced by the affinity, providing a measure of how much energy consumption improves sensing. With this integrated sensitivity we can establish an intriguing analogy between sensing with nonequilibrium receptors and kinetic proofreading: the increase in integrated sensitivity is equivalent to the decrease of the error in kinetic proofreading. The influence of the occupancy of the receptor on the phosphorylation and dephosphorylation reaction rates is shown to be crucial for the relation between integrated sensitivity and affinity. This influence can even lead to a regime where a nonzero affinity decreases the integrated sensitivity, which corresponds to anti-proofreading.

The classifier predicted 3 = Biology.  The correct classification (according to the authors who submitted it to ArXiv) is 2=Physics. ( I too would have gotten that one wrong.)

Now that we have our Kubernetes cluster, we can run many invocations of this function in parallel and see how it performs.   The cluster we are using has only five nodes, each with 2 cores, but only one copy of the container can fit on each node.  One of the beauties of FuncX is that when it gets to many request it automatically spawn a new worker (which are called Pods in Kubernetes).   Using the Kubernetes dashboard we can see what it looks like when we have the system loaded with parallel requests.

We have only 5 nodes, so each node has one copy of the classifier container running as funcx-…. One node is also running the endpoint server.   We should also note that when not receiving new requests the endpoint manager starts killing off un-needed workers and the number of deployed pods drops.  

A Performance Experiment.

If you divide a fixed number of independent tasks between P processor in parallel the the total time to execute them should drop by a factor of P.   Let’s check that with our classifier.  We consider our set of tasks to be 250 document abstracts.  We have created an array of document indexes called vals_stringVals_string[ n-1] contains 250/n of the documents for n in the range 1 to 10.  In the code below we launch p instances of our funcx_impl2 each working on vals_sting[ p – 1].  Then we wait for them all to finish.  We do this with p in the range of 1 to 10.  

in the first case one pod computes the entire set of 250 docs in 54 seconds. Next two pods working in parallel complete the task in 29 seconds. The optimal case occurs when 4 pods work in parallel on the 4 blocks.  After that, the 5 pods suffer scheduling delays trying to execute more tasks than 4. Recall that we only had 1 cpu per pod. While 5 pods are available, the endpoint is also running on one of them.

A second question is to look at the average time per inference achieved.  In each case we are asking the classifier to classify a set of documents.   Each classification requires a BERT inference, so what is the inference rate.  

Another factor is that every execution of the classify function must load the model. If the model load time is C and the rate of each classify operation is r, then for N classification the total time is

So the average for time for all N is

Hence for large N,  C/N is small and the inference rate is r.  If the load is divided among p processors in parallel, then the time is

Looking at the actual measured inference time from the experiments above we see the average time per inference in the graph below.

In a separate experiment we measured the load time C to be about 3 seconds.  Looking at the first item in the graph above we see that r is close to 0.2 for a single thread.   Using the formula above we can plot an approximate expected values (where we add a small scheduling penalty for tasks counts greater than our number of available servers as

The added scheduling penalty is just a guess. By plotting  this we get a graph that looks similar to the data.

Final Observations

Parsl and FuncX are excellent contributions to computational science.   The only parallel computing paradigm that they don’t explicitly cover is distributed parallel streams such as AWS Kinesis or apache Storm or Kafka. On the other hand, we experimented with using FuncX to launch functions which communicated with other functions using message brokers like AWS simple queue service and Azure  storage queues as well as ZeroMQ.  For example, you can use FuncX to launch a function that connects to an instrument and establishes  a stream of values that can be pumped to a queue or remote file system.  We didn’t include those examples here, because this is already too long.    However, we were convinced we could emulate the capability of Kafka and Storm stream parallelism with FuncX.   In fact, it seems FuncX uses ZeroMQ internally for some of its function.

I want to thank Ryan Chard, Zhuozhao Li and Ben Galewsky for guiding me through some rough spots in my understanding of FuncX deployment.   In the following appendix I have documented the steps I leaned while deploying FuncX on Azure and Docker.   All of the code described here will be in the repo dbgannon/parsl-funcx (  

In summary,  FuncX is Fun!

Appendix:  Getting Kubernetes up on Windows, Mac and Azure

If you are running the latest Windows10 or MacOS you can install docker and with it, Kubernetes.  The latest version on Windows10 will use the Linux subsystem as part of the installation and with that, you get a version of Kubernetes already installed.  If you go to the docker desktop control (click on the docker icon and in the control setting you will see the Kubernetes control.   You will see an “enable Kubernetes” button.  Startup Kubernetes.

You will also need Helm.  (To install helm on Windows you need to install chocolatey .) Helm is a package manager for  Kubernetes that is used by Funcx.   In a shell run

>choco install kubernetes-helm

On the mac you can install helm with

$brew install kubernetes-helm

Followed by

$ helm init

A good way to see what is going on inside a Kubernetes cluster is to run the Kubernetes dashboard.  First you must start a proxy server on Kubernetes with

>kubectl proxy

Starting to serve on

In another shell run

kubectl apply -f

To access the dashboard you will need a special token.   To get that run

kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')

On windows, the “grep” won’t work but token is displayed anyway as part of the output.  

Go to the link  on localhost: 8001 given by this  Kubernetes Dashboard and plug in the token.   You will see the important categories of services, pods and deployments.  

Now we are ready for FuncX!  Go to and download the file.   Store that in some place like your home directory C:\Users/You or /home/you.  (I am indebted to Ryan Chard, Zhuozhao Li and Ben Galewsky for walking me though the next few steps!!)

First you will  need a .funcx with a credentials subdirectory which contains the file funcx_sdk_tokens.json.

To get this do

>pip install funcx_endpoint

>funcx-endpoint configure

You will be asked to authenticate with Globus Auth . From the website:

We require authentication in order to associate endpoints with users and enforce authentication and access control on the endpoint. As part of this step we request access to your identity information (to retrieve your email address) and Globus Groups management. We use Groups information to facilitate sharing of functions and endpoints by checking the Group membership of a group associated with a function.

Once you have completed that, cd to the credentials directory and execute

>kubectl create secret generic funcx-sdk-tokens --from-file=funcx_sdk_tokens.json

and then pick a name for your endpoint and do

>helm repo add funcx

And then

>helm install yourEndPointname   ./funcx-dev/helm/funcx_endpoint

if you do

>kubectl get pods

you will see yourEndPoint.   Given its pod ID you can do

>kubectl get logs YourEndPointPodID

You will find the endpoint uuid in the logs.   Alternatively, you can go to the logs directly on the Kubernetes dashboard.  You are now ready.   You can try the example above to test it out.  To deploy a special container as the worker, such as the container created for the BERT classifier described above you need a special yaml file.  In that case the file is shown below.

Let’s call it Myvalues.yaml. The final helm deployment step is the command

>helm install -f Myvalues.yaml myep4 funcx-dev/helm/funcx_endpoint

You can now grab the endpoint uuid as described above and invoke functions that use the environment contained in the container.

Kubernetes on Azure

If you want to install it on an Azure Kubernetes cluster, you need to first create resource group and a container registry and an azure account.   To deploy the cluster, it is best to do this from the command line interface.   You can get the instructions for that here.  

To manage the azure container registry, go to here.  Suppose the container registry called “myContainers” and the resource group is called “parl_test”.    To create the cluster named “funcx-cluster”  in the Azure region “northcentralus” do the following.

>az aks create -g parl_test -n funcx-cluster --location northcentralus --attach-acr funcx --generate-ssh-keys --node-count 5 -node-vm-size Standard_D2s_v3

You can monitor the creation status on the portal.    Next the following steps get you endpoint setup for the cluster

 >az aks get-credentials --resource-group parl_test --name funcx-cluster
 >cd .\.funcx\credentials\
>kubectl create secret generic funcx-sdk-tokens --from-file=funcx_sdk_tokens.json
>helm repo add funcx
 >cd ..
 >cd ..
 >helm install -f Myvalues.yaml myep4 funcx-dev/helm/funcx_endpoint
 >kubectl get pods

Look for the name of the endpoint pod called myep4-endpoint- followed by long key.

Grab it and do

 >kubectl logs myep4-endpoint-6f54d9549-6hkmj

At this point you may have two clusters: one for the docker k8 cluster and one for Azure. 

>kubectl config get-clusters

Use “kubectl config current-context” to see which is currently responding to kubectl commands and use “kubectl config set-context” to change it.  When you are done you should delete the deployment for your endpoint.   It will persist as long as your cluster does and also automatically restart after crashes.

Building a Tiny Knowledge Graph with BERT and Graph Convolutions.


Knowledge graphs (KGs) have become an important tool for representing knowledge  and accelerating search tasks.   Formally, a knowledge graph is a graph database formed from entity triples of the form (subject, relation, object) where the subject and object are entity nodes in the graph and the relation defines the edges.  When combined with natural language understanding technology capable of generating these triples from user queries, a knowledge graph can be a fast supplement to the traditional web search methods employed by the search engines.    In this tutorial, we will show how to use Google’s Named Entity Recognition to build a tiny knowledge graph based on articles about scientific topics.  To search the KG  we will use BERT to build vectors from English queries and graph convolutions to optimize the search. The result is no match for the industrial strength KGs from the tech giants, but we hope it helps illustrate some core concepts.


We have explored the topic of KGs in previous articles on this blog.  In “A ‘Chatbot’ for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT.”, we postulated how a KG could be the basis for a smart digital assistant for science. If you search for Knowledge Graph on the web or in Wikipedia you will lean that the KG is the one introduced by Google in 2012 and it is simply known as “Knowledge Graph”.  In  fact, it is very large (over 70 billion nodes) and is consulted in a large fraction of searches.   Having the KG available means that a search can quickly surface many related items by looking at nearby nodes linked to the target of the search.   This is illustrated in Figure 1.  This  is the result of a Google search for “differential equation” which is displayed an information panel to the right of the search results.  

Figure 1.  Google information panel that appears on the right side of the page.  In this case the search was for “differential equation”.  (This image has been shortened a bit.)

The Google KG is extremely general, so it is not as good for all science queries, especially those that clash with popular culture.   For example, if you search for the term that describes the surface of a black hole,  an “event horizon” you get an image from the bad 1997 movie by that name.   Of course, there is also a search engine link to the Wikipedia page describing the real scientific thing.   Among the really giant KGs is the Facebook entity graph which is nicely described in “Under the Hood: The Entities Graph” by Eric Sun and Venky Iyer in 2013.  The entity graph has 100+ billion connections.  An excellent KG for science topics is the Microsoft Academic graph.  See “Microsoft Academic Graph: When experts are not enough” Wang 2019 which we will describe in more detain in the last section of this note.   In its earliest form the Google’s KG was based on another KG known as Freebase.  In 2014 Google began the process of shutting down Freebase and moving content to a KG associated with Wikipedia called Wikidata.  We will use Wikidata extensively below.

Wikidata was launched in 2012 with a grant from Allen Institute, Google and the Gordon and Betty Moore Foundation and it now has information that is used in 58.4% of all English Wikipedia articles.   Items in Wikidata each have an identifier (the letter Q and a number) and each item has a brief description and a list of alias names.  (For example,  the item for Earth (Q2) has alternative names: Blue Planet, Terra Mater, Terra, Planet Earth, Tellus, Sol III, Gaia, The world, Globe, The Blue Gem, and more.)  each item has a list of affiliated “statements” which are the “object-relation-object” triples that are the heart of the KG.   Relations are predicates and are identified with a P and a number.  For example, Earth is an “instance of” (P31) “inner planet” (Q3504248).  There are currently 68 million items in Wikidata and, like Wikipedia it can be edited by anyone.

In what follows we will show how to build a tiny knowledge graph for two narrow scientific topics and then using some simple deep learning techniques, we will illustrate how we can query to KG and get approximate answers. 

As mentioned above ideal way to construct a KG from text is to use NLP methods to extract triples from text.  In a very nice tutorial by Chris Thornton, this approach is used to extract information from financial news articles.   If the relations in the  object-relation-object are rich enough one may be able to more accurately answer questions about the data.   For example, if the triples are as shown in the graph in Figure 2  below

Figure 2.  Graph created from the triples “Mary attended Princeton” and “Princeton is located in New Jersey”

one may have a good chance at answering the question “Where has Mary lived?”.There has been a great deal of research on the challenge of building relations for KG. (for example see Sun,     Unfortunately, scientific technical documents are not as easy to parse into useful triples.  Consider this sentence.

The precise patterns prevalent during the Hangenberg Crisis are complicated by several factors, including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains.

Using the AllenAI’s NLP package we can pull out several triples including this one:

[ARG1: the precise patterns prevalent during the Hangenberg Crisis]

[V: complicated]

[ARG0: by several factors , including difficulties in stratigraphic correlation within and between marine and terrestrial settings and the overall paucity of plant remains]

But the nodes represented by ARG0 and ARG1 are not very concise and unlikely to fit into a graph with connections to other nodes and the verb ‘complicated’ is hard to use when hunting for facts. More sophisticated techniques are required to extract usable triples from documents than we can describe here. Our approach is to use a simpler type of relation in our triples.   

Building our Tiny Knowledge Graph

The approach we will take below is to consider scientific documents to be composed as blocks of sentences, such as paragraphs and we look at the named entities mentioned in each block. If a block has named entities, we create a node for that block called an Article node which is connected to a node for each named entity.  In some cases we can learn that a named entity is an instance of a entity class.   In other words, our triples are of the form

(Article  has named-entity)  or (named-entity instance-of  entity-class)

If entities, such as “Hangenberg Crisis”, occur in other blocks from the same paper or other papers we have an indirect connection between the articles. 

To illustrate what the graph looks like consider the following tiny text fragment.  We will use it to generate a graph with two article nodes.

The Theory of General Relativity demonstrates that Black Holes are hidden by an Event Horizon. Soon after publishing the special theory of relativity in 1905, Einstein started thinking about how to incorporate gravity into his new relativistic framework.

In 1907, beginning with a simple thought experiment involving an observer in free fall, he embarked on what would be an eight-year search for a relativistic theory of gravity. After numerous detours and false starts, his work culminated in the presentation to the Prussian Academy of Science in November 1915 of what are now known as the Einstein field equations, which form the core of Einstein’s famous Theory of General Relativity.

Each sentence is sent to the Google Named-Entity Recognition service with the following function.

The NER service responds with two types of entities.   Some are simply nouns or noun phrases and some are entities that Google NER recognizes as having Wikipedia entries.  In those cases, we also use the Wikipedia API to pull out the Wikidata Identifier that is the key to Wikidata.  In our example we see three entity node types in the graph.   The blue entity nodes are the ones that have Wikipedia entries.  The green entities are nodes that are noun phrases.  We discard returned entities consisting of a single noun, like “space”, because there are too many of them, but multiword phrases are more likely to suggest technical content that may appear in other documents.  These nodes are in green.  In the case of the Wikidata (blue) nodes, we can use the Wikidata identifier to find out if the entity is an instance of a class in Wikidata.  If so, these are retrieved and represented as gray entities.   The first pair of sentences are used to create an article node labeled ‘Art0’.   The second pair of sentences are used to form node ‘Art1’. 

Figure 3.  The rendering of our two-article graph using NetworkX built-in graphing capabilities.

In this case we see that Einstein is an instance of human and an event horizon is an instance of hypersurface (and not a movie).   The elegant way to look for information in Wikidata is to use the SPARQL query service.   The code to pull the instanceOf property (wdt:P31)  of an entity given its wikiID is

Unfortunately, we were unable to use this very often because of time-outs with the sparql server.   We had to resort to scraping the wikidata pages.  The data in the graph associated with each entity is reasonably large.   We have a simple utility function showEntity that will display it.   For example,

The full Jupyter notebook to construct this simple graph in figure 3 is called build-simple-graph.ipynb in the repository

We will build our tiny KG from 14 short documents which provide samples in the topics climate change, extinction, human caused extinction, relativity theory, black holes, quantum gravity and cosmology.  In other words, our tiny KG will have a tiny bit of expertise in two specialized: topics relativistic physics and climate driven extinction.   

English Language Queries, Node Embeddings and Graph Convolutions.

Our tiny KG graph was built with articles about climate change, so it should be able to consider queries like ‘The major cause of climate change is increased carbon dioxide levels.’ And respond with the appropriate related items.   The standard way to do this is to take our library of text articles stored in the KG and  build a list of sentences (or paragraphs) and then use a document embedding algorithm to map each one to a vector in RN for some large N so that semantically similar sentences are mapped to nearby vectors.    There are several standard algorithms to do sentence embedding.   One is Doc2Vec and many more have been derived from Transformers like BERT. As we shall see, it is important that we have one vector for each article node in our KG.  In the example illustrated in Figure 3, we used two sentences for each article. Doc2Vec is designed to encode articles of any size, so 2 is fine.   A newer and more accurate method is based on the BERT Transformer, but it is designed for single sentences but also works with multiple sentences.   To create the BERT sentence embedding mapping we need to first load the pretrained model.

The next step is to use the model to encode all of the sentences in our list.  Once that is done, we create a matrix mar where mar[i] contains the sentence embedding vector for the ith sentence normalized to unit length. 

Now if we have an arbitrary sentence Text and we want to see which sentences are closest to it we simply encode the Text, normalize it and compute the dot product with all the sentences.   To print the k  best fits starting with the best we invoke the following function.

For example,  if we ask a simple question:

Is there a sixth mass extinction?

We can invoke find_best to look for the parts of the KG that is the best fit.

Which is a reasonably good answer drawn from  our small selection of documents.   However there are many instances where it is not as good.   An important thing to note is that we have not used any properties of the graph structure in this computation. 

Using the Structure of the Graph:  A Convolutional Approach.

 In our previous tutorial “Deep Learning on Graphs” we looked at the graph convolutional network as a way to improve graph node embedding for classification.   We can do the same thing here to use the structure of the graph to augment the Bert embedding.   First, for each article node x  in the graph we collect all its immediate neighbors where we define immediate neighbor to mean those other article nodes linked to an entity node shared with x.   We then computed a weighted sum (using a parameter lambda in [0,1]) of the Bert embedding vectors for each neighbor with the Bert embedding of for x.   We normalize this new vector, and we have a new embedding matrix mar2.  

Figure 4.  Illustrating the convolution operation

Intuitively the new embedding captures more of the local properties of the graph. We create a new function find_best2 which we can use for question answering.   There is one problem.   In the original find_best function we convert the query text to a vector using the BERT model encoder.   We have no encoder for the convolved model. Using the original BERT encoder didn’t work… we tried.)  We decided to use properties of the Graph.  We asked the Google NER service to give us all the named entities in our question.  We then compare those entities to entities in the Graph.  We search for the  article node with the largest number of named entities that are also in our query. we can use the convolved mar2 vector for that node as a reasonable encoding for our query.   The problem comes when there is no clear winner in this search.   In that case, we use the article that find_best says is the best fit and use that article’s mar2 vector as our encoding.   Pseudo code  for our new find_best2 based on the convolution matrix mar2 is

To illustrate how the convolution changes the output consider the following cases.

The first two responses are excellent.    Now apply find_best2 we get the following.

In this case there were 18 article nodes which had named entities that matched the entity in the text: “carbon dioxide”.   In this case find_best2 just uses the first returned value from find_best.  After that it selected the next 3 best based on mar2.  The second is the same, but the last two are different are arguably a better fit than the result from find_best.

To see a case where find_best2 does not invoke find_best to start, consider the following.

Both fail on addressing the issue of “Who” (several people are mentioned that answer this question are in the documents used to construct our KG.).  However the convolutional version find_best2 addressed “solve” “field equations” better than find_best.

The process of generating  the  BERT embedding vectors mar[ ] and the results of the convolution transformation mar2[ ] below is illustrated in Figure 5 as a sequence of two operators.  

BertEncode: list(sentences1 .. N ) ->  RNx768           GraphConv : RNx768 -> RNx768

Figure 5.  Going from a list of N sentences to embedding vectors followed by graph convolution.  Additional convolution layers may be applied.

There is no reason to stop with one layer of graph convolutions.    To measure how this impacts the performance we set up a simple experiment.  This was done by computing a score for each invocation of the find_best function.   As can be seen from the results, some of the replies are correct and others are way off.   One way to  quantify  this is to see how far the responses are from the query.  Our graph was built from 14 documents which provide samples in the topics climate change, extinction, human caused extinction, relativity theory, black holes, quantum gravity and cosmology.   Allocating each of the 14 documents to an appropriate one of 6 topics, we can create have a very simple scoring function to compare the documents in the find_best response.  We say score(doc1, docN) = 1 if both doc1 and docN belong to documents associated with the same topic The score is 0 otherwise.  To arrive at a score for a single find_best invocation, we assume that the first response is likely the most accurate and we compute the score in relation to the remaining responses

(score(doc1,doc1) + score(doc1, doc2) + sore(doc1, doc3) + score(doc1, doc4))/4.

The choice of measuring four responses is arbitrary.   It is usually the case that responses 1 and 2 are good and 3 and 4 may be of lower quality. 

The six topics are  climate, extinction, human-related extinction, black holes, quantum gravity and cosmology. 

Notice that we are not measuring the semantic quality of the responses.  We only have a measure of the consistency of the responses.  .  We return to the topic of measuring the quality of response in the final section of the paper. The algorithm to get a  final score for find_best is to simply compute the score for each node in the graph.

We do the same for find_best2 for different layers of convolution.  As can be seen from the chart below one convolution does improve the performance.   The performance is peaked out at 3 convolutional layers.  

Recall that the convolution operator was defined by a parameter lambda in the range 0 to 1.   We also ran a grid search to find the best value of lambda for 1, 2 and 3 convolutional layers.  We found a value of 0.75 gave reasonable results.   (note that when lambda = 1 the layers degenerate to no layers.)

Plotting the Subgraph of Nodes Involved in a Query

It is an amusing exercise to look at the subgraph of our KG that is involved in the answer to a query.   It is relatively easy to generate the nodes that connect to the article nodes that are returned by our query function.  We built a simple function to do this.  Here is a sample invocation.

The resulting subgraph is shown below.

Figure 6.   Subgraph generated by the statement “’best-known cause of a mass extinction is an Asteroid  impact that killed off the Dinosaurs.”

The  subgraph  shown in figure 6 was created as follows.  We first generate an initial subgraph generated by using the article nodes returned from find_best2 or find_best and the entity nodes they are connected to.   We then compute a ‘closure’ of this subgraph by selecting all the graph nodes that are connected to nodes that are in the initial subgraph. 

Looking at the subgraph we can see interesting features.  An amusing exercise is to follow a path from one of the nodes in the graph to the other and to see what the connections tell us.   The results can form an interesting “story”.   We wrote a simple path following algorithm.

Invoking this with the path from node Art166 to Art188

gives the following.

Trying another path

Yields the following

There is another, perhaps more interesting reason to look at the subgraph.  This time we use the initial subgraph prior to the ‘closure’ operation.   If we look at the query

The results include the output from find_best2 which are:

You will notice only 2 responses seem to talk about dark energy.   Art186 is related, but Art52 is off topic.  Now, looking at the graph we see why.

Figure 7.  Subgraph for “What is dark energy”

There are three connected components, and one is clearly the “dark energy” component.   This suggests that a further refinement of the query processing could be to  filter out connected components that are “off topic”. 

We conducted a simple experiment.   In calls to find_best2(4, text)  we searched the ten best and eliminated the responses that were not in the same connected component as the first response.   Unfortunately this, sometimes resulted in  fewer than k responses, but the average score was now 83%. 


In this little tutorial we have illustrated how to build a simple knowledge graph based on a few scientific documents.    The graph was built by using Google’s Named Entity Recognition service to create simple entity nodes.   Article nodes were created from the sentence in the document.   We then used BERT sentence embeddings to enable a basic English language query capability to pull out relevant nodes in the graph.   Next we showed how we can optimize the BERT embedding by apply a graph convolutional transformation.  

The knowledge graph described here is obviously a toy.   The important questions are how well the ideas here scale and how accurate can this query answering system be when the graph is massive.  Years of research and countless hours of engineering have gone into the construction of the state-of-the-art KGs.   In the case of science, this is Google Scholar and Microsoft Academic.   Both of these KGs are built around whole documents as the basic node.   In the case of the Microsoft Academic graph,  there are about 250 million documents and only about 36% come from traditional Journals.  The rest come from a variety of sources discovered by Bing.   In addition to article nodes, other entities  in the graph are authors, affiliations, concepts (fields of study),  journals, conferences and venues.  (see “A Review of Microsoft Academic Services for Science of Science Studies”, Wang, et.  al.   for an excellent overview.)  They use a variety of techniques beyond simple NER including reinforcement learning to improve the quality of the graph entities and the connections.   This involves metrics like eigencentrality and statistical saliency to measure quality of the tuples and nodes.   They have a system MAKES that transforms user queries into queries for the KG.    

If you want to scale from 17 documents in your database to 1700000 you will need a better infrastructure than Python and NetworkX.  In terms of the basic graph build infrastructure there are some good choices.

  • Parallel Boost Graph Library.  Designed or graph processing on supercomputers using MPI communication. 
  • Pregel: A System for Large-Scale Graph Processing” developed by Google for their cloud.
  • GraphX: A Resilient Distributed Graph System on Spark.
  • AWS Neptune is fully managed, which means that database management tasks like hardware provisioning, software patching, setup, configuration, and backups are taken care of for you.  Building KG’s with Neptune is described here.  Gremlin and Sparql are the APIs.
  • Janus Graph is an opensource graph database that is supported on Google’s Bigtable.
  • Neo4J is a commercial product that is well supported, and it has a SQL like API.
  • Azure Cosmos is the Azure globally extreme-scale database that supports graph distributed across planet scale resources.

If we turn to the query processing challenge,  the approach we took in our toy KG, where document nodes were created from as few as a single sentence, it is obvious why the Microsoft KG and Google Academic focus on entire documents as a basic unit.  For the Microsoft graph a variety of techniques are used to extract “factoid” from documents that are part of concepts and taxonomy hierarchy in the Graph.  This is what the MAKES system searches to answer queries.    

It is absolutely unclear if  the convolutional operator applied to BERT sentence embedding described here would have any value when applied to the task of representing knowledge at the scale of MAG or Google Scholar.   It may be of value in more specialized domain specific KGs.    In the future we will experiment with increasing the scale of graph.

All of the code for the examples in this article is in the repository

Deep Learning on Graphs (a Tutorial)

This tutorial gives an overview of some of the basic work that has been done over the last five years on the application of deep learning techniques to data represented as graphs. Convolutional neural networks and transformers have been instrumental in the progress on computer vision and natural language understanding. Here we look at the generalizations of these methods to solving problems where the data is represented as a graph. We illustrate this with examples including predicting research topics by using the Microsoft co-author graph or the more heterogeneous ACM author-paper-venue citation graph. This later case is of interest because it allows us to discuss how these techniques can be applied to the massive heterogeneous Knowledge networks being developed and used by the search engines and smart, interactive digital assistants. Finally, we look at how knowledge is represented by families of graphs. The example we use here is from the Tox21 dataset of chemical compounds and their interaction with important biological pathways and targets.

The full tutorial is on our other site:

Predicting Tomorrows Temperature with RNNs and Gaussian Processes


This note provides a gentle introduction to streaming data regression and prediction using Recurrent Neural Networks and Gaussian processes.  We look at two examples.  In the first we look at daily high temperature for two years at three different, but nearby NOAA weather stations. In the second example we look at daily confirmed infection of the Coronavirus in several US states.  This study shows that if you have a predictable natural pattern that changes in predicable ways from season to season one can make reasonable predictions, but in the case of complex phenomenon such as the Covid19 pandemic, simple models  do not always work well.

 In a previous post we looked at anomaly detection in streaming data using sophisticated cloud based tools, but here we use simple “classic” ML tools that you can run on a laptop.  We expect the seasoned data scientist will not get much out of this article, but if the reader is not familiar with recurrent networks or Gaussian processes this may be a reasonable introduction.  At the very least, if you are interested in studying the Coronavirus infection data, we show you how to search it using Google’s BigQuery service.

Building a Very Simple LSTM Recurrent Neural Network

Recurrent Neural Networks were invented to capture the dynamic temporal behaviors in streams of data.  They provided some early success in areas related natural language understanding, but in that area they have been superseded by Transformers, which we discussed in an earlier article.   We will construct an extremely simple recurrent network and ask it to predict the temperature tomorrow given the temperature for the last few days.

We will train it on the average daily temperature over 2 years measured at three nearby weather stations in the Skagit valley in Washington state.   The data comes from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations.  The three stations are Bellingham Intl, Padilla Bay Reserve and Skagit Regional.  The data is available in Google’s BigQuery service named “bigquery-public-data.noaa_gsod.stations”. (We will show how to search BigQuery in the next section.)

Averaging the daily temperature for 6 years (2 years each for 3 stations) we get data like Figure 1 below.


                           Figure 1.  Average daily high temperature for 2 years at 3 stations.

RNNs work by having a “memory” state tensor that encodes the sequence of inputs that have been seen so far.  The input to the RNN is a word or signal, along with the state of the system based on words or signals seen so far; the output is a predicted value and a new state of the system, as shown in figure 2 below.


Figure 2.  Basic Recurrent Neural network with input stream x and output stream h

Many variations of the basic RNN exist. One challenge for RNNs is ensuring that the state tensors retain enough long-term memory of the sequence so that patterns are remembered. Several approaches have been used for this purpose.  One popular method is the Long-Short Term Memory (LSTM) version that is defined by the following equations, where the input sequence is x, the output is h and the state vector is the pair [c, h].


Where sigma is the sigmoid function and W is the leaned tensor.   We are not going to use these equations explicitly because PyTorch has a built-in version that we will use.  All we need to do is provide the dimension of the input (which is a sequence of scalar values, so that is 1) and the dimension of the state vector c_h is a tuple of tensors of size (arbitrarily chosen here to be) 100. We also have linear layer that will map the final value of h down to the one-dimensional output.  Our simple LSTM neural network is


It is important to understand the role of the input x as a sequence.   We want to train this network so that if we set x to be a sequence of consecutive daily high temperatures [t1, t2, … tn] then the forward method will return [tn+1].  To accomplish this training task we build a training set by scanning through the weather record andcreating tuples of sequences of length tw and the next-day temperature as a training label. This is accomplished with this function


From this we create a list of such tuples called train_inout_seq and the training loop takes the form


The complete details are in the notebook lstm-for-stream-final  in the Github repository.  This was trained on the average year and one the six yearly record.   The results are shown below.  The first two are the years that were training cases.  The original data is printed in blue and the predicted data is printed in orange.   As you can see virtually no blue shows.  The network has memorized the training data almost perfectly with an average daily error that is less than 0.23 degrees Fahrenheit.



Figure 2.  The results for the average of all the data (Figure 1 above) and for one of the individual stations. The LSTM network has memorized the training data.

In Figure 3 below we show the results when applied to two of the weather records for two of the other stations.   In these cases, the results are not very impressive.  In both cases the error average error was over 3.5 degrees and it was greater than 10 degrees for more than a dozen days.  However, the predictions for one day ahead did track the general trends.   It looks like it was able to predict todays temperature better than tomorrows.



Figure 3.   Predicting tomorrows temperature on two other station records.

Doing Regression and Prediction with Gaussian Processes

Before we define Gaussian Processes let us point to Christopher Bishop’s amazing book “Pattern Recognition and Machine Learning” (Springer 2006) for a complete treatment of the subject.  We will only provide a superficial introduction here.   For on-line resources there is the excellent blog on the subject by Peter Roelants.   We will use many of the code bits from that blog in what follows.  Another fun on-line source for learning about Gaussian Processes is the blog A Visual Exploration of Gaussian Processes by Görtler, Kehlbeck and Deussen.

In the simplest terms, a Gaussian Process is a statistical distribution of functions with some special properties.   In our case, the functions will represent the time evolution of stochastic processes. For example, the temperature at some location as a function of time,  or the number of daily infections of a virus in a community, or the random walk of a particle suspended in a fluid.

The distribution that defines a Gaussian Process are characterized by a mean function u(x) and a covariance function k (x, x).   A function f(x) drawn from this distribution, which is written


has the property that if when we pick a finite set of time points  X = { x1 … xn } which we view as n random variable, the values y=f(X) are normally distributed in a multivariate Gaussian distribution with mean u(X) and a covariance matrix by a kernel function k(X, X).  Written another way,


Given the covariance kernel function k() and mean function u() we can use this multivariant distribution to visualize what functions drawn from the Gaussian distribution look  like.   Let us pick 300 points on the interval [0,2] and a specific kernel (which we will describe later) and a mean function with constant value 1.   The following  numpy function will allow us to draw 10 sample functions


As shown in Figure 4 below they appear to be like random walks but they also appear to be not only continuous be also smooth curves.  That is because nearby points on the x axis correspond to highly correlated random variables due to the choice of k().  If we had set Σ to be the identity matrix the variables at neighboring points would be independent random variables and the path would look like noise.  (We will use that fact below.)


Figure 4. 10 different functions drawn from the Gaussian process.

Now for the interesting part.  What if we have some prior knowledge of values of y for a sample of x points?   We can then ask what is then the  distribution of the  functions?

View the n points on the time axis as n random variables.   Partition them into two sets X1 and X2 where we are going to suppose we have values Y1 for the X1 variables.  We can then ask for  the posterior distribution p(Y2 | Y1 , X1 , X2 ).  Reordering the variables so that X1 and X2  are contiguous the equation takes the form




One can prove that our condition probability distribution p(Y2 | Y1 , X1 , X2) is also a multivariate normal distribution described by the formulas


The proof of this is non-trivial.  See this post for details.   The good news here is we can calculate this if we know the prior kernel function k() and mean m().  Picking these function is a bit of an art.  The usual way to do this is to pick k() so that m(x) = 0 so that u2 and u1 in the above are 0.   Picking the kernel is often done by forming it as a linear combination of well-known standard kernel function and then formulating a hyper-parameter optimization problem to select the best combination.

To illustrate this, we can return to the weather station data.   We have two years of data from three nearby stations.    We note two properties of the data we must exploit: it is noisy and approximately periodic with a period of 365 days.   We will not bother with the optimization and rather take a straightforward linear combination of three standard kernels.


The first of these is the exponential quadratic and it is a very good, default kernel.   The second is the white noise kernel where the parameter sigma gives us the standard distribution of the noise we see in the data and the third is the periodic kernel which if we map our 365 days onto to the unit interval we can set p = 1.   Our kernel of choice (chosen without optimization, but because it seems to work o.k.) is


Where for the first two terms we have set sigma to one and we pick the sigma for the noise term to best fit the data at hand.   The figure below illustrates the result of using the average of the six instrument years as the raw (prior) data.   Then we select 46 points in the first 230 days (spaced 5 days apart) as our X1 days.

In the figure the red dots are the points and the red line is u2|1 conditional mean function.  Three additional lines (blue, green and yellow) are sample function drawn from the posterior.  The pink zone is two sigma of standard deviation in the prediction.   We also calculated the error in terms of average difference between the mean prediction and the raw data.   For this example, that average error was 3.17 degrees Fahrenheit.  The mean function does a reasonable job of predicting the last 130 days of the year.


Figure 5.   Graph of the raw data,  the mean conditional u2|1 (red line), and three additional functions (blue, yellow and green) drawn from the posterior.

The full details of the computation are in the jupyter notebook “Gaussian-processes-temps-periodic”,  but the critical function is the one that computes p(Y2 | Y1 , X1 , X2) and it is shown below (it is taken from Roelants blog)


In this case we invoked it with kernel_function  as keq + kp.   Sigma_noise was 0.3.   The clever part of this code was the use of  a standard linear algebra solver to solve for Z in this equation


But because Sigma11 is symmetric and the transpose of Sigma12 is Sigma21 we have


Once you have that the rest of the computation is accomplished with the matrix multiply (@) operator.

In the notebook Gaussian-process-temps-periodic in the Github repository you can see the Gaussian processes for the six year samples.

The Coronavirus Data

Another interesting source of data comes from the daily confirmed cases of coronavirus infections in various states.   We shall see that the troubling recent growth rate is so large that it is very hard for our Gaussian process models to make predictions based on recent past samples.  However, we thought it may be of value to illustrate how to obtain this data and work with it.

The Covid-19 data is in the Google cloud uploaded from the New York times. To access this you must have a google cloud account which is free for simple first-time use.   We will run google’s bigquery to extract the data and we will run it through a client in a Jupyter  notebook.    You will need to install the bigquery libraries.   A good set of instructions are here.    To use Jupyter go here . You will need to add the json package containing you service account key to your environment variables and described here.  Finally install the local libraries with this command on your machine.

pip install –upgrade google-cloud-bigquery[pandas]

First load the bigquery library and create the client in a Jupyter notebook with the following


There are a number of covid19 data sets available on BigQuery.   The one we will use is the New York Times collection.   The following query will request the data for the state of Washington, load it into a Pandas dataframe and print it.



 In our notebook bigquery-Covid we have the code that will extract the number of cases per day so that we can fit the Gaussian process to that.    That data is stored in the array ar_wash.   We attempted to make predictions with a sample every 9 days until the last 10 day.  Because of the large range of the data we scale it down by a factor of 1000.   The result is shown below.   The function make_gaussian is the same one we used for the weather station data except that the kernel is only the exponential quadratic.


As can be seen the mean function (red line) capture features of the last 10 days reasonably well.   Looking at New York we see similar results, but the fit for the last few days is not as good.


Where we fail most spectacularly is for those states that experienced a wave of new cases on the first week of July.  Here is Florida.


Changing the prediction window to the last 3 days does a bit better.  But 3 days is not much of a prediction window.


However, it is clear that a much more complex process is going on in Florida than is captured by this simple Gaussian process model.   The approach presented here is not a proper infectious disease model such as those from Johns Hopkins and IHME and other universities.  Those models are far more sophisticated and take into account many factors including social and human behavior and living conditions as well as intervention strategies.


As was pointed out in the introduction, this is a very superficial look at the problem of predicting the behavior of streaming data.  We looked at two approaches; one that focuses on accurate prediction of the next event using neural networks and one that attempts to capture long range statistical behavior using Gaussian process models.  The neural net model was able to learn the temperature patterns in the training data very well, but for test data it was much less accurate with average error of about 3- or 4-degrees Fahrenheit per day.   (This is about as good as my local weather person).   On the other hand, the Gaussian process made very good long range (over 100 days) predictions with only a small number of sample points.  This was possible because the Gaussian process model works well for patterns that have reasonably predictable cycles such as the weather.   However, the Gaussian process failed to capture changes in the more complex scenario of a viral infection where the dynamics changes because of social and human behavior or by evolutionary means.

If the reader is interested in extracting the data from Google’s BigQuery,  we have included the detail here and in the notebooks in the repository


Notes on Deep Learning and Differential Equations.

Over the last two years  some very interesting research has emerged that illustrates a fascinating connection between Deep Neural Nets and differential equations.    There are two aspects of these discoveries that will be described here.  They are

  1. Many differential equations (linear, elliptical, non-linear and even stochastic PDEs) can be solved with the aid of deep neural networks.
  2. Many classic deep neural networks can be seen as approximations to differential equations and modern differential equation solvers can great simplify those neural networks.

The solution of PDE by neural networks described here is largely the excellent work of Karniadakis at Brown University and his collaborators on “Physics Informed Neural Networks” (PINNs).   This work has led to some impressive theory and also advances in applications such as uncertainty quantification of the models of subsurface flow at the Hanford nuclear site, one of the most contaminated sites in the western hemisphere.

While the work on PINNs shows us how to use neural networks to solve differential equations, the second discovery cited above tells us how modern differential equation solvers can simplify the architecture of many neural networks.  A team led by Chen , Rubanova , Bettencourt, Duvenaud at the University of Toronto, Vector Institute examined the design of some neural networks and noticed how their architecture resembled discretizations of certain differential equations.  This led them to define a hybrid of neural net and differential equation they call a “Neural Ordinary Differential Equation”. Neural ODEs  have several striking properties including excellent accuracy and greatly reduced memory requirements.

These works suggest that there exists a duality between differential equations and many deep neural networks that is worth trying to understand.  This paper will not be a deep mathematical analysis, but rather, try to provide some intuition and examples (with PyTorch code).  We first describe the solution of PDEs with neural networks and then look neural ODEs.


We now turn to the work on using neural networks to solve partial differential equations.  The examples we will study are from four papers.

  1. Raissi, Perdikaris, and Karniadakis, “Physics Informed Deep Learning (Part II): Data-driven Discovery of Nonlinear Partial Differential Equations”, Nov 2017, is the paper that introduces PINNS and demonstrates the concept by showing how to solve several “classical” PDEs.
  2. Yang, Zhang, and Karniadakis, “Physics-Informed Generative Adversarial Networks for Stochastic Differential Equations”, Nov 2018, addresses the problem of stochastic differential equations but uses a generative adversarial neural network.
  3. Yang, et. all “Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs”, oct 2019, is the large study that applies the GAN PINNs techniques to the large scale problem of  as uncertainty quantification of the models of subsurface flow at the Hanford nuclear site.
  4. Shin, Darbon and Karniadakis, “On the Convergence and Generalization of Physics Informed Neural Networks”, 2020, is the theoretical proof that the PINNs approach really works.

We will start with a very simple partial differ that is discussed in the Rassi paper.  Burgers’ Equation is an excellent example to study if you want to understand how shock waves can come about from relatively benign initial conditions.  The equation is


Were the domain of x is the interval [-1, 1] and the time variable t go from 0 to 1.   The initial, t=0, condition is u(0,x) = -sin(pi*x). and the boundary conditions are u(t,-1) = u(t,1)=0.  The parameter v is 0.01/pi.   (If we set v = 0 then this is the inviscid equation which does describe shock waves, but with v >0 the equation is called the viscus Burgers’ equations.  While not technically describing discontinuity shock as t goes forward in time, it is awfully close.    The figure below shows the evolution of the value of u(t,x) from the initial sine wave at t=0 to a near discontinuity at t=39.


Figure 1.  Samples of u(t, x) for x in [-1,1] at 40 points and t in 4 points.

The basic idea of a physics informed neural net is,  in this case, a network defining a map


This network, when trained, should satisfy both the boundary condition and the differential equation.   To satisfy the differential equation we must be able to differentiate the network as a function of x and t.  But we know how to do that! We compute the derivatives of a network symbolically when we do the back propagation for training.   Neural nets are non-linear functions with first derivatives.   In our case we also need the second derivatives which means that we cannot use ReLU as an activation function because it is piecewise linear and has second derivatives that are all zeros.   However Tanh is smooth with fine second derivatives so we will build our net function as follows.


Our goal is to train an instance of this network to minimize the functions


The function f in Torch is


Where we are using the grad function from torch.autograd  to compute derivatives.  The function flat(x) just returns the list [x[i] for i in range(x.shape[0)].

The unique thing here is that we are training the network to satisfy the differential equation and boundary conditions without data samples from the solution u.  When converged the network should give us the solution u.  This works because Burger’s equation, in this form, has been proven to have a unique solution,  Of course, this raises the question: when are a differential equation and boundary conditions sufficient to guarantee the existence and uniqueness of a solution?  Simple linear differential operators do, but not all differential equations.  In the case of PINNs, one can speculate that if our laws of nature are correct, then nature proves the solution exists. (Thanks to Wolfgang Gentzsch for making me realize this point needed to be made.)

To train the network we iterate over 200000  epochs  using two optimizers (one for the boundary and one for the function differential equation function f).   For each epoch we randomly draw a batch of boundary samples and a batch of samples drawn from the interior of the [0,1]x[-1,1] rectangle.   Each sample is a tuple consisting of a t-value, an x-value and a u boundary value or zero.  (Recall we are forcing f(t, x) to zero and bndry to zero or u(0,x).)  The main part of the code is below.  The full details are in the github repository.


A python library giving us an approximation of an exact solution is available on line that we can use to compare to our result. Figure 2 below shows the heatmap of the solution.


Figure 2.  The horizontal axis is time (t in [0,1]) sampled at 40 points and the vertical axis is x in the interval [-1,1] sampled at 40 points.  Dark colors are u values near 1 and white are u values near -1.

The mid horizontal line shows the evolution of the shock as it becomes a sharp transition as t goes from 0 to 1 on the right.

A better view can be seen from the evolution of the solution from the initial condition.  This is shown in Figures 3 and 4.  It is also interesting to note that there is a subtle difference between the neural net solution and the approximation to the exact solution.


Figure 3.  The x axis x in the interval [-1,1] at 40 points and the y axis is u(t,x).  Each line represents u(t, x) for a specific value of t.   The blue line is the initial condition t=0.  Subsequent lines show the evolution to the singularity.


Figure 4.   A 3D view showing the evolution of the sine wave on the left to the sharp shock on the right edge.

Nonlinear Boundary Value Problems

The classical elliptical PDE takes the form


where u is defined on a region in x-y space and known on a bounding edge.   A good physical example is a steel plate that is clamped and temperature controlled on perimeter and f represents heat applied to the surface.  u will represent the temperature when it reaches a steady state.

For purposes of illustration we can consider the one-dimensional case, but to make it interesting we will add some non-linear features.  In addition to the function f(x),  let us add another known function k(x) and consider


To show that the technique works, we will consider special cases where we know the exact solution.  We will let k(x) = x and u(x) = sin2(x).    Taking the derivative and applying a few trig identities we get f(x) as


Where we will look for the solution u on the interval [0, pi] subject to the boundary condition u(0) = u(1) = 0.  The second case will be similar, but the solution is much more dynamic: u(x) = sin2(2x)/8 with the same boundary conditions.   In this case the operator is


The network is like the one above but with only one input parameter


The differential operator that is the left side of the equation above is computed with the function Du(x) shown below.


Training the network is straightforward.  As with the Burgers example, we use a mean squared error lost function and gradient descent optimizer. We have an array vx of x-axis points from which batches of samples are drawn.   For each point x in each batch we also create a batch consisting of f(x).  There is only one batch for the boundary.   In this case, the boundary is only two points, so it is represented by one batch.


Running this for 3000 epochs with batch of size 10 will converge the network.   The result for the 1st case (u = sin2(x) ) is extremely close to the exact solution as shown in figure 5.   Both the true solution (a blue line) and the computed solution (an orange line) are plotted together.   They are hard to distinguish.   Below the solution we plot the differential equation f(x) which is what the networked was trained to learn and it is not surprising the fit is good there too.

The second case (u = sin2(2x)/8 ) is more dynamic.    The differential equation f(x) fit is excellent but the solution is shifted up (because the boundary condition was off on one end.


Figure 5.   Solution to the differential equation d/dx(x du/dx) = f(x)

Stochastic Differential Equations and Generative Adversarial Nets

If we consider the differential equation from the previous section


but now make the stipulation that the functions k(x) and f(x) are Gaussian processes rather than fixed functions, we now have a stochastic differential equation.   Papers 2 and 3 mentioned above show us how to create a generative adversarial network to solve for the Gaussian process that defines u.  they do not treat this as an initial value, boundary value problem as described above, but rather the authors assume we have a series of snapshots  ( k(xi), u(xi), f(xi) ) for i in 1, … n for some reasonably large n.  With these we will set up a GAN to find a representation of u.

Recall how a GAN works.  We have two networks, a generator, and a discriminator.  The generator is trained to transform a normal distribution into a distribution that fits your data.  The discriminator is trained to recognize your true data and distinguish it from the fake data from the generator.  This is illustrated in Figure 6 below.


Figure 6.  Basic GAN configuration.

It is relatively easy to build a GAN that can reproduce u from samples of x, k(x) and u(x).  Unfortunately, making a GAN that looks like u(x) does not mean it satisfies the differential equation.  Taking the simple case where k(x)=x, a GAN base based on the design above generated the result in Figure 7 below.  On the left is a plot of the distribution of the converging solution for random normally distributed samples X over the true solution in blue (see right half of figure 5).  However, when we apply the differential operator to the generated solution we see it (multi-colored dots) is nothing like our the true solution operator (blue).


Figure 7.  GAN from figure 6 distribution of normally distributed sample X and, on right differential operator applied to gen(X)

To solve this problem the GAN discriminator must evaluate both u(x) AND the result of the differential operator f(x) = Du(x, k(x), u) as show in Figure 8.


Figure 8.  Full GAN to map normal distribution to solution that also satisfies the differential equation.

The construction of the GAN generator is straight forward.  It takes inputs as batches of samples from a normal distribution and generates a batch of points in R2 that are eventually trained to represent samples from (x, u(x)).   As for our previous examples we use Tanh () for the activation function.


The discriminator takes triples of the form (xi, u(xi), f(xi)) from the samples.  (Note we are using k(x)=x, so there is no need for that argument.   We can use ReLU for the activation because we do not need second derivatives of the discriminator.


The torch version of the differential operator is


The discriminator wants inputs of the form


Which are provided by the samples (xi, u(xi), f(xi)) and from the generator in the form


The training algorithm alternates between optimizing  log(D(G(z))) for the generator and optimizing log(D(x)) + log(1 – D(G(z))) + a gradient penalty for the discriminator.   The gradient penalty is introduced by Yang, Zhang, and Karniadakis in paper 2.  The full code for this example is in the Github repository.  Figure 9 illustrates the convergence of the solution in for snapshots (A, B, C, D) with step D taken at 200000 epochs.  It is interesting to observe that image of the 1-D normal sample space in R2 takes to form of a path that gradually conforms to the desired distribution.  However at step D there is still unfocused resolution for x > 2.6.


Figure 9.   Four snapshots in time (A, B, C, D) of the convergence of the GAN.

We should also note that this is not a true stochastic PDE because we have taken our samples for the training from the exact solution and are not Gaussian sources, but the concepts and training are correct.

A much more heroic and scientifically interesting example is in paper 3.   The authors of this paper address the problem of modeling the subsurface flow at the Hanford Site in Washington State where the reactors to generate plutonium for the US atomic arsenal.   The 500 square mile site had nine nuclear reactors and five large plutonium processing complexes and produced 3 million US gallons of high-level radioactive waste stored within 177 storage tanks, an additional 25 million cubic feet of solid radioactive waste.  And over the years the tanks have been leaking.   Hanford is now nation’s largest environmental cleanup project.   The team involved with paper 3 is from Brown University, Lawrence Berkeley National Lab, Pacific Northwest National Lab, Nvidia, Julia Computing and MIT.  They look at a large 2-D version of the PDE discussed above.   In this case u(x,y) is hydraulic head of the subsurface flow, k(x,y) is depth-averaged hydraulic conductivity and f is the infiltration from the earth to the flow.   These quantities are measured by sensors at a large number of sites on the Hanford reservation.

Because k and u are both stochastic and are supported only by data at the sensor points the generator in their GAN must produce both k and u (actually  log(k) and u) as shown in Figure 10.


Figure 10.  GAN architecture from Yang, et. al, “Highly-scalable, physics-informed GANs for learning solutions of stochastic PDEs”, arXiv:1910.13444v1.

In order to tackle a problem of the size of the Hanford problem, they partitioned the domain into a hierarchy of subdomains  and had a separate discriminator for each subdomain.   The top levels (1 and 2) capture long range characteristics while the lower levels capture properties that correspond to short range interactions.   They parallelized the computing utilizing over 2700 GPUs on the ORNL Summit machine so that it maintained 1.2 exaflops.

Neural Ordinary Differential Equations

In the previous section we saw how neural networks can solve differential equations.   In this section we look at the other side of this coin: how can differential equation solvers simplify the design, accuracy, and memory footprint of neural nets.   Good papers and blogs include the following.

  1. Chen, Rubanova, Bettencourt, Duvenaud “Neural Ordinary Differential Equations
  2. Colyer, Neural Ordinary Differential Equations, in the Morning Paper, Jan 9, 2029, Colyer’s amazing blog about computer science research.
  3. Duvenaud, comments in Hacker News about the Chen, et. al. paper.
  4. Chen, “PyTorch Implementation of Differentiable ODE Solvers”.
  5. Rackauckas, “Mixing Differential Equations and Machine Learning” .
  6. He, et. al. “Deep Residual Learning for Image Recognition”.
  7. Gibson, “Neural networks as Ordinary Differential Equations
  8. Holländer, “Paper Summary: Neural Ordinary Differential Equations
  9. Surtsukov, “Neural Ordinary Differential Equations”

Chen et. al. in [1] and others have observed that the deep residual networks that made training very deep networks possible [6] had a form that looked like Euler’s method for solving a differential equation.  Residual networks use blocks of network layers where the input is transformed by a residual before sending it to the next layer by the simple equation


So that what we are training is the sequence or residual layers Ni  .  If you have a differential equation dy/dx = f(x) then Euler’s method is to compute y from a sequence of small steps of based on an approximation of dy/dx by


By analogy, our residual network now looks like this


With delta i is equal to 1.   If we abstract the sequence of networks into a single network that depends upon t as well as y, we can define a neural ODE to be of the form


Where theta represents the network parameters.

The interesting idea here is that if residual networks are basically Euler’s method applied then why not use a much more modern and accurate differential equation solver?  If we have an initial value y0 at  time t0, we can integrate forward to time tn to obtain


To illustrate how we can train a neural ordinary differential equation we look at a slightly modified version of the example provided by the authors. we will take our sample data from the solution of a simple ODE that generates spirals in 2-D given by


Where y is a 2-d row vector.   Given the starting point y0 = (2.0, 0.0) and evolving this forward 1000 steps the values are plotted below.


Figure 11.  Spiral Training Data

Our neural network is extremely simple.


You will notice that while the network takes a tuple (t, y) as input, the value t is unused.  (This is not normally the case.) The training algorithm is quite simple.  Recall that the function is a derivative, so to predict new values we must integrate forward in time.


The authors use Hinton’s Root Mean Square Propagation optimizer. And the function get_batch() returns a batch of 20 points on the spiral as starting points as batch_y0.  Batch_t is always the 10 time points between 0 and 0.225 and batch_y is a batch of 20 10-step paths along the spiral starting from the corresponding batch_y0 point.   For example, shown below is a sample batch.


Figure 12.  Sample training data batch

An incredibly important point is a property of the ODE solver we use “odeint”.


It comes from the author’s  torchdiffeq package as an adjoint integrator.  The actual integrator we want to use is scipy.integrate.odeint which is a wrapper for an old, but sophisticated method written in the 1980s and from ODEPACK.  But we need to be able to compute the symbolic derivative of the loss operation so we can do the backpropagation.   But we can’t do the symbolic integration back through the old ODEPACK solver.  To get around this problem the Chen and the team uses the integrator to solve an adjoint problem which, very cleverly, allows us to compute all the derivatives without trying to differentiate the solver.  The details of this step are described in the original paper and in the other references above.  We won’t go into it here.

It is fun to see the result of the training as the it evolves.   We captured the output every 10 iterations.


Figure 13.  Snapshots of training showing the trajectories of the solution and the correct values.

It is worth observing that, once trained our network, when given a point (x, y) in the plane returns a vector of the trajectory of the spiral path through that point.   In other words, it has learned a vector field which can be plotted as on the right in figure 13 and below.


Figure 14.  plot of the vector field generated by the network

Neural ODEs can be applied well beyond applications that resemble simple differential equations.  The paper illustrates that they can be applied to the vision tasks that resnet was invented to solve.  The paper show how it can be applied to build a MNIST hand written digit classifier but with a much smaller memory footprint (one layer  versus many layers in resnet).  There is a very nice implementation in Surtsukov [9].

The solution (detailed in Surtsukov’s notebook in his gitub repo) also contains all the details of the solution to the adjoint problem for computing the gradients of the loss.   The model to classify the MNIST images is almost identical to the classical Resnet solution.   A slightly modified version that combines parts of Surtsukov’s solution with Chen’s solution  is in our Github repo.

The input to the model is passed through an initial down-sampling followed by the residual blocks that compute the sequence


More specifically the Resnet version consists

  • A down sampling layer
  • 6 copies of a residual block layer
  • And a layer that does the final reduction to 10 scores.

The neural ODE  replaces the six residual layers Ni(yi ) with the continuous derivative  N(y,t).  The forward method of the NeuralODE invokes the adjoint solution to the integration by using the odeint() function described in the previous example.   You can experiment with this in the Jupyter notebook in the repo.  With one epoch of training the accuracy on the test set was 98.6%.  Two epoch put it over 99%.

One of the other applications of Neural ODEs is to time series data.  A challenge for many tradition neural network approaches to time series data analysis is that they require uniformly spaced samples for training.  Because the Neural ODE is continuous,  the sampling can be flexible and adaptive.

Final Thoughts

This report is an attempt to illustrate a striking duality that seems to exist between deep neural networks and differential equations.   On the one hand, neural networks are non-linear functions that, with the right choice of activation functions, can have smooth first and second derivatives and, consequently, they can be trained to solve complex differential equations.    Generative adversarial networks can even model the solution to stochastic differential equations that fully satisfy the governing laws of physics.  Seen from another perspective, many deep networks are just discrete approximations to continuous operators that can be solved with advanced differential equations packages.   These Neural ODEs have much smaller memory requirements and are more adaptive in their execution and may more accurately solve difficult problems.

It is going to be interesting to see where these approaches lead in the years ahead.

The github repository for this paper contains four Jupyter notebooks:  the solution to Burger’s equation, the simple non-linear and generative adversarial example, and the simple Neural ODE solution to the spiral described above and the MNIST solver.

Postscript. I realize now that I overlooked another significant contribution to this discussion.  “Deep Learning Based Integrators for Solving Newton’s Equations with Large Timesteps” arXiv:2004.06493v2 by Geoffrey Fox and colleagues show how RNN can be use to vastly improve the performance of the integration of Newton’s equation.

Accelerating Deep Learning Inference with Hardware and Software Parallelism


A persistent problem when using deep neural networks in production is the speed of evaluating the network (known as inference) on a single input.   While neural network inference has ample opportunities for using parallelism to gain speedup, these techniques are not as easy to exploit as when training the network. In this short report we will look how several new system/chip designs from companies like Groq, Celebras, Graphcore and Samba Nova are approaching the inference performance problem and we will also explore the software challenge in compiling Neural Nets to run on this parallel computer hardware.

Why is Inference Harder to speedup than Training?

Training Deep learning systems require vast computational resources and data. Fortunately, the algorithms used for training are highly parallelizable and hardware that supports either data parallel and/or highly multithreaded execution can make a huge difference in the training time.   In a previous post, we described how simple GPUs and clusters of CPUs can be used to train networks. However, once a network has been trained it must be deployed so one can use it to make inferences. Making an inference involves taking a single input (an image, query or sound clip) and pushing it through the many layers of the network. The trained model may be hosted in the cloud to support image identification or search, or to do natural language translation or question answering. Doing this fast for on-line applications is essential as the load on the application increases. The speed of inference is also critically important in robotics applications such as self-driving vehicles or controlling complex critical hardware which may involve life-support.

While the inference process still involves operations that can be parallelized, the challenge in using this parallelism to gain performance is different than it is when training the network.  The reason that this is the case is easy to understand. The metric of performance for inference is latency (the time it takes to push an item through the network) while the metric of performance for training is throughput (the volume of training data per second that you can manage). Training a network involves pushing batches of training data through the pipeline of network layers. GPUs are effective when you can reuse data that has already been loaded into their local memories.   Evaluating a batch of input data allows the GPU to load the layer weights once and reuse them for each item in the batch.

It is a well-known fact of life in high performance computing that the latency involved moving data is a performance killer unless you can hide that latency by using your hardware to do other useful computation. Because inference has fewer opportunities to reuse data, the best way to reduce inference latency is to reduce the amount or cost of data movement.

The new architectures we will look at all use extraordinary amounts of parallelism, but they also depend very heavily on compilers that can translate the neural network designs into the low level code and optimized data movements that will achieve the performance goals. In fact most, if not all, of the systems here were co-design efforts involving the simultaneous planning for the hardware and software. Hence as we present hardware details, we will need also to describe the compiler and runtime.   The last section of this report will focus on the general challenges of compilers for this class of computing system.

Hardware Advances

In the following paragraphs we will outline the advanced deep learning processor designs that are now coming on the market. While they all address the issues of training and inference, there are several that has put the issue of inference performance as a prime design objective.   The descriptions below vary greatly in the level of detail. The reason for this is that some are still a work in progress or highly proprietary. For example, all we know about the Huawei Ascend 910 is that it “performs much better than we expected”.

Groq is a Silicon Valley company co-founded by Johnathan Ross who was on the team that designed the Google Tensor Processing Unit.   The Groq Tensor Streaming Processor (TSP) is very different from the other systems which rely on massive scale multi-core parallelism. Instead the TSP can be classified as a Very Long Instruction Word (VLIW) single core, single instruction stream system.   The design is very unusual but there is a good description in the Linley Group Microprocessor Report January 2020. We will only give a capsule summary here.

The TPU is a type of systolic processor in that it has horizontal data flows with instructions streaming from the main issue engine down through 20 data layers called superlanes.   Each Superlane is composed of 16 parallel lanes of 8 byte wide data paths.   The superlanes have blocks for Matrix accumulators, transpose and permute operations and vector ALUs as shown in Figure 1 below.   Note that memory is imbedded directly in the superlanes.   Notice also that each superlane has is duplicated around a central axis so data moves between units in both directions


Figure 1.   Groq Architecture.

Instruction issue is also systolic.   The first instruction is executed on superlane 0 in one cycle. In the next cycle that instruction is executed on superlane 1 while the 2nd instruction is executed on superlane 0. In the next cycle the first instruction is executed on superlane 2, the 2nd instruction is now on superlane 2 and the 3rd instruction is on superlane 0. So, in 20 cycles an instruction has been executed on all superlanes and each subsequent instruction is complete on all superlanes, etc.   (Note: this description may not be totally accurate. We do not have the detailed Groq technical specs.)


Figure 2. Groq TSP

The Groq TSP is designed to deliver a 1000 trillion operations per second and live up to a major design goal: on the Resnet-50 deep learning model it delivers 20,000 inferences per second with a latency of 0.04ms on a batch size of 1.

A big challenge in creating a system like the Groq TSP is building a compiler that can generate an efficient instruction stream to keep all that hardware busy.   Because there are no caches, locality is not an issue and advanced architectural features like prefetch and branch prediction are not needed, so the computation is completely deterministic, and performance is completely predictable. We will return to the compiler issues below.

Habana Goya

Habana Labs was one of the first to introduce a fast inference processor.   In 2018 they announced the Goya processor.   By the end of 2019 they were acquired by Intel. The Goya architecture has an array of Tensor Processor Core (TPC) compute engines which are VLIW single-instruction-multiple-data processors. It also has a general matrix multiply engine. The TPC engines have local memory but there is a fast, shared static RAM.

One very interesting thing about the Habana design team is their work with Facebook on the Glow compiler back end for Pytorch.   More on that later.

Alibaba Hanguang 800

Another newcomer to the race for faster inference is the Alibaba Hanguang 800.   Alibaba is not planning on selling this new chip and it is intended solely for internal use in its cloud servers. There is very little that is published about its internal architecture.   In the table below we see some interesting performance numbers including one that indicates that the Alibaba system has better inference performance than the Groq TSP. However, we do not know if this IPS number is for a batch size of 1.


Figure 3. From

A Digression about Compilers, ResNet-18 and ONNX.

Before we continue discussing interesting new architectures, it is helpful to stop and discuss some general issues related to the compilers and benchmarks.

One of the big problems you encounter when writing a compiler for a new architecture is that there are several very good deep learning frameworks that are used to build deep neural networks. These include MxNet, Caffe, CNTK, Tensorflow, Torch, Theano, and Keras. One could write a compiler for each, but, given that they all build very similar network models, it makes sense to have a “standard” high-level, graph intermediate form that captures the properties of a large fraction of all neural nets.  Then, if third parties build translators from the high-level frameworks to this intermediate form, the chip architect’s job is half done: all they need to do is write a code generator mapping that intermediate to their architecture.

The Open Neural Network Exchange (ONNX) may become that standard intermediate form. Originally developed by Microsoft and Facebook, it has been taken over as a community project involving 20 companies. While we do not know how Groq, or some of the other hardware companies described here, are building their proprietary compilers, looking at ONNX as it relates to a real example can give a clue of how compilers like these do work.

In the last three hardware descriptions, performance number were often cited in terms of ResNet-50. Resnet is one of a family of very deep convolutional neural networks. Originally presented by He, Zhang, Ren and Sun in their 2015 paper, they describe a clever way to improve the ability to train very deep networks.   You can think of each level of a deep neural network as learning more subtle and abstract features of the training images than were detected by the previous layers.   A residual network is one where you “subtract” the features discovered by previous layers so the following layers can work on learning the properties of the residual.  Doing this subtraction is a way to focus the learning on what remains and helps solve a problem known as the vanishing gradient that makes it hard to train very deep networks. Mathematically If your training goal is to learn a function H(X), then the residual at some layer is F(X) = H(X)-X. Hence we want the following layers to learn F(X)+X to recover H(X). Creating F(X)+X in the network is easy and it is shown in Figure 4.


Figure 4. From Zhang, Ren and Sun in their 2015 paper.

We can construct such residual network with Torch or TensorFlow and then we can look at the ONNX intermediate. In Torch, the code is summarized below. (The complete code is in a Jupyter notebook that accompanies this post.) There are two networks. One is the residual block as illustrated in Figure 4 above and the other is the full model that incorporates a sequence of residual blocks.


In the image above, we created an instance of the model as “resnet” and then set it to the eval() state. Using the Torch built-in ONNX export operator we can save a copy of the model in the file resnet.onnx.   Doing so gives an output like Figure 5 below.   On the right we have fragments of the ONNX intermediate code and on the left, a graph that is generated from the ONNX code with a tool called netron. What is shown here is only a small part of the ONNX graph. The top is just a list of all the model variables. Following that we have actual code for the graph.

The ONNX exporter will build the graph from the internal Torch model.   There are two ways in which it does this.   One is to directly “unroll” the graph by interpreting the execution of the forward(input) eval operator. In some cases, if the definition of the model contains conditionals, it will insert conditional code in the graph, but these are rare cases.

In this case the code consists of an initial convolutional layer followed by a batch normalization which is based on the mean and variance of previously seen batches.   This is followed by the first instance of the Residual block model.


Figure 5. Fragment of the ONNX output for the Resnet18 model.

As you can see, the ONNX graph consists of nodes that are parameterized operators with inputs that are the model tensors and they produce one output that is a well-defined tensor. Generating code for a specific architecture can be a simple as building well-tuned native versions of the ONNX operators and then managing the required data movement to ensure the input tensors are in the right place at the right time for their associated operation nodes. On the other hand, there are a number of important optimization that can be made as we “lower” the ONNX graph to a form that is executed. We will return to this point after we complete the descriptions of the new architectures.

Cerebras Systems

Cerebras Systems has taken the parallelism to an extreme. The power of their approach is most evident during network training rather than inference, but it is interesting enough to describe it in more detail. Their CS-1 system is based on a wafer-scale chip that consists of a 2d grid of 400,000 compute cores interconnected by a 2d-mesh network capable of 100 Petabits/sec bisection bandwidth that delivers single word active messages between individual cores.


Figure 6. Cerebras WSE

The Cerebras software contains the Cerebras Graph Compiler that maps deep learning models to the hardware. Their approach is an extreme form of model parallelism where each layer of the network is mapped to as many compute cores as is required to contain it. Their philosophy is nicely described in the post “Neural Network Parallelism at Wafer Scale”​ by Natalia Vassilieva and in their product overview.


Figure 7. Cerebras Software Stack

The training uses pipelined back-propagation.   The graph compiler takes the source description of the network and extracts a static graph representation of the problem and converts it into the Cerebras Linear Algebra Intermediate Representation (CLAIR). This is then converted into a “Kernel graph” and mapped to the hardware as shown in Figure 7. In their approach the entire network is mapped onto the computing fabric, so pipelining batches through has no points of congestion.


Graphcore is a U.K. startup started shipping their accelerator, called the Intelligence Processing Unit (IPU), in 2018 and it is now available on Azure for evaluation.   Like Cerebras, the architecture is based on massively parallel processing. Each IPU contains 1,216 processing elements called tiles; a tile consists of one computing core plus 256 KiB of local memory.   There is no shared memory, but the local memory is SRAM and faster than the DRAM in most CPU servers.  To hide latencies, the IPU cores are multithreaded.   Each core has 6 execution contexts that served in round-robin style. In terms of computational performance, each IPU is 31.1 TFlops/s in single precision.


Figure 8.   Graphcore IPU

There is an inter-processor communication switch called the exchange that provides processing element data communication and multiple IPUs are connected via a fast, off-chip interface network.   Citadel has published an excellent performance analysis by Jai “Dissecting the Graphcore IPU Architecture via Microbenchmarking”.   They have measured “On a per-tile basis, each of the 1,216 tiles can simultaneously use 6.3 GB/s of bandwidth to transfer data to an arbitrary destination on chip. The latency of an on-chip tile-to-tile exchange is 165 nanoseconds or lower and does not degrade under load above that value.” They also measured the latencies between cores that reside on different IPU boards where several boards had to be crossed to deliver the message. This increased the latency to about 700 nanoseconds.  Their report provides a very complete analysis of the data traffic performance under a variety of conditions.

Ilyes Kacher, from the European Search engine company Quant have also produced an analysis: “Graphcore C2 Card performance for image-based deep learning application: A Report”. Their interest in Graphcore was to improve performance of their image search product. In their study they considered the image analysis network ResneXt101. Inference experiments for batch sizes of 1 and 2 are 1.36 ms. Their benchmarks claim this is 40time lower latency than an Nvidia V100 GPU.   They also compare performance on BERT and they measure 30% lower latency with 3 time higher throughput.

The programming model is based on Bulk Synchronous Parallelism in which computation is dived into phases, with each phase consisting of a computation step followed by a communication step and then a barrier synchronization.


Figure 9. Graphcore BSP execution (from Graphcore IPU Programmers Guide.) is a discussion of their stack.   More significantly they have open sourced the entire software stack.  They have a runtime environment called the Poplar Advanced Run Time (PopART) that can be used to load a ONNX model from python and run it on their hardware. For Tensoflow they have a separate compiler and runtime.

Graphcore hardware is now available on Azure for evaluation.


SambaNova is a bay area startup founded by two Stanford professors and a former executive from Sun and Oracle. They have not yet announced a product, but they have an interesting background that may indicate a very novel approach to the design of an AI accelerator.

Reconfigurable computing is an idea that has been around for since the 1960s. Field Programmable Gate Array are in common use today to configure a processor to execute a new algorithm, but these usually take 10s to 100s of milliseconds to “reprogram”. Suppose you could configure the logic elements on the chip to perform a needed transformation on a block of tensors just as that block emerges from a previous operation? The SambaNova team has looked at specialized programming languages that allow them to generate streams of high-level templated instructions such as map, reduce, shuffle and transpose that are natural elements of deep network kernels.   This is clearly a talented, well-funded team and it will be interesting to see what is eventually released.


A Toronto startup called Tenstorrent has built a device called GraySkull.   The chip has 120 small processing nodes, called tensix, and two toroidal mesh networks that can be extended off-chip to build larger clusters. There is no shared memory.   In various articles about Tenstorrent they emphasize their approach to dealing with sparsity in large neural net models is key to high performance on big models.   Like several of the other startups, their compiler translates ONNX graphs into tensix primitive operators which are mapped to the nodes. They claim 22,431 IPS on resnet 50 and 23,345 sentences/sec on BERT.


Figure 10. From Tenstorrent.


Finally, we include NLVDA from NVIDIA. Called a Deep Learning Accelerator, this is an open source modular architecture for building inference accelerators. There is a hardware instance called Xavier that NVIDIA has produced to support inference for autonomous transportation applications.

Compiling Neural Nets for Parallel execution.

In the remainder of this report we will look at the techniques that are used in modern compilers to optimize performance on neural network training and inference. Many of the basic techniques have been used in compilers for 50 years. These techniques evolved as CPU arithmetic and logical units became so fast that many operations were dominated by the time it took to move data from main memory through layers of faster and faster caches. Data locality was critical: if an item of data was going to be reused you needed to keep it in fast cache as long as possible.

Almost all of the operations in a neural network involve matrix and vector arithmetic.   If you consider the most basic type of network layer, an n by n full connection, it is just an n by n matrix and a vector of offsets. Applying such a network to a single vector of n inputs is just a matrix-vector multiply and a vector addition. The problem with matrix-vector multiply is that the matrix elements must be fetched from memory and are used only once. On the other hand, if the computation is properly blocked so that small chunks of the array are loaded into the GPU or CPU caches, then if you have a batch of n vectors, each element of the array can be fetched once and used n times.   This method of improving matrix-matrix computation is the basis of the standard library known as the Level-3 Blas developed by Jack Dongarra and others in the 1980s.

A more interesting example of how locality can be used to speed up performance in a neural network is 2-D convolutions that are used in deep learning image networks.   Figure 11 below shows a 2-D convolution operating on a 6×7 image data with 3 color channels and outputs a new 6×7 array with 6 channels.   Each output channel is produced by using, in this case, 3 filters of size 3×3. Each filter is applied to a channel of the input (which has been expanded with an extra border of ghost pixels). The filter moves across the input channel computing the inner product of the filter with the image at that point. The corresponding output point is the sum of the three filter inner products applied to each of the input channels.


Figure 11. A 2-D convolution applied to a 3 layer, 6 by 7 image producing 6 output images.

The 6 by 3 by 9 tensor of filters, W, is the learned object and the full computation is show in the formula above (we have suppressed the bias terms to simplify the presentation). If we let Cin be the number of input channels and Cout be the number of output channels for a width by height image the full computation takes the form below. (The input is the padded (width+1) by (height+1) array, so that the pixel at position (0,0) is in location (1,1) in the array Input.)


This form of the computation has extremely poor locality and it will run slowly. However, we illustrate below a sequence of program transformations that allow us to simplify this loop nest and “lower” the execution to primitives that will allow this to run up to 400 times faster.


A close inspection of this nest of six loops will convince you that we can execute them in any order. In fact, the addition recurrence is only carried by the inner three loops.   Consequently, we can pull out the three inner loops as a separate function that does not involve repeated writes to memory around the Output tensor. The result is shown below.


The next thing to notice is that if you move the t loop in the kernel function to the innermost position the operation is an inner product. The computation now takes the form


One final reduction can be made when we notice that the two loops in kernel2 are just a pointwise matrix product of the 3×3 filter W with the Input shifted to position (k,j). And the summation can be done with the torch.sum() function, so our function now takes the form below.

We ran these four versions of the function on two machines: an Intel core I7 and an Nvidia Jetson nano. The results are in Tables 1 and 2 below. As you can see, the performance improves substantially for each transformation. In addition, the speed up of the matrix product version over the 6 nested loop version varies from 68 to over 400 times with the greatest speedup occurring when the values of Cin are largest.

6 nested loops Factored Kernel Kernel with dotprod Matrix product Cin Cout W,H SpeedUp
2.24 seconds 1.26 1.0 0.022 16 4 10,10 68
4.47 2.75 0.19 0.047 16 8 10,10 95
8.24 4.98 0.39 0.077 16 16 10,10 107
8.51 4.97 0.20 0.038 32 8 10,10 223
8.66 5.06 0.10 0.020 64 4 10,10 433

Table 1. Execution time on Intel Core i7 for the four versions of the loop with various values of Cin and Cout. The speedup is measured as the ratio of the 6 nested loop time to the time for the matrix product.

6 nested loops Factored Kernel Kernel with dotprod Matrix product Cin Cout W,H SpeedUp
47.9 seconds 28.1 7.02 0.7 16 4 10,10 68
87.9 52 9.7 0.73 16 8 10,10 120
168.9 107 18.9 1.17 16 16 10,10 144
171 107.9 9.8 0.59 32 8 10,10 289
174 104.0 4.38 0.43 64 4 10,10 404

Table 2. Execution time on Nvidia Jetson Nano for the four versions of the loop with various values of Cin and Cout.

The final point to notice about this last version is the final (i,j,k) loops may all be executed in parallel. In other words, if you had a processor for each pixel on each output plane the entire operation can be run in parallel with an addition speedup factor of Cout*width*height. Of course, all of these versions are far slower than the highly optimized conv2d() library function.

The compilers we talk about below do not operate at the level of program transformation on Python loop nests.   They start at a higher level, transforming ONNX-like flow graphs and eventually lowering the granularity to primitive operators and scheduling memory management and communication traffic and eventual code generation.

Deep Neural Network Compilers.

Every one of the hardware projects we described above has a companion compiler capable of mapping high level DNN frameworks like PyTorch or Tensorflow to run on their new machine.   Not all of these have detailed descriptions, but some do, and some are also open source.   Here is a list of a few notable efforts.

  • Intel’s NGraph compiler “Adaptable Deep Learning Solutions with nGraph Compiler and ONNX” with extensive open source and documentation.
  • ONNC: A Compilation Framework Connecting ONNX to Proprietary Deep Learning Accelerators. Also fully open source. It is designed to support NVDLA hardware.
  • Nvidia TensorRT is an SDK for high performance inference that is built on CUDA. A TensorRT backend for ONNX is also available and open source.
  • Apache TVM is an open source compiler stack for deep learning that originated with the Computer Science department of the University of Washington.
  • Google MLIR, a Multi-Level IR Compiler Framework that provides tools for many types of programming language compiler challenges.
  • ONNX Runtime for Transformer Inference from Microsoft has been open sourced.
  • The Facebook Glow compiler “Glow: Graph lowering compiler techniques for neural networks” paper.
  • The GraphCore PopART runtime was discussed in the GraphCore section above.
  1. Sivalingam and N. Mujkanovic of CRAY EMEA has a nice summary of these compilers in this post.

Glow translates the input from ONNX or Caffe2 into a high-level intermediate graph that is very similar to ONNX. Because they are interested in training as well as inference they next differentiate the graph to facilitate gradient decent training. This new graph contains the original and the derivative. TVM also generates a differentiable internal representation.   All of the others have similar high-level internal representation. Where they differ is in the layers of transformation and lowering steps.

High Level Graph Transformations

The next step is to do optimization transformations on the graph.   The nGraph compiler has a Reshape Elimination pass exploits the fact that matmul(A.t, B.t).t = matmul(B,A) and other algebraic identities to make tensor restructuring simplifications.   Common subexpression elimination and constant folding are standard compiler techniques that can be applied as transformations the high-level graph. For example when compiling a model to be used for inference most of the parameters in the high level nodes, such as various tensor dimensions are known integers and expressions involve address arithmetic can be simplified.

An important part of the “lowering” process is where some of the high-level nodes are broken down into more primitive linear algebra operations.   This step depends on the final target architecture: for a GPU certain transformation are appropriate and for a CPU, different choices are made. For example, with ResNet Glow has different strategies for different instances of the convolution operator depending on the size of the filter weights and these require different memory layouts. TVM, Glow and ONNC use a type of layer fusion to combine consecutive operators such as Convolution and Batchnormalization or ReLu into a single special operator.

Low Level Internal Representations

Now the graph is transformed into the Low-level internal representation. This layer is more specific about the representation of memory layout and important optimization can be made there. For example, if there are sequences of operation that must sweep across a large tensor, one can break the tensor into blocks so each block can be loaded once, and the operation sequence can be applied to the block. This is a classic locality optimization. Managing memory can involve other transformations. For example, ONNC uses layer splitting to handle memory constrains as shown in Figure 12 below.


Figure 12. From Skymizer ONNC tutorial.

Quantization is an issue that several compilers address. Glow also does profile-guided quantization so that floating point networks can be converted into efficient integer-based networks.   Finally, depending upon the backend architecture, code is generated from the final low-level graph.

Runtime Systems

Because the compilation system involves mapping the neural network onto hardware configurations which may have more than one processor that must communicate with each other, there must be a runtime system to handle the coordination of the execution.

Glow has a runtime system that is capable of partition networks into an acyclic graph of subgraphs and scheduled across multiple accelerators. As we have discussed previously the GraphCore PopArt runtime manages BSP-style execution across thousands of processor threads.

The Microsoft ONNX runtime focuses on CPU and CPU + GPU execution on Windows, Linux and Mac OS. For the GPU it supports CUDA, TensorRT and DirctML.   It also supports IOT/Edge applications using Intel OpenVMINO, ARM and Android Neural Networks API.

Final Thoughts

The explosion of computer architecture innovation exemplified by the new systems described here is very impressive. It is reminiscent of the boom in HPC innovation in the 1990s which led to the current generation of parallel supercomputer designs. The density and scale of some of the chips are very impressive.   In the 1980s we considered the impact of wafer-scale integration on parallel computing, so 40 years later, it is interesting to see it come to pass in systems like Cerebras.

There are many details of the compiler infrastructure that we covered here very superficially.   We will return to this topic in the future when we have more access to details and hardware.


Anomaly Detection: From the Edge to the AWS and Azure Cloud

There are now billions of sensors that monitor the world around us. Bio sensors are used to monitor every aspect of life. Environmental sensors measure temperature, humidity, pressure, chemical concentrations, vibrations, acceleration, light wavelengths and more. These sensors produce a constant stream of data that must be analyzed and when unusual behavior is detected these anomalies need to be reported. This alarming behavior may consist of spikes in sensor readings or device failures or other activity that should be flagged and logged. Often these sensors communicate with a nearby small edge computing device which can upload summary data to the cloud as illustrated in Figure 1. Typically, the edge computer is responsible for some initial data analysis or, if it has enough computing capacity, it may be responsible for detecting the anomalies in the data stream.


Figure 1. Edge sensors connected in clusters to an edge computing device which does initial data analysis prior sending aggregated information to the cloud for further analysis or action.

In this short note we look at two cloud services that provide anomaly detection. One is the Azure Cognitive Service anomaly detector and the other is from the Amazon Sagemaker AI services. In both cases these services can be (mostly) installed as Docker Containers which can be deployed on a modestly endowed edge computer.    We will illustrate them each with three example data streams.   One data stream is from an s02 sensor that was part of an early version of the Chicago-Argonne Array-of-thing edge device. The second is from the Global Summary of the Day (GSOD) weather from the National Oceanographic and Atmospheric Administration (NOAA) for 9,000 weather stations between 1929 and 2016. In particular, we will look at a sensor that briefly failed and we will see how well the anomaly detectors spot the problem. The second example is an artificial signal consisting of a sine wave of gradually lengthening period with several anomalous data spikes.

The two services each use a different algorithm to detect anomalies.   The Sagemaker algorithm uses a machine learning method called Random Cut Forest and the Azure detector uses a method which combines spectral analysis with a convolutional neural network. We will describe both algorithms in more detail at the end of this section, but first we go to the set-up and experiments.

Azure Cognitive Services.

To use the cognitive service, you need to go to the Azure portal and then to cognitive services. There you can use the search bar to look for the “Anomaly Detector” (at the time of this writing it is still in “preview”). You will need to create an instance and that will get you an API key and an endpoint for billing.   (You can use it for free until you use up the free quota.   After that you can switch to payments.   I did this and it did not cost me much: so far $0.75, for the work on this paper. )

Download and Launch the Container

You should go to this page to see what is currently required to launch the container. Assuming you have docker installed on a machine (your laptop or in the cloud), you must first pull the container.

docker pull

Next you will use your ApiKey and billing endpoint to launch the container.   This command works:

docker run --rm -it -p 5000:5000 Eula=accept Billing={ENDPOINT_URI} ApiKey={API_KEY}

We can now use the anomaly API to directly interact with the algorithm running on the container. We have supplied a Jupyter notebook with the details of the experiment that follows.


Rather than use the endpoint for the cloud resident service we need an endpoint on the container.

endpoint = 'http://localhost:5000/anomalydetector/v1.0/timeseries/entire/detect'

(which assumes your container is running on your local machine.   If it is remote you need to make sure port 5000 is open and substitute the host name for local host.) Notice the word “entire” in this endpoint.   The detector operates in two modes: entire and last.   Entire mode considers the entire history of a stream and spots the past anomalies.   Last mode is used to do real-time detection.

To illustrate its behavior on an example we will use a data stream captured from an SO2 sensor on an early version of the Argonne-Chicago “Array-of-Things” edge device. Running the detect method, decoding the output and plotting the results (see code in the Notebook) gives us the graph below. While hard to see, there are three things being plotted.   One is the value of the data, the other is a line of expected values and a region of boundary of uncertainty and finally the anomaly (red dot).   In this case the region of uncertainty is very narrow and not visible. We will see it more clearly in the next example.


Now the real-time monitoring case.

To run the algorithm in a continuous mode you need to use the “last” endpoint.

nendpoint = 'http://localhost:5000/anomalydetector/v1.0/timeseries/last/detect'

This allows the algorithm to look at a window of data and make a prediction about the last item in the window.   We can now send a new “sliding” window consisting of the last “window-size” data points to the service every time step.


The detector returns a dictionary and, if the last item in the window as anomalous, then the flag result[‘isAnomaly’] is True.   By keeping track of the anomalies (see code in the notebook), we can plot the result.


In this case we have spotted the true anomaly (red dot) and another that looks like a false alarm.   By lowering the sensitivity value, we may be able to eliminate some of the false alarms.

The Skagit Valley Temperature

Turning now to the NOAA GSOD data for the temperature sensor in the Skagit valley, WA station, we have one measurement per day for a year.   There are a few days in October where the sensor goes bad and signals a temperature of over 100 degrees Fahrenheit. (We looked at this case using Google’s data analysis tools in our book.   By looking at other nearby sensors we saw that this was not the temperature in this location.)  The figure below shows the batch anomaly detection for this data. In this case the region of expended value uncertainty is very clearly defined and the bad data is easily spotted.


Turning to the real-time detection mode, we see below that the red dots show the true anomaly, but it also flagged to other locations. One in the month of May, looks a false alarm and another in late November that is unclear.


Synthetic Oscillatory Data

We now look at the case where the data consists of a sine wave that has a slightly increasing period with a few added spikes.   The code to generate the data is


In this case the batch detector accurately tracks the sine wave and catches the big spike, but misses the first spike.  However, the real-time detector does a very good job.



The AWS Sagemaker Anomaly Detector.

The AWS Sagemaker service can be completely managed inside a container, but one difference with the Azure service is that the Sagemaker version does all the analysis in the cloud.   We have provided in the github repository the Docker file you need to create a container that will run in the edge device that will make the calls to the cloud.  In this case the job of the container is to gather the data, interact with the cloud service to set up the algorithm training, deploy a server that will host the trained model return inference results.   This container-based component is a script that makes calls to aws sagemaker. To better illustrate the details, we have a Jupyter notebook.   You can use the following script to build the container, run it and launch jupyter from inside.

docker build -t="yourname/sagemake" .
….   a great deal of build output follows
docker run -it -p 8888:8888 yourname/sagemake
…. Once container starts we are now running as ec2-user in the container.
ec2-user@29e378df61a9:~$ jupyter notebook
….. the output will tell you how to point your browser to see the notebook.

Note: to run the container, it must have your AWS credentials and your Sagemaker execution role identifier. Consequently, do not push your container to the docker repository and delete it when you are finished.

To get started you must log in to the AWS Sagemaker portal and create a user and an execution role. You will also need to create a bucket where Sagemaker will store your model data. The notebook shows the details for how to use this information to create a “session” and “role” object. We will use these to train the algorithm.

One of the ways the Sagemaker documents suggest for presenting data to the algorithm in cases where the size of the collection may be small is to create a “shingled” version. In other words, we take the data stream and for each time instance, create a set of the next “shingle-size” data values as follows.   Using the data from our Array-of-things S02 sensor.


To train the model we use the session and role objects as follows:


Next create a cloud instance that can be used for doing inference from this model.


We can now do a “real-time” analysis by doing a sequence of inferences on sliding windows of shingles. We will use 25 shingles each of width 48 so the window covers 73 time units (we used windows of size 100 in Azure examples). The code now looks like


Plotting the anomalous points with black dots we get the picture below.


As you can see,   it detected the anomaly at 400, but it flagged four other points.   Notice that if flags any point that has an anomaly score greater than 3 standard deviations from the others in the sliding window. Raising the threshold above 3 caused it to lose the actual anomaly but retained the three false alarms.

Applying the same procedure to the Skagit Valley temperature sensor and the artificial sinusoid al signal we get similar results.



Comparing the two anomaly detectors, we found it was easier to get accurate results from the Azure cognitive service than the AWS Sagemaker.   One the other hand, the Sagemaker method has a number of hyperparameters that, if tuned with greater care than we have given them, may yield results superior to the Azure experience.

Another important difference between the two detectors is that the Azure detector can be completely deployed in the container while the AWS detector relies on cloud hosted analysis.   (Of course the Azure system still keeps a record of your use for billing purposes, but it was not expensive: $0.14 for the experiments above. For the work using Sagemaker it was a total of $6.41) We expect that the Sagemaker team will make a container available that will run entirely in the edge device. They may have already done so, but I missed it. If a reader can help me find it, I will happily amend this article. One possibility is their excellent greengrass framework.

The Algorithms

The Random Cut Forest

An excellent github site with good details about using the Sagemaker service is here.   This is also what we modified to create our Jupyter notebook. A basic description of the algorithm is in “Machine Learning for Business” by Doug Hudgeon and Richard Nichol and available on-line.    However, for a full technical description one should turn to the paper by Guha, Mishra, Roy and Schrijvers, “Robust Random Cut Forest Based Anomaly Detection On Streams” published in 2016.

At the risk of over simplifying, Figure 2 below illustrates how a random cut forest can be built for one-dimensional data values.   We start by picking a point somewhere near the middle of the data set and then divide everything above that point into one group and everything below that in another group. Then for each group repeat the process until the points are each in groups of size 1. Now, for each point count the number of tree divisions it took to divide the tree to get isolate that point. Low counts indicate possible outliers.   Now do the tree construction many times. Compute the average of scores.  Again low indicates possible anomaly.


Figure 2.   Forest of Random Trees for a one-dimensional data collection.

Unfortunately, if you apply this technique to a large window of a data stream that has a wide range of values it will only capture the extreme ends of the values. If there is an anomaly in the mid range of values it may not be seen as anomalous when taken as a whole.   For example in the image below we considered the example of the SO2 sensor and apply the algorithm across the whole data set we see it completely misses the anomaly at 400 and flags the overall highs and lows.


But, as we showed above, when we applied a sequence of small windows of data to the same forest of trees it did capture the anomalous spike at 400.

To better illustrate this point, consider a variation on the sine wave fake data example. The random forest tree algorithm was able to detect the anomaly when the spikes were introduced. But a variation on this example can show its failure. Instead of introducing the spikes, we simply flatten part of the sine wave for a segment of the range.   The result is that that no anomaly is detected, and for the modified range, the anomaly score actually drops.   The situation is not helped by the sliding window test.


The Azure Spectral Residue CNN Anomaly Detector

Hansheng Ren, et al. published “Time-Series Anomaly Detection Service at Microsoft” at the KDD 2019 conference.  The paper describes the core algorithm in the Azure cognitive service anomaly detector. There are two part to the algorithm.

Part 1. Spectral Residual and Salience

Salience in image analysis is the property that allows some parts of an image to stand out and be easily identified. It is a technique that is often used in image segmentation.   Spectral residue is computed as follows. Applying an FFT to the stream sequence yields a measure of the frequency spectrum of the data.   The spectral residue is the difference between the log of the spectrum and an averaged version of the same.   Using the inverse FFT transforms the spectral residue back to physical space.   That result is the saliency map of the signal and locates the potentially anomalous parts.

Part 2. Applying a CNN to the saliency map

The novel feature of the algorithm is that it uses a convolutional neural network to do the final anomaly detection.   The network is trained on saliency maps that are generated by injecting artificial anomalies into a variety of real signal types. The paper describes this process in more detail.

To illustrate the power of this method, consider the flattened sine wave that defeated the Forest of Random Trees example above.   As shown below, the SR-CNN method captures this obvious anomaly perfectly.   As you can see, it projected the sinusoidal oscillations into its “expected value” window and the flat region certainly did not match this “expected” feature.


While this example is amusing, we note that in the cases (Skagit weather, So2) where we looked at the real-time sliding window analysis, both methods found the anomaly, even though the random tree method has more false alarms.

All of the data and Jupyter notebooks for these examples are in GitHub.


Modeling Natural Language with Transformers: Bert, RoBERTa and XLNet.

Chapter 10.4 of ‘Cloud Computing for Science and Engineering” described the theory and construction of Recurrent Neural Networks for natural language processing. In the three years since the book’s publication the field of language modeling has undergone a substantial revolution.   Forget RNNs. Transformers are now in charge, so this report is an update of that part of the chapter.

The original RNNs were not very good at capturing the context of very long passages. A technique was developed, called ‘attention’ that was an add-on to the RNNs to help them with this problem. In a landmark paper, “Attention is all you need”, Viswani, et. al. showed that the recurrence part was not really needed. This supplement will describe the basic transformer architecture and look at three examples.   The first is called BERT and it was the transformer that changed the field of natural language processing. We will briefly describe its architecture and demonstrate how to use it with an optimized version called RoBERTa.   Finally we will move onto a more recent transformer called XLNet that has drawn a lot of interest.

There are dozens of blog posts and articles on-line that describe transformers and the original papers do a great job describing the mechanical details.   The challenging part is understanding how the details of the network design capture the networks ability to model the probability distributions associated with natural language. What do I mean by that? Consider the sentence “He decided to walk across the lake.” A native English speaker would be troubled by that and, perhaps suggest that an error occurred and the sentence should have been to be “He decided to walk across the lane.” There is nothing grammatically wrong with the first sentence.   It just does not feel right. It is beyond normal in the internal language model we use for comprehension.

We can also consider ‘fill in the blank’ test to see how we draw on reasonable expectations of how words fit into context. For example, consider these three sentences.

  1. “Because she was a good swimmer, she decided to <mask> across the <mask>”
  2. “He went to the farmer’s <mask> and <mask> a bunch of green <mask>.”
  3. “Whenever <mask> go to the whiskey <mask>, <mask> have a lot of <mask>”

In sentence 1 the masked pair could have been “walked”, “road”, but context “swam”, “lake” is more natural because, in context, she is a swimmer. For sentence 2, the masked triple could be “home”, ”ate”, ”cars”, but “market”, “purchased”, “beans” feels better. We leave it to your imagination to decide what fits in sentence 3.

A masked language model is one that can input words in sentences like those above and any masked word is replaced by something that, with reasonably high probability, fits in the context of the sentence. Before launching into experiments with masked language modes let’s briefly look at the architecture of transformers as described in the original Viswani paper.

Bert and the Transformer Architecture

A transformer has two major components: an Encoder and a Decoder. The Transformer has an implicit model of language in that it has learned a probability distribution that is associated with “meaningful” sentences in the language. The encoder is a non-linear function that maps an input text object into an object in high dimension real space that is somewhat “near” very similar sentences.   To do this, Transformers have a special tokenizer that can convert text into token lists, where each token is an integer index into a vocabulary list for that transformer.  More specifically let s be a string, then let In_tokens = Tokenizer.encode(s) and n = length(in_tokens). The Encoder has an embedding function

Encoder.input_embedding: ZM -> Rk

Where M is the size of the model vocabulary and k is the model-specific dimension of the embedding space.     Hence the representation of the entire string is n vectors of dimension k (or vector of dimension Rk*n ). There is a now famous diagram that illustrates the architecture of the transformer.


Figure 1. The transformer model (from “Attention is all you need”, Viswani, et. al.)

The full encoder is the vertical stack on the left. So far, we have only described the Input Embeddings. Ignoring (for now) the Positional Encoding, note that our Rk*n input is fed to a stack of “N” blocks, each of which consists of a Mult-Head Attention and a Feed forward network, each of which generate a new Rk*n vector output. The Multi-Head Attention block consists of a parallel collection of scaled dot product blocks as illustrated in Figure 2.


Figure 2.   Multi-Head Attention and Scaled-Dot-Product Attention (again from Viswani et. al.)

The basic Attention part is the critical component.   It is a function of three tensors.   Q is called the query tensor and it is of dimension nxt for some t and K is called a key tensor and it is of the same dimension.   The keys are “associated” with values in the vector V which is of size nxk.   The attention function is then


Note that Q*KT is of size nxn. The softmax is applied to each row of the product (after normalizing by the square root of t), so the final result is nxn time nxk which is size nxk.   The way to think of this function is that each t dimensional query vector is being dot-product compared to each key vector. If those two vectors are nearly orthogonal then the dot-product is small and the corresponding row on the result of then attention function will be small.   Here is the motivational concept.   Let’s suppose the queries are associated with words in a sentence and we want to see what other words in the sentence are most related.   The other words are keys.   So if the sentence is “John got in his car then he went to the store”, then we would expect a strong link between “John”, “his” and “he” as illustrated below.


Figure 3. Dark lines indicate where we expect the attention between words to be the strongest. (Apologies to those who feel the reference to gendered pronouns is bad.)

Now to explain the Multi-Headed Attention.   We are going to replicate the scaled dot-product attention network h times where h is a divisor of k and we set t = k/h. Our embedded sentence is of dimension nxk, so to create the Q, K and V vectors we have to project these into vectors of size nxt (for Q and K) and nxk for V. These projections are the linear transformations in Figure 2.   Also we now have h outputs of size nxk. We concatenate those outputs and use a final linear map to project it back to a single nxk vector.

The idea of multiple heads is to allow different heads to attend to different subspaces of attention. It also slightly reduces the computational complexity of the attention convergence computation.   However, there is some question as to whether multiple heads helps that much (see “Are Sixteen Heads Really Better than One?” by Michel, Levy and Neubig.)

The final critical component of the network is the feed forward block.   This is a basic two-level network with one hidden layer.   Notice that this does not involve token-to-token analysis; that was the job of the attention blocks.   Hence the network can process the tokens independently, so its internal structure is independent of the token stream length.

There are other components of the Transformer including the add & norm steps and the Decoder.   The Decoder half has the same basic components as the Encoder, and it is the part that is critical for building language translation systems but we will not address that task here. (A good discussion of the Decoder is given in “Dissecting BERT Appendix: The Decoder”, a Medium post by Miguel Romero Calvo.)

There are several other ways to train a model like this and one is to turn it into a masked language model. To do this we wrap the basic Encoder with a “Masked Language Model Head”.     To explain this, we need to get into a few details related to the experiments we will show.

The original Transformer version, BERT (Bidirectional Encoder Representations from Transformers) was developed by Google, but we will use the more optimized version called RoBERTa (from Facebook and the University of Washington), which was released together with the paper a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. (RoBERTa was trained with mixed precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected by Infiniband.) The Roberta Masked language model is shown in Figure 4 below. The RobertaMaskedLanguage model is composed of a Language Model head on top of the base language model.


Figure 4.   Vword = a vector of probabilities indexed by words in the vocabulary.

The LM head consists of a linear transformation normalized with GeLU (Gaussian Error Linear Unit) activation and then again with the a LayerNorm function ( (x – x.mean)/|x-x.mean| ) followed by a linear transformation mapping the result back into a list of vectors of probabilities each of which has length equal to the size of the vocabular. The length of the vector list is the number of tokens in the original input.   If the transformer is doing its job, the vector of probabilities associated with a given work-token has a maximum at the index of that word (or a better one) in the vocabulary.   (The Masked Language Header also returns cross entropy loss of the predicted values from the input string, but we will not use that here.)

Demo Time!

To illustrate the behavior of RoBERTa language model can load an instance as follows. We can use the PyTorch-Transformers by HuggingFace Team who have provided excellent implementations of many of the examples in the Transformer family.


Roberta-base has 12-layer, 768-hidden, 12-heads and 125M parameters. To use the model, one need only convert a text string to a tensor of input tokens, feed that to the model and pull out the list of prediction scores (which is returned as a tensor with shape string-length by vocabulary size). Taking the largest prediction as the likely correct word and converting that back to a token ( and removing an internally added character) yields the result.   In code this is:


To illustrate this on a simple sentence, we will use one that is not grammatically correct and see what the model comes up with.


Notice two things. The model split the word speller into a root and suffix. But the model generated a new sentence that was closer to what the probability distribution said a current sentence should look like. In this case that was a change from “were” to “was”.

To illustrate its use as a masked model we can substitute words in the text with a <mask> to see how the model replaces the masks.


In this case it matched the pronoun “she” and inferred that “swim” and “lake” was a good choice.

Changing the context slightly we have


Taking our first example from the list of three above,


And finally, our third sample sentence


The source notebook for this example is in the GitHub archive. Load it and try it out yourself.

XLNet and sentence generation.

Another important point we neglected to discuss in the discussion of BERT, involves the position of tokens in the input.   The Attention operation compares each work in the string with each other word “in parallel”, however the order of words in the string matters if we want to understand its meaning.   Hence, it is necessary to tag each token with position information.   The way this is done is to create an additional value (based on a clever use of sine waves of different frequency … the details are not important here) to encode position in the string.   These values are literally added to the embedding values before being sent to the Attention function.   This is a detail we do not see when we invokee the RobertaForMaskedLM model. It is handled internally. (You can see it referenced in Figure 1.)

BERT is an autoencoding (AE) language model: it is trained to recover masked tokens in its input. XLNet is a newer Transformer language model that is showing better performance than BERT or RoBERTa on many test cases.   BERT derives its power to predict a token at sequence position t from the fact that it looks at both the elements larger than t and smaller than t. But BERT has one subtle weakness.   Because the it is trying to predict masked tokens ‘in parallel’, that does not mean that it will predict them consistently. For example, “”he went to the <mask> <mask> and purchased lots of postage stamps.” gives BERT a hard time. It tries to replace the pair of masks with “store store”, instead of “post office”. (Replacing postage stamps with vegetables and it guesses “grocery store”)

An autoregressive model, like XLNet, operates differently.   Given a string of word-tokens X = [x1, x2, … , xt], an autoregressive language model will try compute a sequence of conditional probabilities to compute p(X)


The problem with this is that is that if we are going from the left to right, we miss the context for words that have important information that are to the right. BERT avoids this by doing all words in parallel. Developed by Yang, Dai, Yang, Carbonell, Sakakhutdinov and Le from CMU and Google, XLNet is referred to as a generalized autoregressive (AR) language model.   Rather than do the predictions from the left (x1) to right (xT ), XLNet uses all permutations of the tokens to do the computation.

The following is very abstract and a bit confusing, so feel free to skip this next part and go to the fun demos that follow.

Let ZT be the set of all possible permutations of the length-T index sequence [1, 2 ,… ,T]. Let z be a permutation in ZT. The notation zt refers to the tth element in the permutation. Similarly, z<t refers to the t-1 elements of the permutation that precede it. We can apply the same conditional probability sequence stated above to elements in the permutation order to compute the conditional probability


The key idea is that if we compute the probabilities for the correct word at each position in the string using every permutation of the words in the string then all contexts for the word in that position can be considered. For example the sentence “He is tall.” Looking at the permutations

He is tall –
tall He is +
is tall He +
tall is He +
is He tall –
He tall is –

If we compute the probabilities from left to right of each of these permutations we see that the permutations marked with a + have the word “tall” before “He”, so in calculating the conditional probabilities p(“He” | … ) we see the word “tall” is encountered in that computation.   If the sentence had been “He is talls”, these conditional probabilities would have been lower and the result would have been lower.

The XLNet designers decided to train a network that would use this principle as a goal and optimize the network parameters (theta) to find


The result is a sequence of M query operators g that are defined by the model parameters, such that for each permutation the conditional probability can expressed in a softmax form as


That is, for position zt in a permutation , g is a function of the elements that precede it and on the position itself (and not the value at that position.) Computing the sequence of query operators g we need to use what the authors call a two stream recurrence.   One sequence is called the content stream and it is computed as


Notice that the content at the next level up depends on the position zt as well as the value xzt . (Notice the <= in the KV pair. While this may seem like an obscure point, it is important that g depend only on the position in the permutation and not the value there because we are trying to compute the condition probability of that value!) We now define the query operator g by


For each permutation we initialized the recurrences with


Where w is a learned parameter and, in the case of strings longer than 512 words, the previous 512 blocks.

If these recurrences look too abstract, here is how they look in a (simplified) version of the attention stack.


What this fails to show is how values are averaged over multiple different permutations.

Demo Time Again!

This demo will be to use XLNet to generate stories. The way this works is that we will start with a sentence fragment and we will add an extra token at the end and get XLNet to predict the next word. We then add that word to the end of the sentence and repeat the process.


We have cast this process as a function which takes a string and an integer to indicate the length of the string we want to make.   XLNet likes to have a long string to start, so we have added a paragraph of PADDING_TEXT at the beginning.

To force the model to generate the new last word, we do a bit of black magic. We need to generate a permutation mask and a target mask. We are using the excellent HuggingFaces library of Transformer implementations. The API for their XLNetHeadModel can be found here. We first add a blank token to the end of the list of ids (this is just a 0).  If the length of the input_ids is m, then the perm_mask is of dimension 1xmxm and all zeros except the last column which is all ones. The target mapping is 1x1xm is all zeros except the last element which is a one. The target mask tells us which outputs we want and, in this case, it is only the last element.

The output is a list whose first element is a tensor of dimension 1x1x3200. This is the vector of token logit values for each word in the 3200 word dictionary.  To select the word, draw a sample from a multinomial distribution based on softmax of the logit vector. Because of this random draw the results are never the same twice when we run the function. Below are some samples.


The source notebook for this demo is xlnet-story-generator.ipynb in github.

Document classification

Document classification with Transformers require you to add a doc classifier head to the basic model.  In the case of both Bert and XLNet the 0th position of the last hidden state can be considered as a summary of the document as a vector of size 765 and a Tanh activation function is applied to that.

The Classifier basically consists of an additional linear layer of size 765xq where q is the number of classes as shown in figure 5 below.


Figure 5. Document classifier uses pooled output from Bert (or XLNet) as input to an additional linear layer to do get final classifier values.

In the experiments that follow we will use Thilina Rajapakse’s Simple transformers library which wraps the standard HuggingFaces library in a way that make the entire process very simple.

Demo—Classifying Scientific Abstracts

The documents we are going to classify are abstracts of papers from Cornel’s ArXIV.   The collection we are going to use is small (7100 paragraphs).   ArXIV is a collection of papers that are submitted by scientists and are self-classified into one or more of several dozen categories.   In our case there are papers from 138 subtopics and we have grouped them into 5 broad categories: Math, Computer Science, Biology, Physics and Finance. These categories are not uniformly represented and to make things very complicated there are many papers that could be classified in several of these broad groups.   This is not surprising. Science is now very multidisciplinary.   It is not unusual to find a mathematics paper about computational biology that uses techniques from computer science.   Or a physics paper that uses neural networks with an application to finance.    We first experimented with this data collection in Nov. of 2015.  We used a very early version of the Azure ML Studio.   The confusion matrix for the result is shown below.   As you can see, it is not very impressive.


We looked at this problem again in Nov of 2017, but this time we used document analysis with the package genism.


And once again the results were underwhelming. If want to classify these documents with a Transformer, we must first train the top layer to fit our data.   These two lines are sufficient to create the model and train it.


The input to the train function is a Pandas data frame with one column the text of the abstract and the second column the classification into one of our five categories. The training on an Intel Core-7 takes about 20 minutes. Creating and training the same model with XLNet takes about 30 minutes.

We trained the model with 4500 of the abstracts and evaluated it with 2600 of the remaining abstracts. To do the evaluation you use


Result is a 2600×5 tensor where each row a vector of length 5 that is the softmax of the model predictions. Wrong predictions is a tensor that described where the model failed.   This information is very interesting and it illustrates the types of “interdisciplinary” confusion that can arise.   Here are three examples of failures.

A Math paper predicted to be Computer Science.


A Physics paper predicted to be Math.


Finally a Finance paper predicted to be Physics


It is the last sentence that is the clue that this is finance related.   We now can use the results output to compute the confusion matrix.    Here are the results for BART and XLNet.


As can be seen the XLNet results were, on average, slightly more accurate and both methods were superior to the older approaches described above.

The interesting data here is the frequency with which Math papers are labeled as Computer Science.   This is largely due to the fact that the major of papers about neural networks are in the computer science category, but there are also many mathematicians that are looking at this topic.

Because we have the softmax ‘probabilities’ for the classification of each document, we can ask about the 2nd most likely choice and compute a “best-of-2” score for each paper.   In other words, an X paper is classified as X if one of the two top predictions is an X.   The results are shown below.   As you can see, it resolves the math-cs confusion as well as bio-physics.


The data and notebook for this demo are available in GitHub.

Final Thoughts

The experiments on document classification are simple illustrations.   The Transformer scientific literature has a host of tests to which all of these methods have been subjected.  The careful reader will notice that in the examples we have illustrated above, we only used the smallest of the available language models of each type.   This was to allow our Notebooks to run in a relatively standard laptop.  Of course XLNet is not the last word in Transformers.   Some of the new models are huge.   For example Microsoft had just announced Turing-NLG, with 17 billion parameters.

One of goals of Turing-NLG and many other Transformer models is to improve performance around some important tasks we did not discuss above. For example, question answering and document summarization.   These are very important examples. For example, the Stanford Question Answering Dataset (SQuAD 2.0) is often cited. Another application of Transformers is uncovering the structure of language expressions. John Hewitt has done some interesting experiments along these lines in Finding Syntax with Structural Probes,   In Oct of 2019 we posted an article “A ‘Chatbot’ for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT” where we discussed the role Transformers are playing in only search and the role they will play in smart digital assistants. We concluded there that it was necessary to extend the analysis from sentences and paragraphs to incorporate additional information from knowledge graphs.   Chen Zhao from Microsoft consider aspects of this problem in the paper Transformer-XH: Multi-Evidence Reasoning with Extra Hop Attention.    We are excited by the progress that has been made in this area and we are convinced that many problems remain to be solved.


Doing Deep Learning in Parallel with PyTorch.

This is a small tutorial supplement to our book ‘Cloud Computing for Science and Engineering.


Machine learning has become one of the most frequently discussed application of cloud computing.  The eagerness of cloud vendors to provide AI services to customers is matched only by their own interest in pushing the state of the art for their own internal use.   In Chapter 10 of our book we discussed several “classical” machine learning algorithms and we introduced some the methods of deep leaning using three different toolkits:  MXNet, Tensorflow and CNTK.   We also described several cloud AI services available on AWS and Azure.    However, one topic that we did not address at all was the training of neural nets that use the parallel computing capabilities available in the cloud.  In this article we will do so using another deep learning toolkit, PyTorch, that has grown to be one of the most popular frameworks.

In the simple tutorial that follows, we will first describe PyTorch in enough detail to construct a simple neural network.   We will then look at three types of parallelism that can be used while training a neural net.  The easiest to use is GPU parallelism based on Nvidia-style parallel accelerators.  We will illustrate this with an example based on the PageRank algorithm.   Next, we will consider distributed parallelism where multiple processes collaborate and synchronize around the training of a single neural network.   The obvious extension of this is to use multiple processes, each with a GPU to accelerate performance.   Finally, we will briefly describe the use of multiple GPUs in a single thread to pipeline the training of a network.

A Tiny Intro to PyTorch.

PyTorch 1.0, which was open sourced by Facebook in 2018, has become one of the standards for deep learning.   The website is well documented with some excellent tutorials, so we will not duplicate them here.   However, to make this readable, we will introduce some basic Torch ideas here and refer to the tutorials for in-depth learning.   PyTorch is deeply integrated with Python. As with all deep-learning frameworks, the basic element is called a tensor. At a superficial level, a PyTorch tensor is almost identical to a Numpy array and one can convert one to the other very easily.  The primary difference is the set of operators possible on a PyTorch tensor and the fact that a tensor can retain the history of operators that created it.   This history can be used to create derivative and gradients that are essential for training neural networks.   We will return to this property later, but first let’s look at some basic tensor properties.

In the example below we create a Numpy array of 5 rows and 3 columns with 1’s for each element and use it to create a PyTorch tensor.


PyTorch is extremely flexible.  For example, we can create a tensor from a python list of values and use this tensor to create a diagonal matrix.


Operations such as matrix-matrix multiply ( and matrix-vector multiply ( are based on reasonable standard algebraic rules. We will illustrate this with our vector z and matrix M from above.  Matrix-Matrix multiply requires 2-d tensors, but we can reshape our vector into a 2-d array.  Once it is 2-D we can multiply it by our matrix.   Or we can do a matrix vector multiply using M and the original 1-d vector z.  The result of the matrix multiply is a 2D 5×1 matrix.   The matrix vector multiply yields a 1-d vector.


Regular multiplication and addition are different.  if you multiply a tensor of the form [a, b]*[c,d] the result is the elementwise product [ac, bd].   In the case one of the operands is a matrix the result is shown below.  Let’s start with M and add 1 to each element then do a point wise multiply.


These are only a few of the PyTorch operators, but they are sufficient for describing the material that follows.

Using the GPU

If you have access to a server with a GPU, PyTorch will use the Nvidia Cuda interface.  Working with the GPU is not very elegant, but it is simple and explicit.   If you have a server with three GPUs, they are named “cuda:0”, “cuda:1”, ‘’cuda:2”.   To do computation on the GPUs you must move all the associated data explicitly to the GPUs.  To create a tensor on “cuda:1”  one writes


If there is only one GPU, the name “cuda” is sufficient.   To move a tensor to the GPU from the CPU memory to the GPU you write


Moving a GPU resident tensor back to the CPU memory one uses the operator .to(‘cpu’).

GPU parallelism:  The PageRank algorithm

To illustrate the programming and behavior of PyTorch on a server with GPUs, we will use a simple iterative algorithm based on PageRank. It is not necessary to understand the derivation below because we will only be concerned with the performance results.   But, for completeness we describe it anyway.

PageRank is an algorithm introduce in 1996 by Larry Page and Sergey Brin to rank web pages in their early version of the Google search engine.   The rank can be considered the relative importance of a page.  Consider a graph G of N nodes.  Let link(i) by the set of nodes that point to node i.   And let out(j) be the number of links out of node j.  The formula for computing the rank Pr(i) for each node i in g is given by


Where the scalar d is called the damping factor and it is a number between 0 and 1.  This formula is for the special case that there is at most one link from a node to another.   (You can read the Wikipedia article to learn more about this equation.)  We can recast this in matrix algebra terms as follows.    Let G(I,j) be the adjacency matrix of G , i.e.  G(i,j) = 1 if there is a link from j to i in G and zero otherwise.   Let Out be a diagonal matrix with the value 1/out(j) on the jth diagonal element when out(j) is not zero and zero otherwise.   The formula above can be written in matrix form as


Where Pr is the vector of Pr(j)’s and the “dot” is matrix or vector multiply depending on the association (G*Out)*Pr or G*(Out*Pr).   (As we shall see the choice can make the difference in computational cost substantial.) As you can see, Pr is closely related to an eigenvector of G*Out.   In fact if d = 1, it is an eigenvector.  To compute it, we can turn this into a simple iteration


Fortunately, this converges to the “principle eigenvector” solution rather rapidly.    We will use this to compare GPU computation to regular CPU computing.

Our goal is to demonstrate how to do PyTorch computation on the GPU and compare the performance to running on the cpu and we will use the iterative page rank algorithm to do this. To build a sample network and adjacency graph, we will use the NetworkX python package to create a slightly modified random (binomial) Erdős-Rényi graph (see this page for details).  The graph was modified so that every node has at least one outgoing node.  To use this graph with PyTorch, we  use the lovely DGL library from library from  NYU and Shanghai  by Minjie Wang, Jake Zhao, Prof. Zheng Zhang and Quan Gan.

We considered 4 different implementations of the page rank algorithm and run them on a single cpu and a single GPU.  (We describe multiple CPUs and GPUs in the neural net example in the next section.)  Version-0 is one that is described in the DGL documents and we will not reproduce it here because it uses many of the features DGL that are not important for our discussion.  We ran it (and the others) on AWS with a p2.xlarge server which has a single NVIDIA K80 GPU.  The graph we created and tested with has N= 10000 nodes.  We used K=200 iterations and a value of d=0.85.   In all our tests the results converged to the same result within roundoff error.  (The complete source for these examples and experiments are available in gitub.)

Version-1 of the algorithm uses the fact that the DGL library of a sparse graph interoperates fully with the PyTorch library.    Hence the PyTorch matrix-matrix multiply and matrix-vector multiply  work when one of the arguments is a sparse matrix representation of our graph.   The core of the algorithm is shown below. This function takes the DGL representation of the graph,  the number K of iterations and a parameter device that is either the text “cuda” or “cpu”.


Our pv vector is created as all ones on the selected device.   Lines 2 and 4 are DGL operators.  In line 2 and 3 we construct the Out array as a Nx1 matrix.  (To simplify things, there is no divide-by-zero here because of the modification to the graph above.)  In line 4 we extract the adjacency matrix as edgesIdmp is a Nx1 matrix that is (1-DAMP)/N for each element where DAMP is the value for d.   It is important to notice we are using the sparse representation of the matrix edges in the matrix-matrix multiply.

Version-2 is almost identical to Version-1  except that we convert the edges matrix to a dense form  prior to executing the loop  as shown below.


Notice that in version-2 above there is a pointwise product pv*Out inside the loop prior to the matrix vector multiply.  Because of the associativity we described above, we can move Out outside the loop and convert it to a diagonal matrix and multiply it with gm as shown in version-3 below.


The performance of all four of these Pagerank algorithms is shown in the table below.  Each algorithm was run with the parameter device set to “cpu” and “cuda”.  As one can see the CPU-only time for version-0 is the second best, but it does not profit from the GPU as much as version 2 and 3.  Version-1 is not competitive at all.   It is most interesting to note the difference between Version-2 and Version-3.  The behavior can be explained easily enough.   Replacing K (200) vector point products with an NxN matrix-matrix multiply is not very smart if N=10000 and matrix multiply is an O(N3) algorithm.  The surprising thing is how well the matrix multiply and matrix vector multiply are optimized on the GPU.   In these cases speed-ups of 2 and 3 orders of magnitude are not uncommon.   Unfortunately, the DGL sparse matrix-vector multiply in Version-1 is not as well optimized.


This example illustrates that the GPU is capable of substantial performance improvements in a matrix-vector computation in PyTorch.  The source code for this example as a Jupyter notebook is in github along with the other examples from this chapter.   Now we return to Neural nets.

Neural Net training with the PyTorch and the GPU

One of the advantages of frameworks like PyTorch is how easy it is to build a neural network.  To illustrate this point, we will build network that takes real, scalar values for x and learns to compute an arbitrary function f(x).   To do this we will train the network by selecting a sequence of values xi in an interval [a, b] for i=0,….N and supplying it with f(xi).   The result will be a network that is very good at computing f(z) for any z in [a, b].   Of course, this is not as exciting as deep neural networks that take inputs x that represent an image and f(x) is a description of the content of the image, but the concept is exactly the same.   The only difference is that when x is a 64×64 pixel color image it is not a single real number but a vector of real numbers in a space of dimension 64x64x3, which is pretty big.  For our purposes, dimension = 1 is sufficient.

We will construct the network as four layers with an input layer of size 80, two middle layers of size 80 and an output layer of size 40.  (80 and 40 are somewhat arbitrary here.  There was no attempted to find optimal values.)  These are linear layers so the connections to the previous ones are complete as illustrated in the figure bellow.


This is an extremely simple type of network that has enough layers we can say it is “deep-ish”.  (A really deep network for a serious imaging problem will have around 50 layers.)   The PyTorch code to specify this network is shown below.


The network is created as a subclass of torch.nn.Module and each instance contains instances of our four layers.  It also contains an instance of a rectified linear operator defined by ReLU(x) = max(0, x).   The method forward(x) defines how an input scalar tensor x is processed through the network.  Each linear layer takes a vector of length 80 and produced an output of the same shape.  However, we apply the ReLU operator to the input of all interior nodes except the input layer.  ReLU is an “activation” function that decides when a “neuron” turns on (in this case when the input is positive).  In more mathematical terms, this allows the network to behave like an arbitrary continuous non-linear function.  Without it, the network is a completely linear transformation.  (The mathematical proof if this “universality” can be seen in the early work of George Cybenko and others.)

To train the network we will provide inputs and  target values and adjust the parameters that define the links. More specifically each layer has an associated matrix W and offset vector b.  The forward(x) function computes the following.   It is the training step determines the “best values” of the Ws and bs.


We use the standard least squares error between the values for Out and the target data to determine “best values” and we use stochastic gradient decent to do the optimization.   But to do this we need to compute the gradient of Out with respect to each of the Ws and bs.  As mentioned previously, torch tensors are capable of recoding the history of their creation and we can work backward to compute the derivate of values computed with them.   To force a tensor to begin a chain of computations we create it with the flag requires_grad=True as in the example below.


Then to compute the gradient of out with respect to x we invoke the backward() operator as follows.


To verify this, we can get out the old derivative chain rule and compute this by hand as


and plug in the numbers to get.


You don’t want to compute the gradient of our neural network output by hand.   Let Backward do it automatically.

Training the network is now easy.   We first gather together values for the input and target.   In our trivial test case we will use the input to be numbers from the interval [0.0, 20.0] and the function is f(x) = sqrt(x)*sin(3.14*x/5.0).

The training loop, in its simplest form, is shown below with an SGD optimizer and the mean squared error loss function.  The optimizer is given the learning rate of 0.001.  Like our choice of 80 for the size of the network layers, this was somewhat arbitrary, and it worked.


Running this with various parameters to make it converge, one can test the network by providing values for the input and recoding the output.  Plotting the result (blue) against the target function (orange) one can see the convergence.


A more conventional approach to the training is to divide the input and target vectors into small “mini-batches” and passing each mini-batch through the loss-gradient-generation-optimization step independently.  This generates substantially faster convergence rates.  As we shall see, this independence also allows us to compute them in parallel and merge the results later.

Basic distributed computing with PyTorch

We have already described how PyTorch exploits GPU parallelism.   If a server has multiple cores or if you have a cluster of servers available, it is also possible to use MPI-like message passing style to coordinate multiple thread of computation.  One must create a master process that forks off child processes that do the work.   One way to do this that works for Jupyter notebooks but not GPUs looks like the following.


The Process function creates an instance of a processes that is passed an initialization function and an identifier “rank”, the total “size” of the collection of threads and a function “run” that will be invoked once the initialization is complete.  The init_process is run on each created thread and it initializes a communication backend that the spawned processes share.   In the case above it is called “gloo” and it works for a single multicore server.   However, one can also MPI or NCCL.  If you have a cluster with Infiniband,  NCCL is the one to use.   We will not discuss NCCL  here. See this page for more details.

If you are running on a multi-GPU system you must  use a different method to launch the processes from a python program and not a Jupyter notebook.   The program looks like


We will call this the mp.spawn(…) version of the launch program.  It is run from the command line and not Jupyter.

In this case the run(rank, size) function must call setup(rank, size) first then then clean-up later.  The Github site has three programs: distributed_learn_multicore-final.ipynb which illustrates the training of our neural net using the first method described above, which uses a server with 8 GPUs to do the training,   and distributed_local.ipynb which illustrates the basic communication primitive.   We  discuss these basic communication methods next and then give some performance results of the distributed learning cases.

We illustrate the communication system with three different “run” function examples.   The first one illustrates basic MPI-style message passing.  This one simply passes a tensor from process 0 to 1 and then from 1 to 2 and then to 3.  Each time it adds 1 to the tensor.


The output from each process is routed to the master and is what you would expect (except the order in which the lines are received are random).  Running this with P=3 child process gives the result below.


PyTorch also support many standard collective communication routines.   For example, to compute the sum of elements from each process we use a reduction operator.   The input is a Torch data structure of identical shape on each process.   The reduction overwrites the input tensor with the sum of the corresponding elements of all processes.


In this case process 0 has a scalar tensor with value 1, process 1 has a tensor with value 2 and process 2 has a tensor with value 3.   The output of the reduction is


The notebook distributed-local.ipynb has complete code for these communication examples.

Parallelizing the Training of our Neural Net

In this section we will describe the results of using two different approaches to parallelize the training phase of our demo neural network.  The first method will be to use a server with multiple GPUs. The second method use multiple cores on a multicore server without GPUs.

The multi-GPU method

In this case we are using an AWS p2.8xlarge server which has 8 NVIDIA K80 GPUs and 16 cores with 488 GBytes of memory.  Because we are using GPUs we have to use the mp.spawn( … ) version of the launch program described above. The network is the same one we described above.   We created a dataset of size 80000.   The variable M is the number of processes and BS is the batch size.  A function batchtodev(rank, device) delivers a list of 80000/(M*BS) unique batches of data to the process identified by the local variable rank.   The run function now takes the form


The initialization function setup(…) is described with the mp.spawn(…) code described above.   The way in which the GPUs are allocated to threads is a bit confusing.   It is designed so that if you have D GPUs and M threads, each thread is assigned a unique list of D/M GPUs.  We only use the first GPU on the in each thread.

The training loop in this version now iterates through each of the elements in the batch list for each epoch.  There is function sync_init_weights( ), a conditional controlling a periodic call to sync_gradients( ) and every 200 epochs we call a special function average_mode( ).   Sync_init_weights() uses a simple broadcast to copy the initial random-state network to the other threads.   While it is possible to allow each thread to create its own converged model they will not converge to anything much because each thread has only one fraction of the full data set.   We need a way to periodically tie the independently evolved models together.  That is done every 200 epochs with the call to average_model( ) which  uses a global reduce sum operation.   See for the details of these functions.

Another approach to blending the results of multiple training threads is to average the gradients from each parallel set of mini-batches before the optimizer step.   This is what the function sync_gradients( ) accomplishes.  But doing this every mini-batch step adds a great deal of thread synchronization overhead. Is it really necessary to do this every step?  Doing it often does improve the rate of convergence, but what happens if we do it periodically?  In the table below, we run the training for our network of size N=80000 with BS = 1000 until the loss is below 0.05.   This is done for doing it every mini-batch step, then for every 2, 4,10 and never mini-batch steps and finally “never” to see what happens if we only rely on average_model().

frequency 1 2 4 10        never
time 330 183 136 99 88
loss 0.039 0.05 0.04 0.025         0.03
Epochs x 1000 4 6 6 7 8
total time 1320 1098 816 693 704

This version was run on an Intel core-i7  with M = 4 threads using the no-GPU method (described in the next section).   The results from run-to-run vary greatly because of the randomness in the processes, but it was clear that doing it every mini-batch step produced excellent convergences, but at high cost.  It must also be noted that the penalty for the overhead depends on the size of the network (which equals the size of the gradient) and many other factors.

Running on 8 GPUs on one server.

Turning to the performance of our training code on the AWS 8-cuda core server, we ran the same configuration (N=80000, BS=1000) but rather than compare convergence rates, we ran it for 10000 epochs for 1 GPU, 2 GPUs, 4 GPUs and 8 GPUs.   At 10000 epochs they were all converged, but using the same number of epochs means that the total amount of work was the same in each case.   The results are below.  If we compare the performance of 1 GPU to 8 GPU we realize a speed-up of 6.74.

gpus 1 2 4 8
elapsed time 934 489 249 139
cuda  speed-up 1 1.91 3.75 6.74

Our test case is extremely small.   Much greater performance gains are possible with greater numbers of GPUs when larger problems are used.

The no-GPU method

This version uses multi-core servers without GPUs.   We ran this on an AWS c5n4xl server with 32 gig of memory with 8 real cpus (16 virtual cpus).  The network is the same one we described above.   We created a dataset of size 80000.   The variable M is the number of processes and BS is the batch size.  As before, A function batchtodev(rank, ‘cpu’) delivers a list of 80000/(M*BS) unique batches of data to the process identified by the local variable rank.   The run function now takes the form


In this case we had a surprise.   Using PyTorch multiprocessing and increasing the number of processes thread did not increase performance.   In fact, it got worse!  The table below is for 8, 4, 2 and 1 processes with the best performance for 1 process.  We also give the results for the Intel Core i7 running ubuntu in a Docker container.

processes 8 4 2 1
cpu elapsed time 4801 3003 1748 1403
cpu speed-up 1 1.7 1.8 2
cuda/cpu speedup 5.1 6.1 7 10.1
corei7-docker 1282 1274 1064 1151

There is an interesting explanation for this behavior.  The fact is that for PyTorch 1.3.0 and 1.3.1, the basic neural net functions (model evaluation, backward differentiation, optimization stepping) are all optimized to use all available cores.   So in the case of one process thread, all 16 cores are dividing the work.   In the case of 2 processes, there are 2 groups of 8 cores each working and the multithreading is pure overhead.  In all cases all cores are busy and the number of process threads only adds overhead.  This was verified by running the Linux “top” command.

Of course, if you run this on a cluster of multicore servers using the MCCI or MPI runtime, one can expect different behavior.  There have been numerous studies of the parallel training performance.  One we like  is Parallel and Distributed Deep Learning , by Vishakh Hegde and Sheema Usmani.

Model Parallelism

In the examples above we have used multiple copies of a neural net running on separate threads and then merged the results together.   As we have shown this works with multiple GPU on a single server, but without a GPU, we did not get performance that beat the native PyTorch built-in multi-core parallelism.

There is another approach to parallelizing the training and model evaluation computation that is in some sense, orthogonal to the method we described above.   This can be very valuable if the model does not fit in a single GPU memory.  One can slice the network so that the first few layers are in one GPU and the rest are in a second GPU.   Then you can feed inputs into the first, transfer the data from the last layer in the first GPU to the first layer in the second GPU.   Of course doing this, by itself, does not buy you any parallelism:  one GPU is idle while the other is work.  However, one can “pipline” batches through the model.   In this way batch0 goes to GPU0 and then to GPU1.  A soon as batch0 clears GPU0, push batch1 to GPU0 while GPU1 is working on batch0.   You need to accumulate the set of batches at the end and them concatenate them together as described in this PyTorch best practices tutorial.

However, one can do even more with pipelining as part of parallelization.  The back propagation algorithm is based on computing the gradients of the error from the last layer to the first layer it is possible to push to gradient back through our pipeline  of GPUs.   A group at Microsoft Research, CMU and Stanford has pushed this idea and produced a very nice paper at SOSP19.   The figure below, from that paper, illustrates the idea.  You distribute the layers among the GPUs (using data parallelism for very dense  layers) and pipeline batches as described above.  Then you also let the back propagation push back up the pipeline. After a startup, there is plenty of work to fill the available gpu slots.   However, to get this to work requires a sophisticated resource scheduler and care must be taken to make sure the model is consistent and converges correct.    Read their paper.   It is very nice.


Figure from