Author Archives: dbgannon

About dbgannon

Dennis Gannon is a computer scientist involved with the application of cloud supercomputing to data analysis for science. From 2008 until he retired in 2014 he was with Microsoft Research as the Director of Cloud Research Strategy. In this role he helped provide access to Azure cloud computing resources to over 300 projects in the research and education community. Gannon is professor emeritus of Computer Science at Indiana University and the former science director for the Indiana Pervasive Technology Labs. His research interests include cloud computing, large-scale cyberinfrastructure, programming systems and tools, distributed computing, parallel programming, data analytics and machine learning, computational science, problem solving environments and performance analysis of scalable computer systems. His publications include more than 100 refereed articles and three co-edited books. Gannon received his PhD in computer science from the University of Illinois Urbana-Champaign and a PhD in mathematics from the University of California, Davis.

Moving Cloud AI to the Edge

Edge computing describes the movement of computation away from cloud data centers so that it can be closer to instruments, sensors and actuators where it will be run on “small” embedded computers or nearby “micro-datacenters”.  The primary reason to do this is to avoid the network latency in cases where responding to a local event is time critical.  This is clearly the case for robots such as autonomous vehicles, but it is also true of controlling many scientific or industrial apparatuses.  In other cases, privacy concerns can prohibit sending the data over an external network.

We have now entered the age where advances in machine learning has made it possible to infer much more knowledge from a collection of the sensors than was possible a decade ago.    The question we address here is how much deep computational analysis can be moved to the edge and how much of it must remain in the cloud where greater computational resources are available.

The cloud has been where the tech companies have stored and analyze data.  These same tech companies, in partnership with the academic research community, have used that data to drive a revolution in machine learning.  The result has been amazing advances in natural language translation, voice recognition, image analysis and smart digital assistants like Seri, Cortona and Alexa.   Our phones and smart speakers like Amazon Echo operate in close connection with the cloud.   This is clearly the case when the user’s query requires a back-end search engine or database, but it is also true of the speech understanding task.  In the case of Amazon’s Echo, the keyword “Alexa” starts a recording and the recorded message is sent to the Amazon cloud for speech recognition and semantic analysis.   Google cloud, AWS, Azure, Alibaba, Tencent,  Baidu and other public clouds all have on-line machine learning services that can be accessed via APIs from client devises.

While the cloud business is growing and maturing at an increasingly rapid rate, edge computing has emerged as a very hot topic.  There now are two annual research conferences on the subject: the IEEE Service Society International conference on Edge computing and the ACM IEEE Symposium on Edge computing.  Mahadev Satyanarayanan from CMU, in a keynote at the 2017 ACM IEEE Symposium and in the article “The Emergence of Edge Computing” IEEE Computer, Vol. 50, No. 1, January 2017, argues very strongly in favor of a concept called a cloudlet which is a server system very near or collocated with edge devices under its control.   He observes that applications like augmented reality require real-time data analysis and feedback to be usable.  For example, the Microsoft Hololens mixed reality system integrates a powerful 32bit Intel processor with a special graphics and sensor processor.   Charlie Catlett and Peter Beckman from Argonne National Lab have created a very powerful Edge computing platform called Waggle (as part of the Array of Things project) that consists of a custom system management board for keep-alive services and a powerful ODROID multicore processor and a package of instruments that measure Carbon Monoxide, Hydrogen Sulphide, Nitrogen Dioxide, Ozone, Sulfur Dioxide, Air Particles, Physical Shock/Vibration, Magnetic Field, Infrared Light, Ultraviolet Intensity, RMS Sound Level and a video camera.  For privacy reasons the Waggle vision processing must be done completely on the device so that no personal identifying information goes over the network.

Real time computer vision tasks are among the AI challenges that are frequently needed at the edge.  The specific tasks range in complexity from simple object tracking to face and object recognition.   In addition to Hololens and Waggle there are several other small platforms designed to support computer vision at the edge.   As shown in Figure 1, these include the humble RaspberryPi with camera,  the Google vision kit and the AWS DeepLens.

fourdevices

Figure 1.  From the left is a RaspberryPi with an attached camera, ANL Waggle array, the Google AIY vision kit and the AWS DeepLens.

The Pi system is, by far, the least capable with a quad core ARMv7 processor and 1 GB memory.   The Google vision kit has a Raspberry Pi Zero W (single core ARMv7 with 512MB memory) but the real power lies in the Google VisionBonnet which uses a version of the Movidius Myriad 2 vision processing chip which has 12 vector processing units and a dual core risc cpu.   The VisionBonnet runs TensorFlow from a collection of pretrained models.    DeepLens has a 4 megapixel camera, 8 GB memory, 16 GB storage and an intel Atom process and Gen9 graphics engine which supports model built with Amazon SageMaker that is pre-configured to run TensorFlow and Apache MXNet.

As we stated above many applications that run on the edge many must rely on the cloud if only for storing data to be analyzed off-line.   Others, such as many of our phone apps and smart speakers, use the cloud for backend computation and search.   It may be helpful to think of the computational capability of edge devices and the cloud as a single continuum of computational space and an application as an entity that has components distributed over both ends.   In fact, depending upon the circumstances parts of the computation may migrate from the cloud to the device or back to optimize performance.   As illustrated in Figure 2, AWS Greengrass accomplishes some of this by allow you to move Lambda “serverless” functions from the cloud to the device to form a network of long running functions that can interact with instruments and securely invoke AWS services.

greengrass

Figure 2.   AWS Greengrass allows us to push lambda functions from the cloud to the device and for these functions to communicate seamlessly with the cloud and other functions in other devices.  (Figure from https://aws.amazon.com/greengrass/ )

The Google vision kit is not available yet and DeepLens will ship later in the spring and we will review them when they arrive.   Here we will focus on a few simple experiments with the Raspberry Pi and return to these other devices in a later post.

Deep Learning Models and the Raspberry PI 3.

In a previous post we looked at several computer vision tasks that used the Pi in collaboration with cloud services.   These included simple object tracking and doing optical character recognition and search for information about book covers seen in an image.   In the following paragraphs we will focus on the more complex task of recognizing objects in images and we will try to understand the limitations and advantages of using the cloud as the backend computational resource.

As a benchmark for our experiments we use the Apache MXNet deep learning kit with a model based on the resnet 152-layer neural network that was trained on a  collection  of over a  10 million images and over 11 thousand labels.    We have packaged this MXNet with this model into a Docker container dbgannon/mxnet which we have used for these experiments.  (the details of the python code in the container are in the appendix to this blog.

Note: If you want to run this container and if you have dockerand Jupyter installed you can easily test the model with pictures of your own.  Just download the jupyter notebook send-to-mxnet-container.ipynb and follow the instructions there.

How fast can we do the image analysis (in image frames per second)?

Running the full resnet-152 model on an installed version  of MXNet  on more capable machines (Mac mini and the AWS Deeplearning AMI c5.4xlarge, no GPU)  yields an average performance of about 0.7 frame/sec.     Doing  the same experiment on the same machines, but using the docker container and a local version of the Jupyter notebook driver we see the performance degrade a bit to an average of about 0.69 frame/sec (on a benchmark set of images we described in the next paragraph).    With a GPU one should be able to go about 10 times faster.

For the timing tests we used a set of 20 images from the internet that we grabbed and reduced so they average about 25KB in size.  These are stored in the Edge device.  Loading one of these images takes about the same amount of time as grabbing a frame from the camera and reducing it to the same size.  Two of images from the benchmark set and the analysis output is shown in figure 3 below.

bugs

Figure 3. Two of the sample images together with the output analysis and call time.

How can we go faster on the Pi 3?   We are also able to install MXNet on the Pi 3, but it is a non-trivial task as you must build it from the source.   Deployment details are here, however, the resnet 152 model is too large for the 1MB memory of the Pi 3, so we need to find another approach.

The obvious answer is to use a much smaller model such as the Inception 21 layer network which has a model database of only 23MB (vs 310MB  for resnet 152), but it has only 1000 classes vs   the 11,000 of the full rennet 152. We installed Tensorflow on the Pi3. (there are excellent examples of using it for image analysis and recognition provided by Matthew Rubashkin of Silicon Valley Data Science.) We ran the Tensorflow Inception_2015_12_05 which fit in memory on the Pi.  Unfortunately, it was only able to reach 0.48 frames per second on the same image set described above.

To solve the, we need to go to the cloud.    In a manner similar to the Greengrass model,  we will have the Pi3 sample the camera and downsize the image and send it to the cloud for execution.   To test it we ran the MXNet container on a VM in AWS and pointed the Pi camera at various scenes.  The results are shown in Figure 4.

pipict1

pipict2

Figure 4.  The result for the toy dinosaur result is as it is logged into the AWS DynamoDB.   With the bottom two images show only the description string.

The output of the model gives us likelihood of various labels.  In a rather simple minded effort to be more conversational we translate the likelihood results as follows.  If a label X is  more than 75% likely  the container returns a value of “This certainly looks like a X”.   If the likely hood is less than less than 35%  it returns “I think this is an X, but I am not sure” (the code is below).    We look at the top 5 likely labels and they are listed in order.

The Pi device pushes jpeg images to AWS S3 as a blob.   It then pushes the metadata about the image (a blob name and time stamp) to the AWS Simple Queue Service.   We modified the MXNet container to wait for something to land in the queue. When this happens, it takes the image meta data and pulls the image from S3 and does the analysis and finally stores the result in an AWS DynamoDB table.

However we can only go as fast as we can push the images  and metadata to the cloud from the Pi device.   With repeated tries we can achieve 6 frames/sec.   To speed up the analysis to match this input stream we spun up a set of analyzers using the AWS Elastic Container Service (ECS).  The final configuration is shown in Figure 5.

mxnet-cluster

Figure 5.   The full Pi 3 to Cloud image recognition architecture. (The test dataset is shown in the tiny pictures in S3)

To conduct the experiments, we included a time stamp from the edge device with the image metadata. When the MXNet container puts the result in the DynamoDB table it includes another timestamp.  This allows us to compute the total time from image capture to result storage for each image in the stream. If the device sends the entire collection as fast as possible then the difference between the earliest recorded time stamp and the most recent gives us a good measure of how long it takes to complete the entire group.

While the Pi device was able to fill S3 and the queue at 6 frames a second having only one MXNet container instance yielded the result that the total throughput was only about 0.4 frames/sec.   The servers used to host the container are relatively small.   However, using the ECS it is trivial to boost the number of servers and instances.  Because of the size of the container instance is so large only one instance can fit on each of the 8 GB servers.   However as shown in Figure 6 we were able to match the device sending throughput with 16 servers/instances.  At this point messages in the queue were being consumed as fast as they were arriving. Using a more powerful device (a laptop with a core I7 processor) to send the images we were about to boost the input end up to just over 7  frames per second and that  was matched with 20 servers/instances.

framespersec

Figure 6.  Throughput in Frames/second measured from the Pi  device to the final results in the DynamoDB instance.   In the 20 instance case, a faster core I7 laptop was used to send the images.

Final Thoughts

This exercise does not fully explore the utility of AI method deployed at the edge or between the edge and the cloud.  Clearly this type of full object recognition at real-time frame rates is only possible if the edge device has sophisticated accelerator hardware.  On the other hand, there are many simple machine learning models that can be used for more limited applications.   Object motion tracking is one good example.  This can be done in real-time.    This is typically done by comparing a frame to a previous one and looking for the differences.   Suppose you need to invoke fire suppression when a fire is detected.   It would not be had to build a very simple network that can recognize fire but not simple movement of ordinary objects.   Such a network could be invoked whenever movement is detected and if it is fire the appropriate signal can be issued.

Face detection and recognition is possible with the right camera.   This was done  with the Microsoft Xbox-1 and it is now part of the Apple IPhone X.

There are, of course, limits to how much we want our devices to see and analyze what we are doing.   On the other hand it is clear that advances in automated scene analysis and “understanding” are moving very fast.   Driverless cars are here now and will be commonplace in a few years.   Relatively “smart” robots of various types are under development.   It is essential that we understand how the role of these machines in society can benefit the human condition along the lines of the open letter from many AI experts.

 

Notes about the MXNet container.

The code is based on a standard example of using MXNet to load a model and invoke it.   To initialize the model, the container first loads the model files into the root file system.  That part is not show here. The files are full-resnet-152-0000.parms (310MB), full-resnet-152-symbols.json (200KB) and full-synset.txt (300KB) . Once loaded into into memory  the full network is well over 2GB and the container requires over 4GB.

Following the load, the model is initialized.

import mxnet as mx
# 1) Load the pretrained model data
with open (' full-synset.txt ','r ') as f:
        synsets = [l.rstrip() for l in f]
sym, arg _params , aux_pa ram s = mx . model .load _checkpoint( 'full-resnet-152' ,0)
# 2) Build a model from the data
mod = mx.mod .Module (symbol =sym , context =mx. gpu ())
mod. bind ( for_training =False, data_shapes=[( 'data ',(1,3,224,224))])
mod. set_params ( arg_params , aux_params )

The function used for the prediction is very standard.  It takes three parameters: the image object, the model and synsnet (the picture labels).   The image is modified to fit the network and then fed to the forward end.  The output is a Numpy array which is sorted and the top five results are returned.

def predict(img, mod, synsets):
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)  
    img = img[np.newaxis, :] 
    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    result = []
    for i in a[0:5]:
          result.append( [ prob[i], synsets[i][synsets[i].find(' '):]])
    return result

The container runs as a webservice on port 8050 using the Python “Bottle” package.  When it receives a web POST message to “call_predict” it invokes the call_predict function below.  the image has been passed as a jpeg attachment with is extracted with the aid of the request package.   It is saved in a temporary file and then read by the OpenCV read function.  Unfortunately there was no way to avoid the save followed by read because of limitations to the API.  However we measured the cost of this step and it was less than 1% of the total time of the invocation.

The result of the predict function is a two dimensional array with each row consisting of a probability and the associated label.   The call returns the most likely labels as shown below.

@route('/call_predict', method='POST')
def call_predict():
    t0 = time.time()
    result = ''
    request.files.get('file').save('yyyy.jpg', 'wb')
    image = cv2.cvtColor(cv2.imread('yyyy.jpg'), cv2.COLOR_BGR2RGB)
    t1 = time.time()
    result = predict(image,  mod, synsets)
    t2 = time.time()
    answer = "i think this is a "+result[0][1]+" or it may be a "+result[1][1]
    if result[0][0] < 0.3:          answer  = answer+ ", but i am not sure about this."     if result[0][0] > 0.6:
         answer = "I see a "+result[0][1]+"."
    if result[0][0] > 0.75:
         answer = "This certainly looks like  a "+result[0][1]+"."
    answer = answer + " \n total-call-time="+str(t2-t0)
    return(answer)
run(host='0.0.0.0', port=8050)

The version of the MXNet container used in the ESC experiment replace the Bottle code and call_predict with loop that polls the message queue, pulls a blob from S3 and pushes the result to DynamoDB

Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud

 

In a previous post we described Algorithmia,  a cloud service for discovering, invoking and deploying algorithms.  In this short article we look at Algorithmia as a tool to deploy trained machine learning models.   We used a tool called Gensim to build a model of scientific documents and then create an Algorithmia service that uses the model to predict the topic categories of scientific article.

A Review of Word Vectors and Document Vectors.

This technology has been around for a while now, so this review is more of a history lesson and not a deep technical review.  However, we will give you links to some of the important papers on the subject.

If you want to do document analysis through machine learning, you need a way to represent words in a vector form.  Given a collection of documents you can extract all the words to create a vocabulary.  If the size of the vocabulary is 100,000 words, you can represent each word as a “one-shot” vector in which the i-th word in the vocabulary is a vector of zeros except for a 1 in the i-th position.   Then if your then each document in your collection can be represented as the sum of vectors corresponding the words in that document.   If you have M documents, then the collection is represented by the sparse matrix of size M x 100,000.  Using this “bag of words” representation, there are a variety of traditional techniques such as Latent Sematic Analysis that can be used to extract the similarities between documents.

About five years ago, a team from Google found a much better way to create vectors from words so that words that are used in similar semantic context are nearer to each other as vectors.   Described in the paper by Tomas Mikolov et. all.,  the method, often referred to as Word2Vec, can be considered a map m() of our 100,000 dimension space of word to a dense space of much smaller dimension, say 50, with some remarkable properties.   In particular, there is the now-famous analogy linearity relationships. For example “man is to king as woman is to queen” is expressible (approximately) as

m( king) – m(man) + m(woman)   ≈   m(queen)

There is an excellent set of technical explanations of why Word2Vec work on Quora and we won’t go into them here.  One of the best papers that address this issue is by Golberg and Levy.

Le and Mikolov have shown that the basic methods of Word2Vec generalized to paragraphs, so that we now have a map p() from a corpus of paragraphs to vectors. In other words, given a corpus of documents D of size N, then for any doc d in D, p(d) is a vector of some prespecified length that “encodes” d.  At the risk of greatly oversimplifying, the paragraph vector is a concatenation of a component that is specific to the paragraph’s ID with word vectors sampled from the paragraph.   (As with Word2Vec, there are actually two main versions of this model.  Refer to the Le and Mikolov paper for details.)  It turns out that the function p can be extended to arbitrary documents x so that p(x) is an “inferred” vector in the same space vector space.  We can then use p(x) to find the documents d such that p(d) is nearest to p(x).      If we know how to classify the nearby documents, we can make a guess at the classification of x.  That is what we will do below.

Using Doc2Vec to Build a Document Classifier

Next we will use a version of the Paragraph vectors from Gensim’s Doc2Vec model building tools and show how we can use it to build a simple document classifier.  Gensim is a product of  Radim Řehůřek’s RaRe Technologies.    An excellent tutorial for Gensim is this notebook from RaRe.  To initialize Gensim Doc2vec we do the following.

import gensim
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)

This creates a model that, when trained will have vectors of length 50.  The training will use 2 word minimum from each doc for each iteration and there will be 55 iterations.

Next we need to ready a document corpus.   What we will use is 7000 science journal article abstracts from the Cornell University archive ArXiv.org . We have recorded the titles, abstracts and the topic classifications assigned by the authors.   There are several dozen topic categories but we partition them into five major topics: physics, math, computer science, biology and finance.      We have randomly selected 5000 for the training set and we use the remainder plus another 500 from recently posted papers for testing.  We must first convert the text of the abstracts into the format needed by Doc2Vec.  The files are “sciml_train” and “sciml_test”.   The function below preprocesses each of the document abstracts to create the correct corpus.

def read_corpus(fname, tokens_only=False):
        with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
             for i, line in enumerate(f):
                   doc = gensim.utils.simple_preprocess(line)
                   if tokens_only:
                       yield doc
                   else:
                       # For training data, add tags
                       yield gensim.models.doc2vec.TaggedDocument(d, [i])
train_corpus = list(read_corpus("sciml_train"))
test_corpus = list(read_corpus("sciml_test", tokens_only=True))

We next build a vocabulary from the words in the training corpus.   This is a dictionary of all the words together with the counts of the word occurrences.  Once that is done we can train the model.

model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

The training takes about 1 minutes and a simple 4-core server. We can now save the model so that it can be restored for use later with the Python statement model.save(“gensim_model”). We will use this later when building the version we will install in Algorithmia.
The model object contains the 5000 vectors of length 50 that encode our documents. To build our simple classifier we will extract this into an array mar of size 5000 by 50 and normalize each vector to be of unit length. (The normalization will simplify our later computations.)

import Numpy as np
mar = np.zeros((model.docvecs.count, 50))
for i in range(m.count):
    x = np.linalg.norm(model.docvecs[i])
    mar[i] = model.docvecs[i]/x

An interesting thing to do with the mar matrix is to visualize it in 2-d using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The result is shown in the figure below. The points have been color coded based on topic: 1(dee purple) = “math”, 2(blue gray) = “Physics”, 3(blue green) = “bio”, 4(green) = “finance” and 5(yellow) = “compsci”.

tsne-word2vec2

There are two things to note here.   First, the collection is not well balanced in terms of numerical distribution.  About half the collect is physics and there are only a small number of bio and finance papers.   That is the nature of academic science:  lots of physicists publishing papers and not so many quantitative finance or quantitative bio papers in the open literature.   It is interesting to note that the Physics papers divide clearly into two or  three clouds.  (it turns out these separate clouds could be classed as “astrophysics”  and “other physics”.)  Computer science and math have a big overlap and bio has a strong overlap with cs because these are all “quantitative bio” papers.

The classification algorithm is very simple. Our model has a function infer_vector(doc) that will use stochastic methods to interpret the doc into the model vector space. Using that inferred vector we can compute the nearest k documents to it in the model space with the function below.

def find_best(k, abstract):
	preproc = gensim.utils.simple_preprocess(abstract)
	v = model.infer_vector(preproc)
       v0 = v/np.linalg.norm(v)
	norms = []
 	for i in range(5000):
     		norms.append([np.dot(v0,mar[i]), i])
       return norms[0:k]

The dot product of the two normalized vectors is the cosine distance. Because the infer_vector is stochastic in nature, our final version of the classifier calls the find_best ten times and computes an average ranking. (The details are in this notebook. and an Html version.) Selecting one of the more recent abstracts and subjecting it to the classifier gives the result pictured below.

sample-classifier1

The analysis gives the abstract a score of 80 for computer science and 20 for bio. Note that the title contains the detailed ArXiv category, so we see this is correct, but reading the article it could also be cross listed as bio.
On the other hand, there are many examples that easily confuse the system. For example, the one below is classified as quantitative biology in arXiv, but the system can’t decide if it is math, cs or physics.

gensim_result2.JPG

In general we can take the highest ranking score for each member of the test set and then compute a confusion matrix.  The result is shown below.  Each row of the table represents the percent of the best guesses from the system for the row label.

gensim_confusion

One interesting observation here is that in the cases where there is an error in the first guess, the most common mistake was to classify an abstract as mathematics.

Moving the model to Algorithmia

Moving the model to Algorithmia is surprisingly simple.  The first step is to create a data collection in the Algorithmia data cloud.   We created one called “gensim” and it contains the three important files: the gensim model,  topicdict, the dictionary that translates ArXiv topics to our major topics, and the ArXiv topics associated with each of the training documents.   The Algorithmia collection is shown below.  We also loaded the training document titles but they are not necessary.

gensim_algorithmia_data

The main difference between running a trained model in Algorithmia and that of a “normal” algorithm is the part where you load the model from the data container.   The skeleton of the python code now includes a function load_model()which you write and a line that invokes this function as shown below.  Now when your algorithm is loaded into the microservice it first calls the load_model()before invoking the apply(input) function.   For all subsequent invocations of you algorithm while it running in that microservice instance the model is already loaded. (The full source code is here. )

import Algorithmia
import gensim
From gensim.models.doc2vec import Doc2Vec
client = Algorithmia.client()
def load_model():
    file_path = 'data://dbgannon/gensim/gensim_model'
    file_path = client.file(file_path).getFile().name
    model = Doc2Vec.load(file_path)
    # similarly load train_sites and topicdict
    # and create mar by normalizing model data
    return model, mar, topicdict, train_sites

model, mar, topicdict, train_sites = load_model()
def find_best_topic(abstract):
    #body of find_best_topic
def apply(input):
    out = find_best_topic(input)
    return out

Deploying the algorithm follows the same procedure as before.   We add the new algorithm from the Algorithmia portal and clone it.   Assuming the SciDocClassifier.py contains our final version of the source,  we execute the following commands.

git add SciDocClassifier.py
git commit -m "second commit"
git push origin master

Returning to the Algorithmia portal, we can go to the project source editor. From there we need to add the code dependencies. In this case, we wanted exactly the same versions of gensim and Numpy we used in our development environment. As shown below that was easy to specify.

algorithmia-dependencies

The final version has been published as dbgannon/SciDocClassifer and is available for anyone to use.  Once again, our experience with using Algorithmia’s tools have been easy to use and fun to experiment with. There are many algorithms to try out and a starter free account is all you need.

Algorithmia™: A Cloud Marketplace for Algorithms and Deep Learning

 

One area of great frustration encountered by application developers involves the challenge of integrating new algorithms into a code base.  There are many reasons for this.   For example, the algorithm may be described in a journal article where many details of the implementation are omitted or it is available only in a programming language different from the one being used.  The code may have software dependencies that are hard to resolve.  The new algorithm may also have hardware dependencies, such as reliance on a GPU to get performance and you may not have access to this hardware.  On the other hand, if you are the author of a great new algorithm you may be disappointed that your new invention is not being used for these very same reasons.     

About 18 months ago a company called Algorithmia™  was founded in Seattle that provides an elegant solution to these problems.  They provide a very simple multi-language API that can be used to invoke any of their catalog of 3,500 different cloud-based algorithms. While we may be getting tired of reading about X-as-a-Service for different versions of X, there is one binding for X that has been around for a while in various forms and, as much as it pains me to do so, it begs to be called Algorithms as a Service.   And AaaS is just one of the things Algorithmia provides.      

AaaS is indeed not a new idea.  Jack Dongarra and his ICL team at the University of Tennessee created NetSolve/GridSove in 2003  to provide scientists and engineers with access to state-of-the-art numerical algorithms running on a distributed network of high performance computers.   As cool as NetSolve is, Algorithmia goes several steps beyond this concept. 

One of Algorithmia’s cofounders and CEO,  Diego Oppenheimer has a deep background in building business intelligence tools.   While working on that he developed an appreciation of the power of being able to call out to powerful algorithms from inside a user facing application.  This capability allows the application to have access to deeper knowledge and more powerful computational resources than available on the user’s device.  A key insight from this experience is that algorithms must be discoverable an invokable from any user application runtime.   These ideas are all central to Algorithmia.  In the following paragraphs we will look at Algoritmia’s marketplace,  explore building a new  algorithm and discuss a bit of the system microservice architecture. 

Algorithmia is a marketplace.  

There are over 50,000 developers that use Algorithmia services and the platform encourages these developers to contribute new algorithms to the collection.   Invoking an algorithm is dead simple and it can be done from any programming language that can formulate a JSON doc and send a REST message.   We will provide some detailed illustrations at the end of this document.  

To use it, you need to set up an account.   Doing so will get you a starter award of 5000 or so “credits”.   When you invoke an algorithm, credits are deducted from your account.   Typically, there is a “royalty” cost of about 10 credits and then the cost is usually around one credit per second of execution.   A fun example from their library of deep learning collection is an image colorizer.   Input is a PNG file of a black and white image and the returned value is a link to the output colorized image.  We took a color image from a visit to Red Square a few years ago.   We converted it to a grayscale image and gave that to the colorizer.  The result is shown illustrated below.  The original is on the left, grayscale in the middle and the colorized image on the right.   While it is not as good as the best hand-colored photos, it is not too bad.     It lost the amazing color of St. Bazil’s Cathedral which is not too surprising,  but it was great with sky and skin tones of those people in foreground.   (It seemed to think the bricks of the square would look better with some grass color.)

colorized

The Python code to upload the grayscale image and invoke the service was incredibly simple.

import Algorithmia
client = Algorithmia.client(‘youruserkeyfromaccountrecation’)
input = bytearray(open("path_to_grayscale.png", "rb").read())
result = client.algo("deeplearning/ColorfulImageColorization/1.1.6")
        .pipe(input).result
path_to_local_copy_of_result_image= client.file(result[‘output’]).getFile()

The cost in credits was 154.   The exchange rate for credits is 1$ = 10,000 credits (approximately) so this invocation would have cost about 1.5 cents.  

This algorithm is from their extensive machine learning and AI collection.  A related algorithm is one that computes the salience of objects in an image.  Salience is the degree to which an object in the image attracts the attention of the viewer’s eye.   The algorithm is called SalNet and it is based on ideas from the paper, Shallow and Deep Convolutional Networks for Saliency Prediction by Pan et. al.  (see arXiv:1603.00845v1).

As with the colorizer, salnet it is easy to invoke.

input = { "image": "data://.algo/deeplearning/SalNet/perm/an-uploaded-image.png" }
result2 = client.algo("deeplearning/SalNet/0.2.0").pipe(input).result

Note that in this case we have loaded the image from one that we uploaded to Algorithmia’s data cloud.  In fact, it is the same grayscale image of red square.  As you can see below, the algorithm picks out the woman in the foreground and also notices the church in the background. 

Salience computation can be very helpful in identifying and labeling objects in an image.   Image tagging is also something that Algorithmia supports.   Running the same image through their tagger returned the observations that the image was “safe” and that there were multiple boys and multiple girls and sky and clouds and it seem to be near a palace. 

salience2

There are many other AI related image algorithms such as nudity detection, character recognition, face detection and a very impressive car make and model recognition algorithm.   A quick look at https://algorithmia.com/use-cases will show many other fascinating use cases. 

Another very cool capability of Algorithmia is its ability to host your trained machine learning model.  Suppose you have a model you have built with MsXNet, TensorFlow, Scikit-Learns, CNTK or any of the other popular ML frameworks, you can upload your model to Algorithmia so that it can be available as a service.   This is explained in here. We will explore this capability in a later post.

While the main emphasis and attraction of the Algorithmia collection is machine learning and AI, there are many more algorithm categories represented there.  For example, there is an excellent collection of utilities for managing data and making certain programming tasks extremely easy: such as extracting text from web pages, Wikipedia search tools, computing the timezone and elevation from lat, lon coordinates.

There is also a large collection of time series analysis algorithms.   These include forecasting, outlier detection, Fourier filters, auto-correlation computation and many more.

Algorithmia is cloud of microservices

In an excellent talk at the 2017 Geekwire cloud summit, Oppenheimer described some key elements of Algorithmia’s architecture.  In this talk he makes the critically important observation that two phases of machine learning,  training and prediction, if used in production require very different execution environments.   Training is often done on a dedicated system consuming many hours of compute and as much memory as is available.   The result of training is a model codified as data.   Prediction (also called Inference) uses the model to make predictions or inferences about a sample case.   Prediction can be done on the same hardware platform that was used for the training, but if the model is to be used to make predictions concerning thousands of cases for thousands of concurrent users,  one need a completely different design.  

Their approach to the scale problem for predictions (and for any high demand algorithm in their collection) is based on serverless microservices.    They use a Kubernetes microservice foundation with algorithms deployed in Docker containers.  Requests from remote client applications are load balanced across API servers who dispatch requests to container instances for the requested function.  The challenge is making the latency from request to reply very low.  If a container for an algorithm is already in system memory, it requires very little time to spawn a new instance on Kubernetes.  Another technique they use it to dynamically load algorithms into running containers.  (We don’t know the exact mechanism Algorithmia uses here, but we expect it is exploiting these facts.) 

They have made some very interesting optimizations.   For example, if the data used in the computation is stored in one of their cloud regions, the docker instance will be instantiated nearby.   Just as important, if an algorithm invokes another algorithm they will attempt to co-locate the two containers and reduce the inter-process latency.  Composability of algorithms is one of their guiding concepts.  

Turning your own algorithm into a microservice

The process of turning your own algorithm into a microservice is remarkably simple.   From the Algorithmia portal there is a “+” symbol in the upper right-hand corner.   This give you a dialog box to fill out.   You provide a name of your algorithm, the programming language you are using (from a long list .. but sorry, no Fortran or Julia but there are lots of alternatives), and several other choices including: your source license policy, does your function invoke other Algorithmia functions, does your function invoke things on the open internet?

Answering these questions causes Algorithmia to create a nice GitHub repo for your function.   Your next step is to install the Algorithmia command line interface and then you can clone your functions GitHub repo.  Once you have done that you can edit the function so that it does what you want.   The basic skeleton is already there for you in the “src” directory.   Here is the basic skeleton in Python rendered as a hello world function.

import Algorithmia
# API calls will begin at the apply() method, 
# with the request body passed as 'input'
# For more details, see algorithmia.com/developers/algorithm- 
# development/languages
def apply(input):
    return "hello {}".format(input)

You can edit the function directly from an editor built into the Algorithmia portal or, now that you have a clone of the repo you can use your own tools to transform this skeleton into your algorithm.   If you have done this work on your clone you need to use the Github commands to push your code back to the master.

We tried this with a small experiment.   We built a function called KeyPhrases that takes English language text as input and breaks it down into subjects (s), actions (a) which are like verb clauses and objects (o).   The algorithm is not very useful or sophisticated.   In fact, it uses another Algorithmia microservice  called Parsey McParseface which was originally released by Goolge (see https://arxiv.org/ pdf/1603.06042v1.pdf) .   This is truly a deep parser that build a very sophisticated tree.  For example the figure below illustrates the tree for a pars of the sentence

Einstein’s general theory of relativity explains gravity in terms of the curvature of spacetime.

parsey-einstein

Parsey McParseface tree output.

Our function KeyPhrases walks the tree and groups the terms, subjects(s), objects(o) and actions (a) and returns a JSON document with the original string and the list of phrases.  It also breaks out separate subphrases with “/” marks.  In this case it returns

{"phrases":[
       "s: Einstein's general theory /of relativity ",
       "a: explains ",
       "s: gravity /in terms /of the curvature /of spacetime. "
        ],
  "text":"Einstein's general theory of relativity explains gravity in terms of the curvature of spacetime."
}

A more complex example is

Facebook Incs chief security officer warned that the fake news problem is more complicated to solve than the public thinks.

The phrase output is

['s: Facebook Incs chief security officer ',
 'a: warned ',
 'o: that the fake news problem ',
 'a: is more ',
 'o: complicated and dangerous /to solve /than the public thinks ']

This is clearly not as rich in detail as the Parsey output, but it does extract some useful key phrases. 

To complete the creation of the microservice for this algorithm one need only issue the git commands

$ git add src/KeyPhrases.py
$ git commit -m "added src mods"
$ git push origin master

The last push causes a compile step to happen and the microservice is now created.   Algorithmia also provides an easy template to add documentation and instructions about how to invoke your function.  From the Algorithmia editor there is a function that allows you to “publish” your algorithm.   After pushing that button, the KeyPhrase example was put in their library.   You can see it here: https://algorithmia.com/algorithms/dbgannon/KeyPhrases (If you use it, remember it has not been tested very well, so it may break.)

Algorithmia as an enterprise platform

The Algorithmia serverless microservice platform is robust enough that they offer it as an enterprise product.   This allows enterprises to host their own version on one of the public clouds or on their own clusters or across multiple cloud in a hybrid system.    This allows their own internally used algorithm to be hosted and invoked by their in-house analytics tools and pipelines in a totally scalable way.   This enterprise version comes with a management dashboard and monitoring tools.

Conclusions

Algorithmia is a fascinating company with very interesting products.   It is extremely easy to sign up for a free account and it is fun to use.   The team was extremely helpful when we had questions.  A Jupyter Notebook with some of the examples mentioned above will be posted very soon.   We found experimenting with the various algorithms from an interactive notebook was a pleasure.   Creating the hosted version of the KeyPhrases algorithm took less than an hour after the original python code was debugged.   In our next experiment we will explore hosting deep learning models with Algorithmia.

Observations about Serverless Computing

With a few examples from AWS Lambda, Azure Functions and OpenWhisk.

Cloud computing is going through an interesting evolution.  It has gone from a platform for deploying virtual machines to planet-scale systems with extensive collections of data storage, analysis and machine learning services.   Most recently we have seen the emergence of “cloud native” computing, which in its most basic form involves a design pattern of microservices where big applications are decomposed into hundreds of basic stateless components that run on  clusters managed by tools like Kubernetes, Mesos and Swarm.

Serverless computing is the next step in this evolution.  It addresses the following challenges.  Suppose I have a small computation that I want to run against some database at the end of each month. Or suppose I want to have the equivalent of a computational daemon that wakes up and executes a specific task only when certain conditions arise.  For example, when an event arrives in a particular stream or when a file in a storage container has been modified or when a timer goes off.  How do I automate these tasks without paying for a continuously running server?   Unfortunately, the traditional cloud computing infrastructure model would require me to allocate computing resources such as virtual machines or a microservice cluster and my daemon would be a continuously running process.   While I can scale my cluster of VMs up and down, I can’t scale it to zero without my daemon becoming unresponsive.  I only want to pay for my computing WHEN my computation is running.

This is not a totally new idea.  Paying only for the compute that we use goes back to early timesharing and persists with compute-hour “allocations” on supercomputers today.  And there are cloud services such as Azure Data Lake Analytics, Amazon Kinesis or the AWS API gateway, that charge you only for the computation that you use or data that you move and they do not require you to deploy server infrastructure to use them.

However, there is something deeper going on here and it has to do with triggers and another computational paradigm called “Function-as-a-Service”(FaaS). Unlike the serverless examples above which depend upon me invoking a specific well-defined service, FaaS allows a cloud user to define their own function, and then “register” it with the cloud and specify the exact events that will cause it to wake up and execute. As mentioned above, these event triggers can be tied to changes in state of a storage account or database, events associated with queues or streams of data from IoT devices, web API invocations coming from mobile apps.  Triggers can even be defined by steps in the execution of a workflow.   And, of course, the user only pays when and while the function is executing.

faas-diagram

Figure 1.  Function, triggers and output concept. (from markoinsights.com)

There have been two conferences that have focused on the state of serverless computing. The paper “Status of Serverless Computing and Function-as-a-Service (FaaS) in Industry and Research” by Geoffrey Fox, Vatche Ishakian, Vinod Muthusamy and Aleksander Slominski provides an excellent overview of many of the ideas, concepts and questions that surround serverless computing that surfaced at these conferences. In that paper they refer to an IBM tutorial that defines serverless FaaS as
1. short-running, stateless computation
2. event-driven applications
3. scales up and down instantly and automatically
4. based on charge-by-use

Notice that there is a distinction between FaaS and serverless FaaS and it has to do with item 4 above.   A good example of this is Google’s App Engine, which was arguably the first FaaS available from a commercial cloud.   In its current form App Engine can run in one of two modes.  In its standard mode, your applications run in a sandbox and you are charged only when the app is running.  In the “flexible” mode you deploy a container  and then specify the compute infrastructure needed in terms of CPU power, memory, disk and you are charged by the hour.  You could say that App Engine in running Flexible mode is server-lite, but clearly not fully serverless, while standard mode is truly serverless.

What are the serverless FaaS choices?

There are a number of FaaS implementations.  Some of these are used for research while others are commercial products.  The Status report refers to many of these and the slides for the workshop are on-line.   A good example of the research work is OpenLambda from the University of Wisconsin and first introduced dat HotCloud ’16.  Based on this experience the Wisconsin team described Pipsqueak, an experiment to reduce the deployment latencies caused by Python library initializations.   Ryan Chard described Ripple which is an excellent example of distributing event trigger management from the source to the cloud. Ripple has been designed and use for several significant science applications including beamline science (ALS and APS).  Another related technology is if-this-then-that IFTTT that is a service for chaining together other service.

Two other open source projects raise an interesting question about what is behind the curtain of serverless. Funktion  and Fission are both implementations of FaaS on top of Kubernetes. As we discuss serverless computing we must remember that there is a “server” somewhere. The basic infrastructure for serverless computing needs to run somewhere as a persistent service and hence a microservice platform like Kubernetes is a reasonable choice. This relates to the economics of severless computing and we return to that at the end of this report.

The most commonly referenced open source FaaS service is Apache OpenWhisk which was developed by IBM and is available on their Bluemix cloud as a service. The other commercial services include Google Function, Microsoft Azure Functions and Amazon Lambda. At the end of this article we will show some very simple examples of using some of these systems.

When can FaaS replace a mesh of microservices for building an app?

The Status paper also makes several other important observations. For example, they note that while serverless is great for the triggered example described above, it is not good for long-running or statefull applications like databases, deep learning training, heavy stream analytics, Spark or Hadoop analysis and video streaming. In fact, many large-scale cloud-native applications that have thousands of concurrent users require continuous use of massive networks of microservices. This will will not be based on serverless FaaS. However, there may be many cases where a user-facing application running in the cloud could be easily implemented with serverless FaaS rather than as a big microserivce deployment. What we do not know is where the cross-over point from a serverless FaaS implementation of an app to a full kubernetes based massive microservice deployment lies. This relates to the economic of FaaS (discussed briefly at the end of this article). The Cloud Native Computing Foundation has a working group on serverless computing that is addressing this topic.

There is one interesting example of using serverless FaaS to do massively parallel computing and it called pywren. The lovely paper “Occupy the Cloud: Distributed Computing for the 99%”  by Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica and Benjamin Recht describes the concepts in pywren which allow it to scale computation to thousands of concurrent function invocations achieving 40teraflops of compute performance. Pywren uses AWS Lambda in a very clever way: it serializes the computational function which are then passed to a lambda function to execute. We will return to Pywren in another post.

Computing at the Edge?

Perhaps the hottest topic in the general area of cloud computing is when the computing spills out of the cloud to the edge of the network. There are many reasons for this but most boil down to latency and bandwidth. A good example is the use cases that motivate the Ripple system described above. In order to generate a stream of events from sensors at the edge of the network, one needs very light weight computing that can monitor them and generate the events. In many cases it is necessary to preprocess the data in order to send the message to the cloud where a function will be invoked to respond. In some cases, the response must return to the source sooner than a remote invocation can respond because of the latencies involved. The computing at the edge may need to execute the function there and the actual communication with the cloud may be just a final log event or a trigger for some follow-up action.

Another possibility is that the functions can migrate to the places where they are needed. When you deploy the computing at the edge you also deploy a list of functions that that must be invoked. When the edge process starts up it could cache some of the function it needs locally. We expect there will be many variations on these ideas in the future.

A Tiny FaaS Tutorial

By their very nature FaaS systems are very simple to use: you write short, stateless function and tie them to triggers. We will take a quick look at three of these: AWS Lambda, Microsoft Function and IBM Bluemix OpenWhisk.

AWS Lambda Functions

Amazon was the first to introduce serverless functions as a service and it is very well integrated into their ecosystem of other services. Called Lambda functions, they can be easily created in a number of standard programming languages. Lambda functions can be associated with a variety of trigger events including changes to the state of a storage account, web service invocations, stream events and even workflow events.

We will illustrate Lambda with a simple example of a function that responds to Kinesis Stream events and for each event it adds an item to a dynamoDB table. Here is the python code for the function that accomplishes this task.

from __future__ import print_function
#import json
import base64
import boto3
def lambda_handler(event, context):
    dyndb = boto3.resource('dynamodb', region_name='us-west-2')
    table = dyndb.Table("lambdaTable")
    
    for record in event['Records']:
       #Kinesis data is base64 encoded so decode here
       payload=base64.b64decode(record["kinesis"]["data"])
       x = eval(str(payload))
       metadata_item = {'PartitionKey': x['PartitionKey'], 
                        'RowKey': x['RowKey'], 'title': x['text']}
       table.put_item(Item=metadata_item)

There are several critical items that are not explicit here.   We need to invoke the AWS Identity and Access Management (IAM) system to delegate some pemissions to our function and we will need to grab copies of the Amazon Resource Names (ARNs) for this and other objects. First we create a IAM role that will allow access to Kineses streams and DynamoDB. Using the IAM portal we created a role called:  “lambda-kinesis-execution-role” and attached two policies ‘AmazonDynamoDBFullAccess” and “AWSLambdaKinesisExecutionRole”. We then made a copy of the Role ARN.

The next step is to install this function into the lambda system. To do that we put the function above into a file called “ProcessKinesisRecords.py” and then zipped it. We then uploaded the zipped file to S3 (in the us-west-2) in the bucket “dbglambda”. With that we can create our function with the boto3 call from our laptop as:

lambdaclient = boto3.client('lambda')

response = lambdaclient.create_function(
    FunctionName='lambdahandler',
    Runtime='python2.7',
    Role='arn:aws:iam::066xx534:role/lambda-kinesis-execution-role',
    Handler='ProcessKinesisRecords.lambda_handler',
    Code={
        'S3Bucket': 'dbglambda',
        'S3Key': 'ProcessKinesisRecords.zip',
    }
    )

The final item we need is the link between this function and the Kinesis Stream service.
To do that we went to the portal for Kinesis and created a stream called “mylambdastream” and grabbed its ARN. Creating the binding is accomplished with the following.

response = lambdaclient.create_event_source_mapping(
    EventSourceArn=
        'arn:aws:kinesis:us-west-2:06xxx450734:stream/mylambdastream',
    FunctionName='lambdahandler',
    BatchSize=123,
    StartingPosition='TRIM_HORIZON',
)

We can verify the properties of the function we have created by looking at the AWS Lambda portal pages. As shown below, we can verify that our lambdahandler function does indeed have our stream as its trigger.

lambda1

Figure 2. AWS Lambda Function dashboard showing our trigger.

Finally we can invoke it by pushing an event to the kinesis stream. This is shown below.

client = boto3.client('kinesis')
record = "{'PartitionKey':'part2', 'RowKey': '444', 'text':'good day'}"
resp = client.put_record(StreamName='mylambdastream',
    Data=bytearray(record),
    PartitionKey='a')

Checking the DynamoDB portal will verify that the function has picked the message from Kinesis and deposited it in the database. The full details are in the notebook “simple-kinesis-lambda-sender.ipynb”.

Azure Functions

The Azure function portal has a large number of basic templates we can use to build our function as shown below.

azure-funcs1

Figure 3.  Azure Function Template samples

We have selected one of the Python examples that creates a simple web service. Bringing this up on the portal we see the code below.

import os
import json

postreqdata = json.loads(open(os.environ['req']).read())
response = open(os.environ['res'], 'w')
response.write("hello world to "+postreqdata['name'])
response.close()

To get this running, we now must go to the “integrate” page to provide one more detail: we set the authorization level to “anonymous” as shown below.

azure-func2

Figure 4.  Setting the authorization level for the trigger to “anonymous”.

The client is also simple and shown below.

import os
import json
import requests
import json

data = {}
data['name'] = 'goober'
json_data = json.dumps(data)
r = requests.post("https://pylook.azurewebsites.net/api/pylook", 
                   data=json_data)
print(r.status_code, r.reason)
print r.text

Using Azure Functions to pull data from a queue.

We now give another version of the function that reads messages from a queue and puts them in a table. There is no Python template for this one yet, so we will use JavaScript. To use template this we must first create a new storage account or use an existing one. Go to the storage account page and you will seefunction5

Figure 5. Creating a new table and queue in a storage account.

We click on Table and create a new table and remember its name. Then we go to Queues and create a new one and remember its name. Looking in the storage explorer should show for these items. Clicking on the table name should give us a picture of an empty table.

Go back to the portal main page and click + and look for “Function App” and click create. There is a table to fill in like the one below.

function4

Figure 6.  Creating a new function.

We give it a name and allow it to create a new resource group. For the storage we want to use the dropdown and look for the storage account name. (It is important that the storage account is in the same location as the function. We click create and wait for the function to appear on the function portal page. You should see it in your resource groups. Follow that link and you will see that it is running.

Go to the functions tab and hit “+”. It will ask you to pick one of the templates. At the top where it says “language” select Javascript and pick the one that is called QueueTrigger. This will load a basic template for the function. Now edit the template so that it looks like the following example.

function2

 

The main difference between this and the template is that we have added an output table and instructions to push three items into the table. The function is assuming that the items in the queue are of the form

{‘PartitionKey’: ‘part1’, ‘RowKey’: ’73’, ‘content’: ‘some data ‘}

Next we need to tie the queue to our queue and the table to our table. So click on “Integrate” on the left. You need to fill in the form so that it ties it to your stuff as illustrated below.

function3

Figure 8.  The association between the storage account queue and table service with the variable in our function.  Here we have highlighted the “show value” to verify that it has the right storage account.

You should see your storage account in the dropdown menu. Select it. And they add the Table name. You need to do the same for the AzureQueueStorage. Once this is done and your function is saved and the system is running your function should be instantiated and invoked as soon as you send it queue items. For that we have a simple python script in a Jupyter notebook. You can get from  https://SciEngCloud.github.io/py-functions-queue-driver.ipynb.
You will need to fill in your account key for the storage account, but then you should be able to step through the rest. The notebook runs a few tests and then sends 20 items to the queue. Using the Azure storage explorer we see the results in the table as shown below.

function1

Figure 9.   The view of the table after running the Jupyter Notebook

OpenWhisk and IBM Bluemix

OpenWhisk is the open source serverless function system developed by IBM that is also supported as a commercial service on their Bluemix cloud.  Like the others it is reasonably easy to use from their command-line tools, but the Bluemix portal provides an even easier solution.   Creating a function is as easy as selecting the runtime and downloading a template.  Figure xx below illustrates a simple function derived from template for a Python3 function.

action

Figure 10.  An OpenWhisk function that decodes a dictionary input and returns a dictionary.

Notice that the parameter to the function is a python dictionary (all Openwhisk messages are actually Json objects, which for Python are rendered as dictionaries).  While this template can be run as a web service in its present form, it is also possible to connect it to an external trigger.  To illustrate this we connected it to a trigger  that monitors activity in a Github account.  Using the portal it was easy to create the trigger (called “mygithubtrigger” and bind it to push and delete actions on a repository called dbgannon/whisk.   All this required was an access token which was easy available by logging in to the GitHub portal.   In the case of GitHub triggers the event returns a massive Json object that is a complete description of the repository.   In the action “myaction” we go two levels down in the dictionary and extract the official repository description and make that the response of the action to the event.

When a trigger fires you need a rule which bind the firing to an action.   We bound it to the trivial “myaction” example above.  The view from the portal of the rule is below

ruleview

Figure 11.   The rule view of our association of the trigger to the function

We next added a new file to the repository.  This activated the trigger and the rule invoked the action.   The portal has a nice monitor facility and the image is below.

monitor
Figure 12.  This is the “Monitor” view after the trigger has fired.

Finally, drilling down on the “myaction” log event we see the description of the GitHub repository we created.

json-data
Figure 13.  The output of the “myaction” function after a file was added to the GitHub repository.

Finally

These examples above are all very trivial.   The next thing to explore is how functions can be composed into workflows.   Each of the three systems has its own way of doing that and if we have time later we will show some examples of this capability.   We have also not disussed performance or cost which is greatly dependent on the rate at which your triggers are firing and the amount of work in each function execution.

The economics of serverless computing are also very interesting.  As we pointed out  earlier, for a cloud provider to offer a FaaS serverless capability it must be supported by actual infrastructure.   How can the provider afford this if it is only charging by the second?  There are two possible answers.   First if your FaaS is serving very large numbers of function per second then it will certainly pay for itself.   But there is another consideration.   Many of the big cloud providers are running vast microservice frameworks that support most, if not all of their big internal and publiclly available applications.  Running  a FaaS system on top of that is as easy as running any other microservice based application.  One only needs to set the cost per second of function evaluation high enough to cover the cost of that share of the underlying platform the FaaS system is using.   For Amazon Lambda “The price depends on the amount of memory you allocate to your function. You are charged $0.00001667 for every GB-second used.”  This amounts to $0.06 per GB hour.  So for a function that take 8 GB of memory to execute that is $.48 per hour, which is 4.8 times greater than the cost of an 8GB m4.large ec2 instance.   Consequently a heavily used FaaS system is a financial win and a lightly used one may have little impact on your infrastructure.

Of course, single executions of a FaaS function are limited to five minutes, so you would need 12 5 minute concurrent executions to reach an hour of execution time. Furthermore AWS give you 1 million invocations (at 100ms) each month for free or 400,000 GB-seconds per month free.  That is 111 GB hours per month for free.   This is not a bad deal and it would seem to indicate that Lambda is not a big drain on their infrastructure yet.

Acknowledgement:  Thanks go to Ryan Chard for helpful comments and suggestions on a draft of this post.

Azure’s new CosmosDB Planet-Scale Database

The Microsoft Azure team has recently released CosmosDB, an new entry to the cloud data storage management marketplace. Cosmos is actually the name of a data storage system that has been used internal to Microsoft for many years. That original Cosmos has now morphed into Azure Data Lake Storage (ADLS) and Analytics. ADLS is focused on Hadoop/HDFS compatible scalable analytics and, at the present time, it is available in only two of the US data centers. CosmosDB is related to the original Cosmos in name only and it is available at all of the data centers.

There is an excellent technical description [https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db], so we will hit only a few highlights here.  CosmosDB has four main API components and a number of interesting properties. From the user’s perspective these components are

  1. DocumentDB; “a NoSQL document database service designed from the ground up to natively support JSON and JavaScript directly inside the database engine.”
  2. MongoDB; CosmosDB implements the MongoDB API so that users of that popular NoSQL database can easily move to CosmosDB.
  3. Azure Table Storage: a simple key-value table mechanism that has been part of Azure from it earliest days.
  4. Gremlin Graph Query API; a graph traversal language based on functional programming concepts.

There is a common “resource model” that unifies all of these capabilities. A database consists of users with varying permissions and containers. A container holds one of the three content types, document collection, tables or graphs. Special resources like stored procedures, triggers and user-defined-functions (UDFs) are also stored within the container. At the top level the user creates a cosmosDB database account and in doing so the user must pick one of these four APIs. In the examples we show here we focus on the DocumentDB API. In a later post we will look at Gremlin.

There are four important properties that every CosmosDB databased has.

  1. Global distribution: your database is replicated in any of 30+ different regions and you can pick these from a map on the Azure portal.
  2. Elastic scale-out: you can scale throughput of a container by programmatically provisioning throughput at a second or minute granularity. provision throughput (measured in using a currency unit called, Request Unit or RU).
  3. Guaranteed low latency: For a typical 1KB item, Azure Cosmos DB guarantees end-to-end latency of reads under 10ms and indexed writes under 15ms at the 99th percentile within the same Azure region.
  4. Five consistency models.
  5. A Comprehensive Service Level Agreement (SLA).

Global Distribution

CosmosDB has two forms of distribution. Each database is composed of one or more collections and every data collection is stored in a logical container that is distributed over one or more physical server partitions. And, of course, everything is replicated. Every item in a collection has a partition key and a unique ID. The partition key is hashed and the hash is used to make the assignment of that item to one of the physical servers associated with that container. This level of partitioning happens automatically when you create a collection and start filling it.

Global distribution is more interesting. When you first create a database it has an initial “write location” which refers to one of the 30 or so Azure regions around the globe. You can use the portal to say how your data is distributed to other regions. In the example we discuss at the end of this document our original region “North Central US”. We have used the “Replicate Data Globally” function, which gives us a map in the portal, where we selected three additional regions, “West Europe”, “South India” and “Brazil South” as shown in Figure 1 below.

geodistributed-db

Figure 1. Selecting three additional locations for a database using a map tool.

With a small (170MB) database of documents it took only a few minutes to replicate the data to these locations. The data collections in the replicate sites are considered “read locations”. Which means that a user in these locations will by default read data from the local replica. However if that remote user attempts a write the request is routed to the “write location”. To understand how long a remote reader takes to see an update from the remote write location we need to discuss the consistency models.

Elastic scale-out and The Cost of Computing

Every operation you do in CosmosDB requires bandwidth and computation. You can control the rate at which “energy” is consumed for your database with something called Request Units (RU) and RUs/sec and RUs/min. You specify the RUs/sec you are willing to “spend” on your database and CosmosDB will provision the resources to meet your throughput requirements.

Consistency Models

Consistency is one of the most difficult topics in distributed systems. In the case of distrubted or replicated databases, consistency the consistency model tell the user how changes in one copy of a database are reflected updates to other copies. By insisting that consistency be very strong, this may have an adverse impact on latencies and throughput. So the choice of consistence model can have a profound impact on application performance and cost.

One of the truly unique features of CosmosDB is that it gives the user a choice of five different consistency models. In order from the weakest to the strongest and in terms of RUs, the cheapest to most expensive they are:

  1. Eventual consistency. Eventual consistency is the weakest form of consistency wherein a client may get the values which are older than the ones it had seen before, over time. In the absence of any further writes, the replicas within the group will eventually converge.
  2. Consistent prefix. Consistent prefix level guarantees that reads never see out of order writes. If writes were performed in the order `A, B, C`, then a client sees either `A`, `A,B`, or `A,B,C`, but never out of order like `A,C` or `B,A,C`.
  3. Session consistency. When you connect to a cosmosDB database through its URL you are creating a session. Session consistency guarantees monotonic reads, monotonic writes, and read your own writes (RYW) guarantees for the duration of the session. In the Python API you create client object that encapsulates your connection to the database. The parameter “Consistency_level” has the default value “Session”. Reads in session consistency take a bit longer than consistent prefix which takes longer than eventual consistency.
  4. Bounded stateness. This model has two ways to insist on consistency. You can define a time interval such that beyond that interval of time from the present, the system will guarantee consistency. Alternatively you can specify a the upper bound on the number of reads that lag behind writes your reader can be. Beyond that point consistency is guaranteed. Of course the smaller you make the window, the more computing you may consume and delay you may encounter.
  5. Strong consistency. The most expensive and most complete. it is linearizable in that reads always returns the most recent writes. But it is limited to collections that are not geodistributed.

A word about the performance of these models. In the example at the end of this section we put the write region of the database in North America and replicated in three other locations with one in Western Europe. We then created a virtual machine in Western Europe and connected it to the database and verified by the IP address that it was connecting to the local replica. We then fired changes at the North America region and read with each of the consistency models. We were unable to detect any out of order reads and the responses were extremely fast. This is obviously not a full scientific study, but it was clear we had a system that performed very well. We also attempted a write to the local Western Europe “read location”. We were able to detect that the API call immediately dropped the read location connection and reconnected to the North American copy to do the write.

Comprehensive SLA

The service level agreement (SLA) for CosmosDB is actually a contract that Azure and you agree to when you create a database. It is indeed comprehensive. It covers guarantees for availability, throughput, consistency and latency giving you upper bounds for each measurable attribute of performance. As you are the one to specify how many RUs/sec you are willing to allocate, some of the temporal guarantees may be dependent upon this allocation not being exceeded. We will not go into it in detail here. Read it with your lawyer.

A look at Azure DocumentDB using Python and GDELT Data

To illustrate the basics of DocumentDB we will use a tiny part of the amazing GDELT event collection. This data collection is available on AWS S3 as well as Google’s BigQuery. The GDELT project (http://gdeltproject.org) is the brainchild of Kalev Leetaru from Georgetown University and it is an serious “Big Data” collection. GDELT’s collection is the result of mining of hundreds of thousands of broadcast print and online new sources from every corner of the world every day. What we will look at here is a microscopic window into the daily collection of news items. In fact, we will take a look at the news from from one day: June 30, 2017.

AWS keeps the data on S3 and downloading it is dead easy. The command

$ aws s3 cp s3://gdelt-open-data/events/20170630.export.csv . 

will download the 20MB June 30 2017 dataset as a CSV file. Each row of the file is a record of an “event” consisting of some 60 attributes that catalog a publication of a news item. Among the fields in the record are a unique identifier, a timestamp and the name and geolocation of two “actors” that are part of the event. In addition, there is a url a published news item about the event. The actors are derived from the even. Actor1 is called an initiator of the event and actor2 is a recipient or victim of the event. These actors are usually represented by the names of the cities or countries associated with the actual actors.

For example the story “Qatar’s defense minister to visit Turkey amid base controversy” (http://jordantimes.com/news/local/qatar’s-defence-minister-visit-turkey-amid-base-controversy) appeared in the Jordan Times. It describes an event where the Qatar defense minister visited Turkey as Turkey resists pressure from Saudi Arabia, the United Arab Emirates, Egypt and Bahrain to close bases after the five other nations pressed sanctions agains Qatar on June 5. This URL appears three time in the database for June 30. In one case, actor1 is Doha Qatar and actor2 is Ankora Turkey. In the second case actor1 is Riyadh, Saudi Arabia actor2 is Turkey. In the third case actor1 and actor2 are both Riyadh. Though the identity of the actor cities is based on an automated analysis of the story, their may be three different stories to analyze. However, only one URL has been selected to represent all three.

For our purposes we will describe this as a single event that links the three cities, Doha, Ankora and Riyadh. Many researchers have used this GDELT data to do deep political and social analysis of the world we live in, but that goal is way beyond what we intend here. What we will try to do in this example is to use CosmosDB to graphically illustrate different ways we can represent the events of June 30 and the cities they impact.

Creating the DocumentDB from the CSV file.

We will begin by creating a simple document database where each document is the JSON record for one line of the CSV file. Actually we will only include the URL, the two actor cities and their geolocations. To begin we must connect you client program to the CosmosDB system.

We use Python and we assume that the documented tools have been installed.

#run this if documentdb is not installed
#!pip install pydocumentdb
import csv
import pydocumentdb;
import pydocumentdb.document_client as document_client

We next create the database and the collection. But first we need an Azure CosmosDB account and is best done on the Azure portal. One that is there you can retrieve the account key so that we can create the collection.
Our account is called bookdockdb and our database is called ‘db3’ and the new collection will be called ‘gdelt’.

config = { 
    'ENDPOINT': 'https://bookdocdb.documents.azure.com',
    'MASTERKEY': 'your db key here',
    'DOCUMENTDB_DATABASE': 'db3',
    'DOCUMENTDB_COLLECTION': 'gdelt'
};
#we create the database, but if it is already there the "CreateDatabase will fail"
#In that case, we can query for it. 
#this is handy because we can have more than one collection in  a given database.
try:
    db = client.CreateDatabase({ 'id': config['DOCUMENTDB_DATABASE'] }, )
except:
    db_id = config['DOCUMENTDB_DATABASE']
    db_query = "select * from r where r.id = '{0}'".format(db_id)
    db = list(client.QueryDatabases(db_query))[0]

Next we will create the collection. Again, we may already have the collection there, so we can add new items to an existing collection but we need a different method to retrieve the collection handle.

options = {
    'offerEnableRUPerMinuteThroughput': True,
    'offerVersion': "V2",
    'offerThroughput': 400
}
try:
    collection = client.CreateCollection(db['_self'], 
         { 'id': config['DOCUMENTDB_COLLECTION'] }, options)
except:
    collection = next((coll for coll in client.ReadCollections(db['_self']) \
         if coll['id']==config['DOCUMENTDB_COLLECTION']))

The next step is to read the CSV file and add the new documents to the database collection.
We are going to do this one line at a time and keep only the time stamps, actor cities and geolocations (latitude and longitude) and the associated URL. We will also keep a separate list of the URLs for later use.

url_list = []
with open('20170630.export.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter='\t')
    for row in readCSV:
        url_list.append(row[57])
        document1 = client.CreateDocument(collection['_self'],
            { 
             'id': row[0],
             'time_stamp': row[1],
             'time_stamp2': row[4],
             'city0_lat': row[46],
             'city0_lon': row[47],
             'city0_name': row[43],
              'city1_lat': row[53],
              'city1_lon': row[54],
              'city1_name': row[50],
              'url': row[57] 
              }
            )

The next thing we will do is to look at the URLs that involve a large number of cities (because they are often more interesting). Because we have the geolocations of each city we can draw a map and link the cities associated with the URL.

To do this we need to remove all the duplicates in our URL list using a standard Python trick: convert the list to a set and then back to a list.

s = set(url_list)
url_list = list(s)

We next query the database collection to find the cities associated with each URL. The best way to do this in the cloud is a simple map-reduce operation where we map each record to a pair consisting of a URL and a list of cities. We then invoke a reduce-by-key function to reduce all pairs with the same key to a single pair with that key with all of the lists associated with that key concatenated.
If we were doing more than a single day’s worth of data, It would be worth bringing Spark into the picture and using that for the map-reduce step. Instead we will do it sequentially using a data base query for each URL. The following function accomplishes that task.

def get_cities_for_url(url):
    db_query = {'query':"select * from r where r.url = '{0}'".format(url) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 1000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    cities = []
    for x in results:
        if x['city1_name'] != '':
            cities.append(x['city1_name']+" lat="+x['city1_lat'] + ' lon='+x['city1_lon'])
        if x['city0_name'] != '':
            cities.append(x['city0_name']+" lat="+x['city0_lat'] + ' lon='+x['city0_lon'])
    city_set = set(cities)
    cities = list(city_set)
    return cities

The function returns a list of strings where each string is a city and its geolocation. We have used the matplotlib Basemap tools to render the most interesting sets of cities where we define “interesting” to be those URL sets with more than 5 cities. In this case we have only plotted a few hundred of the city clusters.

gdelt-geoimage

Figure 2.   Cities linked by shared stories.

While this is colorful, it is not very informative. We can tell there was some events in Australia that involved Europe and the US (this turned out to be a scandal involving an important person in the catholic church) and there was a number of stories that involve North Korea, the US and China. Something happened in the Northwest that involved cities in the east coast of the US.n general the information content is low.

To create a different view, let’s transform the collection to one that reflects the set of cities associated with each URL. To accomplish that we call the get_cities_for_url() function above and create a new document that only three attributes: an ID, the URL and a list of cities associated with the URL. The new collection is called “cities”, but it is really more like a dictionary that associates the URLs with a list of cities.  (The Jupyter Notebook used to build this collection and illustrate the example below is available here.)

One thing we can do is to search for special words or phrases in the URLs and list the cities that are associated. To do this we can make use of another feature of DocumentDB and create a user defined function to do the string search. This is extremely easy and can be done from the Azure Portal. The functions are all java script. The data explorer tab for the “cities” collection and select the “New User Defined Function” tab. In our case we create a function “findsubstring” that searches for a substring “sub” in a longer string “s”.

function  findsubstring(s, sub){
    t = s.indexOf(sub)
    return t
}

We can now call this function in a database query as follows.

def checkInDB(name):
    db_query = {'query':"select r.url, r.cities from r where udf.findsubstring(r.url,'{0}')>0".format(name) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 10000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    return results

Picking a name from the news, “Ivanka”, we can see what stories and cities mentioned “Ivanka” on that day. The call checkInDB(“Ivanka”) returns

[{'cities': ['Oregon, United States', 
'White House, District of Columbia, United States',
'Buhari, Kano, Nigeria',
'Canyonville Christian Academy, Oregon, United States ', 
'Chibok, Borno, Nigeria', 
'Virginia, United States'],
'url': 'http://www.ghanaweb.com/GhanaHomePage/world/Trump-and-Ivanka-host-two-Chibok-girls-at-the-White-House-553963'}]

which is a story about a reception for two girls that were rescued from the Boko Haram in Africa. The list of associated cities include Canyonville Christian Academy, a private boarding school that has offered to continue their education.

A More Interactive Abstract View of the Data.

Another interesting approach to looking at the data is to consider stories that link together smaller cities in some ways. We are going to look for “local” story connections rather than big news items. More specifically, we will look at cities that appear in at least 3 but no more than 5 stories in the days news. This leaves out big cities like washington DC and big global headlines. Next we will build a graph that we can use to explore the stories and cities that share these more regional stories. There are actually 245 cities that appear in our restricted set.

In order to make our graph we need a dictionary that maps cities into the list of stories that mention it. Call that dictionary “stories” Our visualization has nodes that are cities and we will connect them by an edge if they appear in a story together. Let “linkcities” be the list of 245 cities describe above. We can then compute the edges using as follows.

edges = []
for i in range(0, len(linkcities)):
    for j in range(0, len(linkcities)):
        if i < j:             
		iset = set(stories[linkcities[i]])             
		jset = set(stories[linkcities[j]])             
		kset = iset.intersection(jset)             
		if len(kset)> 0:
                	edges.append((i,j))

The complete code for this example is given in this Jupyter notebook. We use the Plotly package to build and interactive graph. The nodes are distributed approximately over the surface of a sphere in 3D. You can rotate the sphere and mouse over nodes Doing so, shows the URLs for the stories associated with that town. The sphere appears as the image below.

city-graph

Figure 3.  Interactive city graph

A better way to see this is to interact with it. You can do this by going to the bottom panel in this file [https://SciEngCloud.github.io/city-graph.html].

Conclusion

Our experience with CosmosDB has been limited to using the DocumentDB API, but we were impressed with the experience. In particular, the ease with which one can globally distribute the data was impressive as were the selection of consistency protocols. As we said, we were unable to force the database to show temporary inconsistency even with the weakest protocol choice using very distant copies of the database. This should not be considered a fault. Our tests were too easy.  The examples above do not begin to stretch all the capabilities of CosmosDB, but we hope they provide a reasonable introduction for the Python programmer.)

One final comment. It seems CosmosDB was built on top of the Azure microservice container system. It would be very interesting to see more details of the implementation.

 

 

Containers for HPC. A look at Singularity.

Vanessa Sochat and I recently completed a little study of Singularity, the container system that brings the advantages of Docker-style encapsulation to HPC.  If you are interested this is now a supplementary chapter to the book “Cloud Computing for Science and Engineering” that is being published by MIT Press.  You can see the chapter here: https://cloud4scieng.org/singularity-a-container-system-for-hpc-applications/  The chapter has a high level overview and a brief tutorial showing how to build a singularity container for an MPI application and how to run it on an small cluster on AWS. 

Manifold Learning and Deep Autoencoders in Science

One way to think about machine learning is to view it as building a model of a system based on samples of data that are artifacts of that system.  This view does not resonate very well when the problem at hand is identifying photos containing sail boats, but it is apt when ML is applied to scientific data.

The data from scientific experiments is often takes the form of vectors in a very high dimensional space and we are looking for an underling organization of that data that reflects properties our system.   Manifold learning is based on the assumption that the system you are trying to model generates data that lies on or near a lower dimensional surface in the higher dimension coordinate space of the data.   Picture the surface of a sphere or a curve in 3-D.   If this manifold assumption about the data is true, it may be possible to “unfold’’ the surface so that a projection or other linear analysis makes the data easier to understand.

Autoencoders are deep neural networks that can be used to discover properties of the manifold of data that characterizes the system you are trying to model.  Illustrated below, autoencoders have an encoder phase and a decoder phase with a space of much lower dimension than the input in the middle.  By studying the properties of this lower dimensional space, we have a better view of the data manifold. You train an autoencoder to be the identity function.  

autoencoder

Recently, a lovely blog article A look at deep learning for science by Prabhat gave us an excellent overview of some uses of deep learning technology in science applications and several of these were applications of autoencoders.  Partly inspired by that article and by our own experiments with some very interesting images of neuron cells collected by Maryana Alegro at UCSF, we put together a little tutorial on autoencoders in science.   In the article we discuss two types of autoencoders: denoising and variational.  The variational autoencoder we examine is applied to the cell images in an effort to create a model that can be used to both classify samples as well generate new samples by following a path along the surface of the manifold. 

We have published this little study as a supplement to our chapter on machine learning in the book “Cloud Computing for Science and Engineering”.   The link to the article is here.

Cloud Computing for Science and Engineering

Ian Foster and I have just completed a final draft of a book that is designed to introduce the working scientist, engineer or student to cloud computing.  It surveys the technology that underpins the cloud, new approaches to technical problems enabled by the cloud, and the concepts required to integrate cloud services into scientific work.  Many of the blog posts that have appeared here have been reworked and, we hope, greatly improved and integrated into the text.  In addition the book contains introductions to the Globus data management and services infrastructure that has become widely used by the research community.

We have a website for the book https://Cloud4SciEng.org that contains draft chapters,  jupyter notebooks that illustrate most of the concepts and a collection of lecture slides for the tutorial at the IEEE International Conference on Cloud Engineering based on the material in the Book.  This collection will grow over time.  The book website will also contain updates to the book material as the current cloud technology evolves.

The Table of contents for  the book is below.   We look forward to your feedback.

Table of Contents

Acknowledgments
Preface
1 Orienting in the cloud universe

Part I. Managing data in the cloud

2 Storage as a service
3 Using cloud storage services

Part II. Computing in the cloud

4 Computing as a service
5 Using and managing virtual machines
6 Using and managing containers
7 Scaling deployments

Part III. The cloud as platform

8 Data analytics in the cloud
9 Streaming data to the cloud
10 Machine learning in the cloud
11 The Globus research data management platform

Part IV. Building your own cloud

12 Building your own cloud with Eucalyptus (with Rich Wolski)
13 Building your own cloud with OpenStack (with Stig Telfer)
14 Building your own SaaS

Part V. Security and other topics

15 Security and privacy
16 History, critiques, futures
18 Afterword: A discovery cloud?

Big Data Congress 1017- CFP

I wanted to help get the word out concerning the final call for papers for the 2017 IEEE Big Data Congress.   The deadline is February 28, 2017.  This meeting is part of an annual gang of meetings including the IEEE Cloud conference, the IEEE International Conference on Web Services and others.   The conferences are going to be in Honolulu, Hawaii June 25 – June 30, 2017.

The Big Data meeting will have four tracks: research, applications, short papers and special topics.  The special topics including vision papers to point out emerging challenges, papers that describe new  data sets and benchmarks,   experience and  surveys.

The full call for papers is here: http://www.ieeebigdata.org/2017/cfp.html.

Should be a good meeting.