Algorithmia Part 2. Document Embedding with Gensim and Deploying a Trained Model in the Algorithmia Cloud

 

In a previous post we described Algorithmia, a cloud service for discovering, invoking and deploying algorithms. In this short article we look at Algorithmia as a tool to deploy trained machine learning models. We use a tool called Gensim to build a model of scientific documents and then create an Algorithmia service that uses the model to predict the topic categories of scientific articles.

A Review of Word Vectors and Document Vectors.

This technology has been around for a while now, so this review is more of a history lesson and not a deep technical review.  However, we will give you links to some of the important papers on the subject.

If you want to do document analysis through machine learning, you need a way to represent words in a vector form.  Given a collection of documents you can extract all the words to create a vocabulary.  If the size of the vocabulary is 100,000 words, you can represent each word as a “one-hot” vector in which the i-th word in the vocabulary is a vector of zeros except for a 1 in the i-th position.   Each document in your collection can then be represented as the sum of the vectors corresponding to the words in that document.   If you have M documents, then the collection is represented by a sparse matrix of size M x 100,000.  Using this “bag of words” representation, there are a variety of traditional techniques such as Latent Semantic Analysis that can be used to extract the similarities between documents.

About five years ago, a team from Google found a much better way to create vectors from words so that words that are used in similar semantic contexts are nearer to each other as vectors.   Described in the paper by Tomas Mikolov et al., the method, often referred to as Word2Vec, can be considered a map m() from our 100,000-dimension space of words to a dense space of much smaller dimension, say 50, with some remarkable properties.   In particular, there are the now-famous linear analogy relationships. For example “man is to king as woman is to queen” is expressible (approximately) as

m(king) – m(man) + m(woman)  ≈  m(queen)
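In gensim this kind of analogy arithmetic is a one-liner. The sketch below is only an illustration: it assumes a recent gensim with the downloader module and uses a small pretrained embedding, but any trained Word2Vec/KeyedVectors model exposes the same most_similar interface.

import gensim.downloader as api

# A small pretrained embedding, used here only for illustration.
wv = api.load("glove-wiki-gigaword-50")

# "man is to king as woman is to ?": add m(woman), subtract m(man)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))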

There is an excellent set of technical explanations of why Word2Vec works on Quora and we won’t go into them here.  One of the best papers that addresses this issue is by Goldberg and Levy.

Le and Mikolov have shown that the basic methods of Word2Vec generalize to paragraphs, so that we now have a map p() from a corpus of paragraphs to vectors. In other words, given a corpus of documents D of size N, then for any doc d in D, p(d) is a vector of some prespecified length that “encodes” d.  At the risk of greatly oversimplifying, the paragraph vector is a concatenation of a component that is specific to the paragraph’s ID with word vectors sampled from the paragraph.   (As with Word2Vec, there are actually two main versions of this model.  Refer to the Le and Mikolov paper for details.)  It turns out that the function p can be extended to arbitrary documents x so that p(x) is an “inferred” vector in the same vector space.  We can then use p(x) to find the documents d such that p(d) is nearest to p(x).  If we know how to classify the nearby documents, we can make a guess at the classification of x.  That is what we will do below.

Using Doc2Vec to Build a Document Classifier

Next we will use a version of paragraph vectors from Gensim’s Doc2Vec model-building tools and show how we can use it to build a simple document classifier.  Gensim is a product of Radim Řehůřek’s RaRe Technologies.  An excellent tutorial for Gensim is this notebook from RaRe.  To initialize Gensim Doc2Vec we do the following.

import gensim
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)

This creates a model that, when trained, will produce vectors of length 50.  Words that occur fewer than 2 times in the corpus are ignored, and training will run for 55 iterations.

Next we need to ready a document corpus.   We will use 7,000 science journal article abstracts from the Cornell University archive ArXiv.org.  We have recorded the titles, abstracts and the topic classifications assigned by the authors.   There are several dozen topic categories but we partition them into five major topics: physics, math, computer science, biology and finance.   We have randomly selected 5,000 for the training set and we use the remainder, plus another 500 from recently posted papers, for testing.  We must first convert the text of the abstracts into the format needed by Doc2Vec.  The files are “sciml_train” and “sciml_test”.   The function below preprocesses each of the document abstracts to create the correct corpus.

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            doc = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield doc
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(doc, [i])

train_corpus = list(read_corpus("sciml_train"))
test_corpus = list(read_corpus("sciml_test", tokens_only=True))

We next build a vocabulary from the words in the training corpus.   This is a dictionary of all the words together with the counts of the word occurrences.  Once that is done we can train the model.

model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)

The training takes about one minute on a simple 4-core server. We can now save the model so that it can be restored for use later with the Python statement model.save("gensim_model"). We will use this later when building the version we will install in Algorithmia.
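A minimal sketch of that save-and-restore round trip (the file name comes from the text; the rest is standard gensim):

from gensim.models.doc2vec import Doc2Vec

model.save("gensim_model")            # persist the trained model to disk

# ... later, e.g. inside the Algorithmia service, restore it:
model = Doc2Vec.load("gensim_model")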
The model object contains the 5000 vectors of length 50 that encode our documents. To build our simple classifier we will extract this into an array mar of size 5000 by 50 and normalize each vector to be of unit length. (The normalization will simplify our later computations.)

import numpy as np

mar = np.zeros((model.docvecs.count, 50))
for i in range(model.docvecs.count):
    x = np.linalg.norm(model.docvecs[i])
    mar[i] = model.docvecs[i]/x

An interesting thing to do with the mar matrix is to visualize it in 2-D using the t-distributed stochastic neighbor embedding (t-SNE) algorithm. The result is shown in the figure below. The points have been color coded based on topic: 1 (deep purple) = “math”, 2 (blue gray) = “physics”, 3 (blue green) = “bio”, 4 (green) = “finance” and 5 (yellow) = “compsci”.
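The projection only takes a few lines. The sketch below uses scikit-learn’s TSNE, which is an assumption (the post does not say which t-SNE implementation was used), and it assumes a list labels holding the topic number of each training document.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the 5000 x 50 matrix of normalized document vectors into 2-D.
proj = TSNE(n_components=2, random_state=0).fit_transform(mar)

# labels is assumed: the topic number (1-5) of each training document.
plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="viridis", s=5)
plt.show()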

[Figure: t-SNE projection of the 5000 document vectors, color coded by topic]

There are two things to note here.   First, the collection is not well balanced in terms of numerical distribution.  About half the collection is physics and there are only a small number of bio and finance papers.   That is the nature of academic science:  lots of physicists publishing papers and not so many quantitative finance or quantitative bio papers in the open literature.   It is interesting to note that the physics papers divide clearly into two or three clouds.  (It turns out these separate clouds could be classified as “astrophysics” and “other physics”.)  Computer science and math have a big overlap, and bio has a strong overlap with cs because these are all “quantitative bio” papers.

The classification algorithm is very simple. Our model has a function infer_vector(doc) that uses stochastic methods to map the doc into the model vector space. Using that inferred vector we can compute the nearest k documents to it in the model space with the function below.

def find_best(k, abstract):
    preproc = gensim.utils.simple_preprocess(abstract)
    v = model.infer_vector(preproc)
    v0 = v/np.linalg.norm(v)
    norms = []
    for i in range(5000):
        norms.append([np.dot(v0, mar[i]), i])
    # sort by cosine similarity, largest first, and keep the top k
    norms.sort(reverse=True)
    return norms[0:k]

The dot product of two normalized vectors is their cosine similarity. Because infer_vector is stochastic in nature, our final version of the classifier calls find_best ten times and computes an average ranking. (The details are in this notebook and an HTML version.) A sketch of this aggregation step is below, followed by the result of applying the classifier to one of the more recent abstracts.
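The exact ranking logic is in the notebook; what follows is only a sketch of the idea, and it assumes a list train_topics that maps each training-document index to one of the five major topics.

from collections import defaultdict

def find_best_topic(abstract, k=10, trials=10):
    # train_topics is assumed: a list mapping each training-document index
    # to one of the five major topics (math, physics, bio, finance, compsci).
    scores = defaultdict(float)
    for _ in range(trials):              # infer_vector is stochastic, so repeat
        for sim, i in find_best(k, abstract):
            scores[train_topics[i]] += sim
    total = sum(scores.values())
    # return percentage-style scores, best topic first
    return sorted(((100.0 * s / total, t) for t, s in scores.items()),
                  reverse=True)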

[Figure: classifier output for a recent abstract]

The analysis gives the abstract a score of 80 for computer science and 20 for bio. Note that the title contains the detailed ArXiv category, so we see this is correct, but reading the article it could also be cross listed as bio.
On the other hand, there are many examples that easily confuse the system. For example, the one below is classified as quantitative biology in arXiv, but the system can’t decide if it is math, cs or physics.

[Figure: classifier output for an abstract the system cannot clearly classify]

In general we can take the highest ranking score for each member of the test set and then compute a confusion matrix (a sketch of this computation follows).  The result is shown in the table below; each row of the table represents the percentage of the system’s best guesses for the row label.
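A sketch of that computation, where true_topics and predicted_topics are assumed to be lists of topic names for the test documents, with the prediction taken as the top-ranked topic:

import numpy as np

topics = ["math", "physics", "bio", "finance", "compsci"]

def confusion_matrix(true_topics, predicted_topics):
    # Rows are the true topic, columns the predicted topic, as percentages.
    cm = np.zeros((len(topics), len(topics)))
    for t, p in zip(true_topics, predicted_topics):
        cm[topics.index(t), topics.index(p)] += 1
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)

# print(confusion_matrix(true_topics, predicted_topics))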

[Figure: confusion matrix for the test set]

One interesting observation here is that in the cases where there is an error in the first guess, the most common mistake was to classify an abstract as mathematics.

Moving the model to Algorithmia

Moving the model to Algorithmia is surprisingly simple.  The first step is to create a data collection in the Algorithmia data cloud.   We created one called “gensim” and it contains the three important files: the gensim model, topicdict (the dictionary that translates ArXiv topics to our major topics), and the ArXiv topics associated with each of the training documents.   The Algorithmia collection is shown below.  We also loaded the training document titles but they are not necessary.
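The collection can also be created and populated from the Algorithmia Python client rather than the portal; a sketch (the data paths come from the text, the file names follow the load_model skeleton below, and the API key is a placeholder):

import Algorithmia

client = Algorithmia.client("your_api_key")       # placeholder key

# Create the collection if needed and upload the three files.
collection = client.dir("data://dbgannon/gensim")
if not collection.exists():
    collection.create()

for name in ["gensim_model", "topicdict", "train_sites"]:
    client.file("data://dbgannon/gensim/" + name).putFile(name)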

[Figure: the “gensim” data collection in the Algorithmia data portal]

The main difference between running a trained model in Algorithmia and running a “normal” algorithm is the part where you load the model from the data container.   The skeleton of the Python code now includes a function load_model() which you write and a line that invokes this function as shown below.  Now when your algorithm is loaded into the microservice it first calls load_model() before invoking the apply(input) function.   For all subsequent invocations of your algorithm while it is running in that microservice instance the model is already loaded. (The full source code is here.)

import Algorithmia
import gensim
from gensim.models.doc2vec import Doc2Vec

client = Algorithmia.client()

def load_model():
    file_path = 'data://dbgannon/gensim/gensim_model'
    file_path = client.file(file_path).getFile().name
    model = Doc2Vec.load(file_path)
    # similarly load train_sites and topicdict
    # and create mar by normalizing model data
    return model, mar, topicdict, train_sites

model, mar, topicdict, train_sites = load_model()

def find_best_topic(abstract):
    # body of find_best_topic
    ...

def apply(input):
    out = find_best_topic(input)
    return out

Deploying the algorithm follows the same procedure as before.   We add the new algorithm from the Algorithmia portal and clone it.   Assuming SciDocClassifier.py contains our final version of the source, we execute the following commands.

git add SciDocClassifier.py
git commit -m "second commit"
git push origin master

Returning to the Algorithmia portal, we can go to the project source editor. From there we need to add the code dependencies. In this case, we wanted exactly the same versions of gensim and NumPy that we used in our development environment. As shown below, that was easy to specify.

[Figure: specifying the gensim and numpy dependencies in the Algorithmia source editor]

The final version has been published as dbgannon/SciDocClassifer and is available for anyone to use.  Once again, we found Algorithmia’s tools easy to use and fun to experiment with. There are many algorithms to try out and a starter free account is all you need.

Algorithmia™: A Cloud Marketplace for Algorithms and Deep Learning

 

One area of great frustration encountered by application developers involves the challenge of integrating new algorithms into a code base.  There are many reasons for this.   For example, the algorithm may be described in a journal article where many details of the implementation are omitted or it is available only in a programming language different from the one being used.  The code may have software dependencies that are hard to resolve.  The new algorithm may also have hardware dependencies, such as reliance on a GPU to get performance and you may not have access to this hardware.  On the other hand, if you are the author of a great new algorithm you may be disappointed that your new invention is not being used for these very same reasons.     

About 18 months ago a company called Algorithmia™  was founded in Seattle that provides an elegant solution to these problems.  They provide a very simple multi-language API that can be used to invoke any of their catalog of 3,500 different cloud-based algorithms. While we may be getting tired of reading about X-as-a-Service for different versions of X, there is one binding for X that has been around for a while in various forms and, as much as it pains me to do so, it begs to be called Algorithms as a Service.   And AaaS is just one of the things Algorithmia provides.      

AaaS is indeed not a new idea.  Jack Dongarra and his ICL team at the University of Tennessee created NetSolve/GridSolve in 2003 to provide scientists and engineers with access to state-of-the-art numerical algorithms running on a distributed network of high performance computers.   As cool as NetSolve is, Algorithmia goes several steps beyond this concept.

One of Algorithmia’s cofounders and its CEO, Diego Oppenheimer, has a deep background in building business intelligence tools.   While working on those he developed an appreciation of the power of being able to call out to powerful algorithms from inside a user-facing application.  This capability allows the application to have access to deeper knowledge and more powerful computational resources than are available on the user’s device.  A key insight from this experience is that algorithms must be discoverable and invokable from any user application runtime.   These ideas are all central to Algorithmia.  In the following paragraphs we will look at Algorithmia’s marketplace, explore building a new algorithm and discuss a bit of the system’s microservice architecture.

Algorithmia is a marketplace.  

There are over 50,000 developers that use Algorithmia services and the platform encourages these developers to contribute new algorithms to the collection.   Invoking an algorithm is dead simple and it can be done from any programming language that can formulate a JSON doc and send a REST message.   We will provide some detailed illustrations at the end of this document.  
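For example, a plain HTTP POST is all that is needed. The sketch below uses Python’s requests; the algorithm path and API key are placeholders, not something taken from this post.

import requests

# Placeholders: substitute your own API key and an algorithm path of the
# form <user>/<algorithm>/<version> from the marketplace.
url = "https://api.algorithmia.com/v1/algo/demo/Hello/0.1.1"
headers = {"Authorization": "Simple your_api_key",
           "Content-Type": "application/json"}

resp = requests.post(url, headers=headers, json="World")
print(resp.json())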

To use it, you need to set up an account.   Doing so will get you a starter award of 5,000 or so “credits”.   When you invoke an algorithm, credits are deducted from your account.   Typically, there is a “royalty” cost of about 10 credits and then the cost is usually around one credit per second of execution.   A fun example from their deep learning collection is an image colorizer.   Input is a PNG file of a black and white image and the returned value is a link to the output colorized image.  We took a color image from a visit to Red Square a few years ago.   We converted it to a grayscale image and gave that to the colorizer.  The result is illustrated below.  The original is on the left, grayscale in the middle and the colorized image on the right.   While it is not as good as the best hand-colored photos, it is not too bad.   It lost the amazing color of St. Basil’s Cathedral, which is not too surprising, but it was great with the sky and the skin tones of the people in the foreground.   (It seemed to think the bricks of the square would look better with some grass color.)

[Figure: original, grayscale and colorized versions of the Red Square photo]

The Python code to upload the grayscale image and invoke the service was incredibly simple.

import Algorithmia

client = Algorithmia.client('your_user_key_from_account_creation')
input = bytearray(open("path_to_grayscale.png", "rb").read())
result = (client.algo("deeplearning/ColorfulImageColorization/1.1.6")
          .pipe(input).result)
path_to_local_copy_of_result_image = client.file(result['output']).getFile()

The cost in credits was 154.   The exchange rate for credits is $1 = 10,000 credits (approximately), so this invocation cost about 1.5 cents.

This algorithm is from their extensive machine learning and AI collection.  A related algorithm is one that computes the salience of objects in an image.  Salience is the degree to which an object in the image attracts the attention of the viewer’s eye.   The algorithm is called SalNet and it is based on ideas from the paper Shallow and Deep Convolutional Networks for Saliency Prediction by Pan et al. (see arXiv:1603.00845v1).

As with the colorizer, SalNet is easy to invoke.

input = { "image": "data://.algo/deeplearning/SalNet/perm/an-uploaded-image.png" }
result2 = client.algo("deeplearning/SalNet/0.2.0").pipe(input).result

Note that in this case we have loaded the image from one that we uploaded to Algorithmia’s data cloud.  In fact, it is the same grayscale image of Red Square.  As you can see below, the algorithm picks out the woman in the foreground and also notices the church in the background.

Salience computation can be very helpful in identifying and labeling objects in an image.   Image tagging is also something that Algorithmia supports.   Running the same image through their tagger returned the observations that the image was “safe”, that there were multiple boys and multiple girls, sky and clouds, and that it seemed to be near a palace.

[Figure: salience map computed for the grayscale Red Square image]

There are many other AI related image algorithms such as nudity detection, character recognition, face detection and a very impressive car make and model recognition algorithm.   A quick look at https://algorithmia.com/use-cases will show many other fascinating use cases. 

Another very cool capability of Algorithmia is its ability to host your trained machine learning model.  Suppose you have a model you have built with MXNet, TensorFlow, Scikit-Learn, CNTK or any of the other popular ML frameworks; you can upload your model to Algorithmia so that it can be available as a service.   This is explained here. We will explore this capability in a later post.

While the main emphasis and attraction of the Algorithmia collection is machine learning and AI, there are many more algorithm categories represented there.  For example, there is an excellent collection of utilities for managing data and making certain programming tasks extremely easy, such as extracting text from web pages, Wikipedia search tools, and computing the timezone and elevation from latitude and longitude coordinates.

There is also a large collection of time series analysis algorithms.   These include forecasting, outlier detection, Fourier filters, auto-correlation computation and many more.

Algorithmia is a cloud of microservices

In an excellent talk at the 2017 GeekWire Cloud Summit, Oppenheimer described some key elements of Algorithmia’s architecture.  In this talk he makes the critically important observation that the two phases of machine learning, training and prediction, require very different execution environments when used in production.   Training is often done on a dedicated system consuming many hours of compute and as much memory as is available.   The result of training is a model codified as data.   Prediction (also called inference) uses the model to make predictions or inferences about a sample case.   Prediction can be done on the same hardware platform that was used for the training, but if the model is to be used to make predictions concerning thousands of cases for thousands of concurrent users, one needs a completely different design.

Their approach to the scale problem for predictions (and for any high-demand algorithm in their collection) is based on serverless microservices.    They use a Kubernetes microservice foundation with algorithms deployed in Docker containers.  Requests from remote client applications are load balanced across API servers that dispatch requests to container instances for the requested function.  The challenge is making the latency from request to reply very low.  If a container for an algorithm is already in system memory, it requires very little time to spawn a new instance on Kubernetes.  Another technique they use is to dynamically load algorithms into running containers.  (We don’t know the exact mechanism Algorithmia uses here, but we expect it is exploiting these facts.)

They have made some very interesting optimizations.   For example, if the data used in the computation is stored in one of their cloud regions, the docker instance will be instantiated nearby.   Just as important, if an algorithm invokes another algorithm they will attempt to co-locate the two containers and reduce the inter-process latency.  Composability of algorithms is one of their guiding concepts.  

Turning your own algorithm into a microservice

The process of turning your own algorithm into a microservice is remarkably simple.   From the Algorithmia portal there is a “+” symbol in the upper right-hand corner.   This gives you a dialog box to fill out.   You provide a name for your algorithm, the programming language you are using (from a long list; sorry, no Fortran or Julia, but there are lots of alternatives), and several other choices including: your source license policy, whether your function invokes other Algorithmia functions, and whether your function invokes things on the open internet.

Answering these questions causes Algorithmia to create a nice GitHub repo for your function.   Your next step is to install the Algorithmia command line interface and then you can clone your function’s GitHub repo.  Once you have done that you can edit the function so that it does what you want.   The basic skeleton is already there for you in the “src” directory.   Here is the basic skeleton in Python rendered as a hello world function.

import Algorithmia
# API calls will begin at the apply() method, 
# with the request body passed as 'input'
# For more details, see algorithmia.com/developers/algorithm- 
# development/languages
def apply(input):
    return "hello {}".format(input)

You can edit the function directly from an editor built into the Algorithmia portal or, now that you have a clone of the repo, you can use your own tools to transform this skeleton into your algorithm.   If you have done this work on your clone you need to use the Git commands to push your code back to the master.

We tried this with a small experiment.   We built a function called KeyPhrases that takes English language text as input and breaks it down into subjects (s), actions (a), which are like verb clauses, and objects (o).   The algorithm is not very useful or sophisticated.   In fact, it uses another Algorithmia microservice called Parsey McParseface, which was originally released by Google (see https://arxiv.org/pdf/1603.06042v1.pdf).   This is truly a deep parser that builds a very sophisticated tree.  For example, the figure below illustrates the tree for a parse of the sentence

Einstein’s general theory of relativity explains gravity in terms of the curvature of spacetime.


Parsey McParseface tree output.

Our function KeyPhrases walks the tree and groups the terms into subjects (s), objects (o) and actions (a), and returns a JSON document with the original string and the list of phrases.  It also breaks out separate subphrases with “/” marks.  In this case it returns

{"phrases":[
       "s: Einstein's general theory /of relativity ",
       "a: explains ",
       "s: gravity /in terms /of the curvature /of spacetime. "
        ],
  "text":"Einstein's general theory of relativity explains gravity in terms of the curvature of spacetime."
}

A more complex example is

Facebook Incs chief security officer warned that the fake news problem is more complicated to solve than the public thinks.

The phrase output is

['s: Facebook Incs chief security officer ',
 'a: warned ',
 'o: that the fake news problem ',
 'a: is more ',
 'o: complicated and dangerous /to solve /than the public thinks ']

This is clearly not as rich in detail as the Parsey output, but it does extract some useful key phrases. 

To complete the creation of the microservice for this algorithm one need only issue the git commands

$ git add src/KeyPhrases.py
$ git commit -m "added src mods"
$ git push origin master

The last push causes a compile step to happen and the microservice is now created.   Algorithmia also provides an easy template to add documentation and instructions about how to invoke your function.  From the Algorithmia editor there is a function that allows you to “publish” your algorithm.   After pushing that button, the KeyPhrase example was put in their library.   You can see it here: https://algorithmia.com/algorithms/dbgannon/KeyPhrases (If you use it, remember it has not been tested very well, so it may break.)

Algorithmia as an enterprise platform

The Algorithmia serverless microservice platform is robust enough that they offer it as an enterprise product.   This allows enterprises to host their own version on one of the public clouds, on their own clusters, or across multiple clouds in a hybrid system.    Their own internally used algorithms can then be hosted and invoked by their in-house analytics tools and pipelines in a totally scalable way.   This enterprise version comes with a management dashboard and monitoring tools.

Conclusions

Algorithmia is a fascinating company with very interesting products.   It is extremely easy to sign up for a free account and it is fun to use.   The team was extremely helpful when we had questions.  A Jupyter notebook with some of the examples mentioned above will be posted very soon.   We found experimenting with the various algorithms from an interactive notebook to be a pleasure.   Creating the hosted version of the KeyPhrases algorithm took less than an hour after the original Python code was debugged.   In our next experiment we will explore hosting deep learning models with Algorithmia.

Observations about Serverless Computing

With a few examples from AWS Lambda, Azure Functions and OpenWhisk.

Cloud computing is going through an interesting evolution.  It has gone from a platform for deploying virtual machines to planet-scale systems with extensive collections of data storage, analysis and machine learning services.   Most recently we have seen the emergence of “cloud native” computing, which in its most basic form involves a design pattern of microservices where big applications are decomposed into hundreds of basic stateless components that run on  clusters managed by tools like Kubernetes, Mesos and Swarm.

Serverless computing is the next step in this evolution.  It addresses the following challenges.  Suppose I have a small computation that I want to run against some database at the end of each month. Or suppose I want to have the equivalent of a computational daemon that wakes up and executes a specific task only when certain conditions arise.  For example, when an event arrives in a particular stream or when a file in a storage container has been modified or when a timer goes off.  How do I automate these tasks without paying for a continuously running server?   Unfortunately, the traditional cloud computing infrastructure model would require me to allocate computing resources such as virtual machines or a microservice cluster and my daemon would be a continuously running process.   While I can scale my cluster of VMs up and down, I can’t scale it to zero without my daemon becoming unresponsive.  I only want to pay for my computing WHEN my computation is running.

This is not a totally new idea.  Paying only for the compute that we use goes back to early timesharing and persists with compute-hour “allocations” on supercomputers today.  And there are cloud services such as Azure Data Lake Analytics, Amazon Kinesis or the AWS API gateway, that charge you only for the computation that you use or data that you move and they do not require you to deploy server infrastructure to use them.

However, there is something deeper going on here and it has to do with triggers and another computational paradigm called “Function-as-a-Service”(FaaS). Unlike the serverless examples above which depend upon me invoking a specific well-defined service, FaaS allows a cloud user to define their own function, and then “register” it with the cloud and specify the exact events that will cause it to wake up and execute. As mentioned above, these event triggers can be tied to changes in state of a storage account or database, events associated with queues or streams of data from IoT devices, web API invocations coming from mobile apps.  Triggers can even be defined by steps in the execution of a workflow.   And, of course, the user only pays when and while the function is executing.


Figure 1.  Function, triggers and output concept. (from markoinsights.com)

There have been two conferences that have focused on the state of serverless computing. The paper “Status of Serverless Computing and Function-as-a-Service (FaaS) in Industry and Research” by Geoffrey Fox, Vatche Ishakian, Vinod Muthusamy and Aleksander Slominski provides an excellent overview of many of the ideas, concepts and questions that surround serverless computing that surfaced at these conferences. In that paper they refer to an IBM tutorial that defines serverless FaaS as
1. short-running, stateless computation
2. event-driven applications
3. scales up and down instantly and automatically
4. based on charge-by-use

Notice that there is a distinction between FaaS and serverless FaaS and it has to do with item 4 above.   A good example of this is Google’s App Engine, which was arguably the first FaaS available from a commercial cloud.   In its current form App Engine can run in one of two modes.  In its standard mode, your applications run in a sandbox and you are charged only when the app is running.  In the “flexible” mode you deploy a container and then specify the compute infrastructure needed in terms of CPU power, memory and disk, and you are charged by the hour.  You could say that App Engine running in flexible mode is server-lite, but clearly not fully serverless, while standard mode is truly serverless.

What are the serverless FaaS choices?

There are a number of FaaS implementations.  Some of these are used for research while others are commercial products.  The status report refers to many of these and the slides for the workshop are online.   A good example of the research work is OpenLambda from the University of Wisconsin, first introduced at HotCloud ’16.  Based on this experience the Wisconsin team described Pipsqueak, an experiment to reduce the deployment latencies caused by Python library initializations.   Ryan Chard described Ripple, which is an excellent example of distributing event trigger management from the source to the cloud. Ripple has been designed and used for several significant science applications including beamline science (ALS and APS).  Another related technology is if-this-then-that (IFTTT), a service for chaining together other services.

Two other open source projects raise an interesting question about what is behind the curtain of serverless. Funktion and Fission are both implementations of FaaS on top of Kubernetes. As we discuss serverless computing we must remember that there is a “server” somewhere. The basic infrastructure for serverless computing needs to run somewhere as a persistent service and hence a microservice platform like Kubernetes is a reasonable choice. This relates to the economics of serverless computing and we return to that at the end of this report.

The most commonly referenced open source FaaS service is Apache OpenWhisk which was developed by IBM and is available on their Bluemix cloud as a service. The other commercial services include Google Function, Microsoft Azure Functions and Amazon Lambda. At the end of this article we will show some very simple examples of using some of these systems.

When can FaaS replace a mesh of microservices for building an app?

The status paper also makes several other important observations. For example, the authors note that while serverless is great for the triggered examples described above, it is not good for long-running or stateful applications like databases, deep learning training, heavy stream analytics, Spark or Hadoop analysis and video streaming. In fact, many large-scale cloud-native applications that have thousands of concurrent users require continuous use of massive networks of microservices, and these will not be based on serverless FaaS. However, there may be many cases where a user-facing application running in the cloud could be easily implemented with serverless FaaS rather than as a big microservice deployment. What we do not know is where the cross-over point from a serverless FaaS implementation of an app to a full Kubernetes-based massive microservice deployment lies. This relates to the economics of FaaS (discussed briefly at the end of this article). The Cloud Native Computing Foundation has a working group on serverless computing that is addressing this topic.

There is one interesting example of using serverless FaaS to do massively parallel computing and it is called pywren. The lovely paper “Occupy the Cloud: Distributed Computing for the 99%” by Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica and Benjamin Recht describes the concepts in pywren, which allow it to scale computation to thousands of concurrent function invocations, achieving 40 teraflops of compute performance. Pywren uses AWS Lambda in a very clever way: it serializes the computational functions, which are then passed to Lambda functions to execute. We will return to pywren in another post.
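The programming model pywren exposes is essentially a serverless map. A sketch, assuming pywren is installed and configured with AWS credentials (the workload shown is an arbitrary example, not one from the paper):

import numpy as np
import pywren

def matmul_norm(n):
    # Each call runs inside its own AWS Lambda invocation.
    a = np.random.rand(n, n)
    return np.linalg.norm(a.dot(a))

pwex = pywren.default_executor()
futures = pwex.map(matmul_norm, [1024] * 100)   # 100 concurrent lambdas
results = pywren.get_all_results(futures)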

Computing at the Edge?

Perhaps the hottest topic in the general area of cloud computing is when the computing spills out of the cloud to the edge of the network. There are many reasons for this but most boil down to latency and bandwidth. A good example is the set of use cases that motivate the Ripple system described above. In order to generate a stream of events from sensors at the edge of the network, one needs very lightweight computing that can monitor them and generate the events. In many cases it is necessary to preprocess the data in order to send the message to the cloud where a function will be invoked to respond. In some cases, the response must return to the source sooner than a remote invocation can respond because of the latencies involved. The computing at the edge may need to execute the function there and the actual communication with the cloud may be just a final log event or a trigger for some follow-up action.

Another possibility is that the functions can migrate to the places where they are needed. When you deploy the computing at the edge you also deploy a list of functions that must be invoked. When the edge process starts up it could cache some of the functions it needs locally. We expect there will be many variations on these ideas in the future.

A Tiny FaaS Tutorial

By their very nature FaaS systems are very simple to use: you write short, stateless functions and tie them to triggers. We will take a quick look at three of these: AWS Lambda, Microsoft Azure Functions and IBM Bluemix OpenWhisk.

AWS Lambda Functions

Amazon was the first to introduce serverless functions as a service and it is very well integrated into their ecosystem of other services. Called Lambda functions, they can be easily created in a number of standard programming languages. Lambda functions can be associated with a variety of trigger events including changes to the state of a storage account, web service invocations, stream events and even workflow events.

We will illustrate Lambda with a simple example of a function that responds to Kinesis Stream events and for each event it adds an item to a dynamoDB table. Here is the python code for the function that accomplishes this task.

from __future__ import print_function
#import json
import base64
import boto3
def lambda_handler(event, context):
    dyndb = boto3.resource('dynamodb', region_name='us-west-2')
    table = dyndb.Table("lambdaTable")
    
    for record in event['Records']:
       #Kinesis data is base64 encoded so decode here
       payload=base64.b64decode(record["kinesis"]["data"])
       x = eval(str(payload))
       metadata_item = {'PartitionKey': x['PartitionKey'], 
                        'RowKey': x['RowKey'], 'title': x['text']}
       table.put_item(Item=metadata_item)

There are several critical items that are not explicit here.   We need to invoke the AWS Identity and Access Management (IAM) system to delegate some permissions to our function and we will need to grab copies of the Amazon Resource Names (ARNs) for this and other objects. First we create an IAM role that will allow access to Kinesis streams and DynamoDB. Using the IAM portal we created a role called “lambda-kinesis-execution-role” and attached two policies, “AmazonDynamoDBFullAccess” and “AWSLambdaKinesisExecutionRole”. We then made a copy of the Role ARN.
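We did this through the IAM portal, but the same role can be created programmatically; a sketch with boto3 (the managed policies are the two named above; everything else follows the standard Lambda trust relationship):

import json
import boto3

iam = boto3.client('iam')

# Standard trust policy letting the Lambda service assume the role.
trust = {"Version": "2012-10-17",
         "Statement": [{"Effect": "Allow",
                        "Principal": {"Service": "lambda.amazonaws.com"},
                        "Action": "sts:AssumeRole"}]}

role = iam.create_role(RoleName='lambda-kinesis-execution-role',
                       AssumeRolePolicyDocument=json.dumps(trust))

for arn in ['arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess',
            'arn:aws:iam::aws:policy/service-role/AWSLambdaKinesisExecutionRole']:
    iam.attach_role_policy(RoleName='lambda-kinesis-execution-role',
                           PolicyArn=arn)

print(role['Role']['Arn'])   # the Role ARN used in create_function below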

The next step is to install this function into the Lambda system. To do that we put the function above into a file called “ProcessKinesisRecords.py” and then zipped it. We then uploaded the zipped file to S3 (in the us-west-2 region) in the bucket “dbglambda”. With that we can create our function with the boto3 call from our laptop as:

lambdaclient = boto3.client('lambda')

response = lambdaclient.create_function(
    FunctionName='lambdahandler',
    Runtime='python2.7',
    Role='arn:aws:iam::066xx534:role/lambda-kinesis-execution-role',
    Handler='ProcessKinesisRecords.lambda_handler',
    Code={
        'S3Bucket': 'dbglambda',
        'S3Key': 'ProcessKinesisRecords.zip',
    }
    )

The final item we need is the link between this function and the Kinesis Stream service.
To do that we went to the portal for Kinesis and created a stream called “mylambdastream” and grabbed its ARN. Creating the binding is accomplished with the following.

response = lambdaclient.create_event_source_mapping(
    EventSourceArn=
        'arn:aws:kinesis:us-west-2:06xxx450734:stream/mylambdastream',
    FunctionName='lambdahandler',
    BatchSize=123,
    StartingPosition='TRIM_HORIZON',
)

We can verify the properties of the function we have created by looking at the AWS Lambda portal pages. As shown below, we can verify that our lambdahandler function does indeed have our stream as its trigger.


Figure 2. AWS Lambda Function dashboard showing our trigger.

Finally we can invoke it by pushing an event to the kinesis stream. This is shown below.

client = boto3.client('kinesis')
record = "{'PartitionKey':'part2', 'RowKey': '444', 'text':'good day'}"
resp = client.put_record(StreamName='mylambdastream',
    Data=bytearray(record),
    PartitionKey='a')

Checking the DynamoDB portal will verify that the function has picked the message from Kinesis and deposited it in the database. The full details are in the notebook “simple-kinesis-lambda-sender.ipynb”.

Azure Functions

The Azure function portal has a large number of basic templates we can use to build our function as shown below.


Figure 3.  Azure Function Template samples

We have selected one of the Python examples that creates a simple web service. Bringing this up on the portal we see the code below.

import os
import json

postreqdata = json.loads(open(os.environ['req']).read())
response = open(os.environ['res'], 'w')
response.write("hello world to "+postreqdata['name'])
response.close()

To get this running, we now must go to the “integrate” page to provide one more detail: we set the authorization level to “anonymous” as shown below.


Figure 4.  Setting the authorization level for the trigger to “anonymous”.

The client is also simple and shown below.

import json
import requests

data = {}
data['name'] = 'goober'
json_data = json.dumps(data)
r = requests.post("https://pylook.azurewebsites.net/api/pylook",
                  data=json_data)
print(r.status_code, r.reason)
print(r.text)

Using Azure Functions to pull data from a queue.

We now give another version of the function that reads messages from a queue and puts them in a table. There is no Python template for this one yet, so we will use JavaScript. To use this template we must first create a new storage account or use an existing one. Go to the storage account page and you will see the view shown in Figure 5.

Figure 5. Creating a new table and queue in a storage account.

We click on Table and create a new table and remember its name. Then we go to Queues and create a new one and remember its name. Looking in the storage explorer should show these items. Clicking on the table name should give us a picture of an empty table.

Go back to the portal main page and click + and look for “Function App” and click create. There is a table to fill in like the one below.


Figure 6.  Creating a new function.

We give it a name and allow it to create a new resource group. For the storage we use the dropdown and look for the storage account name. (It is important that the storage account is in the same location as the function.) We click create and wait for the function to appear on the function portal page. You should see it in your resource groups. Follow that link and you will see that it is running.

Go to the functions tab and hit “+”. It will ask you to pick one of the templates. At the top where it says “language” select JavaScript and pick the one that is called QueueTrigger. This will load a basic template for the function. Now edit the template so that it looks like the following example.

Figure 7.  The edited QueueTrigger function.

The main difference between this and the template is that we have added an output table and instructions to push three items into the table. The function is assuming that the items in the queue are of the form

{'PartitionKey': 'part1', 'RowKey': '73', 'content': 'some data'}

Next we need to tie the function’s input to our queue and its output to our table. So click on “Integrate” on the left. You need to fill in the form so that it ties the function to your resources as illustrated below.


Figure 8.  The association between the storage account queue and table service and the variables in our function.  Here we have highlighted “show value” to verify that it has the right storage account.

You should see your storage account in the dropdown menu. Select it and add the Table name. You need to do the same for the AzureQueueStorage. Once this is done and your function is saved and the system is running, your function should be instantiated and invoked as soon as you send it queue items. For that we have a simple Python script in a Jupyter notebook, which you can get from https://SciEngCloud.github.io/py-functions-queue-driver.ipynb.
You will need to fill in your account key for the storage account, but then you should be able to step through the rest. The notebook runs a few tests and then sends 20 items to the queue (a sketch of that step is below). Using the Azure storage explorer we see the results in the table as shown in Figure 9.
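A sketch of the kind of sender used in that notebook, written against the azure-storage Python package of that era (the account name, key and queue name are placeholders):

import json
from azure.storage.queue import QueueService

# Placeholders: your storage account name, key and queue name.
queue_service = QueueService(account_name='mystorageaccount',
                             account_key='your_account_key')

for i in range(20):
    item = {'PartitionKey': 'part1', 'RowKey': str(i), 'content': 'some data'}
    queue_service.put_message('myqueue', json.dumps(item))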


Figure 9.   The view of the table after running the Jupyter Notebook

OpenWhisk and IBM Bluemix

OpenWhisk is the open source serverless function system developed by IBM that is also supported as a commercial service on their Bluemix cloud.  Like the others it is reasonably easy to use from their command-line tools, but the Bluemix portal provides an even easier solution.   Creating a function is as easy as selecting the runtime and downloading a template.  Figure 10 below illustrates a simple function derived from the template for a Python 3 function.


Figure 10.  An OpenWhisk function that decodes a dictionary input and returns a dictionary.
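Since the figure is an image, here is a sketch of the general shape such an action takes (the exact body shown in the portal template may differ): an OpenWhisk Python action is simply a main() function that receives a dictionary and returns one.

# An OpenWhisk Python action: main() takes a dict and returns a dict.
def main(params):
    name = params.get('name', 'stranger')
    return {'greeting': 'Hello ' + name + '!'}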

Notice that the parameter to the function is a Python dictionary (all OpenWhisk messages are actually JSON objects, which for Python are rendered as dictionaries).  While this template can be run as a web service in its present form, it is also possible to connect it to an external trigger.  To illustrate this we connected it to a trigger that monitors activity in a GitHub account.  Using the portal it was easy to create the trigger (called “mygithubtrigger”) and bind it to push and delete actions on a repository called dbgannon/whisk.   All this required was an access token, which was easily available by logging in to the GitHub portal.   In the case of GitHub triggers the event returns a massive JSON object that is a complete description of the repository.   In the action “myaction” we go two levels down in the dictionary and extract the official repository description and make that the response of the action to the event.

When a trigger fires you need a rule which binds the firing to an action.   We bound it to the trivial “myaction” example above.  The view of the rule from the portal is below.


Figure 11.   The rule view of our association of the trigger to the function

We next added a new file to the repository.  This activated the trigger and the rule invoked the action.   The portal has a nice monitor facility and the image is below.

Figure 12.  This is the “Monitor” view after the trigger has fired.

Finally, drilling down on the “myaction” log event we see the description of the GitHub repository we created.

Figure 13.  The output of the “myaction” function after a file was added to the GitHub repository.

Finally

The examples above are all very trivial.   The next thing to explore is how functions can be composed into workflows.   Each of the three systems has its own way of doing that and if we have time later we will show some examples of this capability.   We have also not discussed performance or cost, which is greatly dependent on the rate at which your triggers are firing and the amount of work in each function execution.

The economics of serverless computing are also very interesting.  As we pointed out earlier, for a cloud provider to offer a FaaS serverless capability it must be supported by actual infrastructure.   How can the provider afford this if it is only charging by the second?  There are two possible answers.   First, if your FaaS is serving very large numbers of functions per second then it will certainly pay for itself.   But there is another consideration.   Many of the big cloud providers are running vast microservice frameworks that support most, if not all, of their big internal and publicly available applications.  Running a FaaS system on top of that is as easy as running any other microservice-based application.  One only needs to set the cost per second of function evaluation high enough to cover the cost of that share of the underlying platform the FaaS system is using.   For Amazon Lambda, “The price depends on the amount of memory you allocate to your function. You are charged $0.00001667 for every GB-second used.”  This amounts to $0.06 per GB-hour, so a function that takes 8 GB of memory to execute costs $0.48 per hour, which is 4.8 times greater than the cost of an 8GB m4.large ec2 instance.   Consequently a heavily used FaaS system is a financial win and a lightly used one may have little impact on your infrastructure.

Of course, single executions of a FaaS function are limited to five minutes, so you would need twelve 5-minute concurrent executions to reach an hour of execution time. Furthermore, AWS gives you 1 million invocations (at 100ms) each month for free, or 400,000 GB-seconds per month free.  That is 111 GB-hours per month for free.   This is not a bad deal and it would seem to indicate that Lambda is not a big drain on their infrastructure yet.

Acknowledgement:  Thanks go to Ryan Chard for helpful comments and suggestions on a draft of this post.

Azure’s new CosmosDB Planet-Scale Database

The Microsoft Azure team has recently released CosmosDB, a new entry in the cloud data storage management marketplace. Cosmos is actually the name of a data storage system that has been used internally at Microsoft for many years. That original Cosmos has now morphed into Azure Data Lake Storage (ADLS) and Analytics. ADLS is focused on Hadoop/HDFS compatible scalable analytics and, at the present time, it is available in only two of the US data centers. CosmosDB is related to the original Cosmos in name only and it is available at all of the data centers.

There is an excellent technical description [https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db], so we will hit only a few highlights here.  CosmosDB has four main API components and a number of interesting properties. From the user’s perspective these components are

  1. DocumentDB: “a NoSQL document database service designed from the ground up to natively support JSON and JavaScript directly inside the database engine.”
  2. MongoDB: CosmosDB implements the MongoDB API so that users of that popular NoSQL database can easily move to CosmosDB.
  3. Azure Table Storage: a simple key-value table mechanism that has been part of Azure from its earliest days.
  4. Gremlin Graph Query API: a graph traversal language based on functional programming concepts.

There is a common “resource model” that unifies all of these capabilities. A database consists of users with varying permissions and containers. A container holds one of the three content types: document collections, tables or graphs. Special resources like stored procedures, triggers and user-defined functions (UDFs) are also stored within the container. At the top level the user creates a CosmosDB database account and in doing so the user must pick one of these four APIs. In the examples we show here we focus on the DocumentDB API. In a later post we will look at Gremlin.

There are five important properties that every CosmosDB database has.

  1. Global distribution: your database is replicated in any of 30+ different regions and you can pick these from a map on the Azure portal.
  2. Elastic scale-out: you can scale the throughput of a container programmatically at a second or minute granularity; throughput is provisioned in a currency unit called a Request Unit (RU).
  3. Guaranteed low latency: for a typical 1KB item, Azure Cosmos DB guarantees end-to-end latency of reads under 10ms and indexed writes under 15ms at the 99th percentile within the same Azure region.
  4. Five consistency models.
  5. A comprehensive Service Level Agreement (SLA).

Global Distribution

CosmosDB has two forms of distribution. Each database is composed of one or more collections and every data collection is stored in a logical container that is distributed over one or more physical server partitions. And, of course, everything is replicated. Every item in a collection has a partition key and a unique ID. The partition key is hashed and the hash is used to make the assignment of that item to one of the physical servers associated with that container. This level of partitioning happens automatically when you create a collection and start filling it.

Global distribution is more interesting. When you first create a database it has an initial “write location” which refers to one of the 30 or so Azure regions around the globe. You can use the portal to say how your data is distributed to other regions. In the example we discuss at the end of this document, our original region is “North Central US”. We have used the “Replicate Data Globally” function, which gives us a map in the portal, where we selected three additional regions, “West Europe”, “South India” and “Brazil South”, as shown in Figure 1 below.


Figure 1. Selecting three additional locations for a database using a map tool.

With a small (170MB) database of documents it took only a few minutes to replicate the data to these locations. The data collections in the replicated sites are considered “read locations”, which means that a user in these locations will by default read data from the local replica. However, if that remote user attempts a write the request is routed to the “write location”. To understand how long a remote reader takes to see an update from the remote write location we need to discuss the consistency models.

Elastic scale-out and The Cost of Computing

Every operation you do in CosmosDB requires bandwidth and computation. You can control the rate at which “energy” is consumed for your database with something called Request Units (RU) and RUs/sec and RUs/min. You specify the RUs/sec you are willing to “spend” on your database and CosmosDB will provision the resources to meet your throughput requirements.

Consistency Models

Consistency is one of the most difficult topics in distributed systems. In the case of distributed or replicated databases, the consistency model tells the user how changes in one copy of a database are reflected as updates to the other copies. Insisting that consistency be very strong may have an adverse impact on latencies and throughput, so the choice of consistency model can have a profound impact on application performance and cost.

One of the truly unique features of CosmosDB is that it gives the user a choice of five different consistency models. In order from the weakest to the strongest and in terms of RUs, the cheapest to most expensive they are:

  1. Eventual consistency. Eventual consistency is the weakest form of consistency, wherein a client may over time get values that are older than the ones it had seen before. In the absence of any further writes, the replicas within the group will eventually converge.
  2. Consistent prefix. The consistent prefix level guarantees that reads never see out-of-order writes. If writes were performed in the order `A, B, C`, then a client sees either `A`, `A,B`, or `A,B,C`, but never out of order like `A,C` or `B,A,C`.
  3. Session consistency. When you connect to a CosmosDB database through its URL you are creating a session. Session consistency guarantees monotonic reads, monotonic writes, and read-your-own-writes (RYW) guarantees for the duration of the session. In the Python API you create a client object that encapsulates your connection to the database; its consistency-level parameter defaults to “Session” (see the sketch after this list). Reads in session consistency take a bit longer than consistent prefix, which takes longer than eventual consistency.
  4. Bounded staleness. This model has two ways to insist on consistency. You can define a time interval such that beyond that interval of time from the present, the system will guarantee consistency. Alternatively you can specify an upper bound on the number of writes your reads may lag behind. Beyond that point consistency is guaranteed. Of course the smaller you make the window, the more computing you may consume and the more delay you may encounter.
  5. Strong consistency. The most expensive and most complete: it is linearizable in that reads always return the most recent writes. But it is limited to collections that are not geo-distributed.
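A sketch of how the consistency level is chosen when creating the pydocumentdb client (the account URL and key are placeholders; Session is the default when the argument is omitted):

import pydocumentdb.documents as documents
import pydocumentdb.document_client as document_client

# Placeholders: the URI and primary key from your CosmosDB account page.
client = document_client.DocumentClient(
    'https://youraccount.documents.azure.com:443/',
    {'masterKey': 'your_account_key'},
    consistency_level=documents.ConsistencyLevel.Eventual)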

A word about the performance of these models. In the example at the end of this section we put the write region of the database in North America and replicated in three other locations with one in Western Europe. We then created a virtual machine in Western Europe and connected it to the database and verified by the IP address that it was connecting to the local replica. We then fired changes at the North America region and read with each of the consistency models. We were unable to detect any out of order reads and the responses were extremely fast. This is obviously not a full scientific study, but it was clear we had a system that performed very well. We also attempted a write to the local Western Europe “read location”. We were able to detect that the API call immediately dropped the read location connection and reconnected to the North American copy to do the write.

Comprehensive SLA

The service level agreement (SLA) for CosmosDB is actually a contract that Azure and you agree to when you create a database. It is indeed comprehensive. It covers guarantees for availability, throughput, consistency and latency giving you upper bounds for each measurable attribute of performance. As you are the one to specify how many RUs/sec you are willing to allocate, some of the temporal guarantees may be dependent upon this allocation not being exceeded. We will not go into it in detail here. Read it with your lawyer.

A look at Azure DocumentDB using Python and GDELT Data

To illustrate the basics of DocumentDB we will use a tiny part of the amazing GDELT event collection. This data collection is available on AWS S3 as well as Google’s BigQuery. The GDELT project (http://gdeltproject.org) is the brainchild of Kalev Leetaru from Georgetown University and it is a serious “Big Data” collection. GDELT’s collection is the result of mining hundreds of thousands of broadcast, print and online news sources from every corner of the world every day. What we will look at here is a microscopic window into the daily collection of news items. In fact, we will take a look at the news from one day: June 30, 2017.

AWS keeps the data on S3 and downloading it is dead easy. The command

$ aws s3 cp s3://gdelt-open-data/events/20170630.export.csv . 

will download the 20MB June 30, 2017 dataset as a CSV file. Each row of the file is the record of an “event” consisting of some 60 attributes that catalog the publication of a news item. Among the fields in the record are a unique identifier, a timestamp and the name and geolocation of two “actors” that are part of the event. In addition, there is a URL of a published news item about the event. The actors are derived from the event. Actor1 is called the initiator of the event and Actor2 is a recipient or victim of the event. These actors are usually represented by the names of the cities or countries associated with the actual actors.

For example the story “Qatar’s defense minister to visit Turkey amid base controversy” (http://jordantimes.com/news/local/qatar’s-defence-minister-visit-turkey-amid-base-controversy) appeared in the Jordan Times. It describes an event in which the Qatari defense minister visited Turkey as Turkey resisted pressure from Saudi Arabia, the United Arab Emirates, Egypt and Bahrain to close bases after those nations imposed sanctions against Qatar on June 5. This URL appears three times in the database for June 30. In one case, actor1 is Doha, Qatar and actor2 is Ankara, Turkey. In the second case actor1 is Riyadh, Saudi Arabia and actor2 is Turkey. In the third case actor1 and actor2 are both Riyadh. Though the identity of the actor cities is based on an automated analysis of the story, there may be three different stories to analyze. However, only one URL has been selected to represent all three.

For our purposes we will describe this as a single event that links the three cities, Doha, Ankara and Riyadh. Many researchers have used this GDELT data to do deep political and social analysis of the world we live in, but that goal is way beyond what we intend here. What we will try to do in this example is to use CosmosDB to graphically illustrate different ways we can represent the events of June 30 and the cities they impact.

Creating the DocumentDB from the CSV file.

We will begin by creating a simple document database where each document is the JSON record for one line of the CSV file. Actually, we will only include the URL, the two actor cities and their geolocations. To begin we must connect your client program to the CosmosDB service.

We use Python and we assume that the DocumentDB Python SDK (pydocumentdb) has been installed.

#run this if pydocumentdb is not installed
#!pip install pydocumentdb
import csv
import pydocumentdb
import pydocumentdb.document_client as document_client

We next create the database and the collection. But first we need an Azure CosmosDB account, which is best created on the Azure portal. Once that is there, you can retrieve the account key so that we can create the collection.
Our account is called bookdocdb, our database is called ‘db3’ and the new collection will be called ‘gdelt’.

config = { 
    'ENDPOINT': 'https://bookdocdb.documents.azure.com',
    'MASTERKEY': 'your db key here',
    'DOCUMENTDB_DATABASE': 'db3',
    'DOCUMENTDB_COLLECTION': 'gdelt'
}
#create the client object that encapsulates the connection to the account
#(the calls below all go through this client)
client = document_client.DocumentClient(config['ENDPOINT'],
                                        {'masterKey': config['MASTERKEY']})
#we create the database, but if it is already there CreateDatabase will fail
#in that case we can query for it instead
#this is handy because we can have more than one collection in a given database
try:
    db = client.CreateDatabase({'id': config['DOCUMENTDB_DATABASE']})
except:
    db_id = config['DOCUMENTDB_DATABASE']
    db_query = "select * from r where r.id = '{0}'".format(db_id)
    db = list(client.QueryDatabases(db_query))[0]

Next we will create the collection. Again, the collection may already be there; in that case we can add new items to it, but we need a different method to retrieve the collection handle.

options = {
    'offerEnableRUPerMinuteThroughput': True,
    'offerVersion': "V2",
    'offerThroughput': 400
}
try:
    collection = client.CreateCollection(db['_self'], 
         { 'id': config['DOCUMENTDB_COLLECTION'] }, options)
except:
    collection = next((coll for coll in client.ReadCollections(db['_self']) \
         if coll['id']==config['DOCUMENTDB_COLLECTION']))

The next step is to read the CSV file and add the new documents to the database collection.
We are going to do this one line at a time and keep only the time stamps, actor cities and geolocations (latitude and longitude) and the associated URL. We will also keep a separate list of the URLs for later use.

url_list = []
with open('20170630.export.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter='\t')
    for row in readCSV:
        url_list.append(row[57])
        document1 = client.CreateDocument(collection['_self'],
            {
             'id': row[0],
             'time_stamp': row[1],
             'time_stamp2': row[4],
             'city0_lat': row[46],
             'city0_lon': row[47],
             'city0_name': row[43],
             'city1_lat': row[53],
             'city1_lon': row[54],
             'city1_name': row[50],
             'url': row[57]
            }
        )

The next thing we will do is to look at the URLs that involve a large number of cities (because they are often more interesting). Because we have the geolocations of each city we can draw a map and link the cities associated with the URL.

To do this we need to remove all the duplicates in our URL list using a standard Python trick: convert the list to a set and then back to a list.

s = set(url_list)
url_list = list(s)

We next query the database collection to find the cities associated with each URL. The best way to do this in the cloud is a simple map-reduce operation: map each record to a pair consisting of a URL and a list of cities, then invoke a reduce-by-key function that reduces all pairs with the same key to a single pair whose city lists have been concatenated.
If we were doing more than a single day’s worth of data, it would be worth bringing Spark into the picture and using it for the map-reduce step (a sketch of that alternative appears after the function below). Instead we will do it sequentially, using a database query for each URL. The following function accomplishes that task.

def get_cities_for_url(url):
    db_query = {'query':"select * from r where r.url = '{0}'".format(url) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 1000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    cities = []
    for x in results:
        if x['city1_name'] != '':
            cities.append(x['city1_name']+" lat="+x['city1_lat'] + ' lon='+x['city1_lon'])
        if x['city0_name'] != '':
            cities.append(x['city0_name']+" lat="+x['city0_lat'] + ' lon='+x['city0_lon'])
    city_set = set(cities)
    cities = list(city_set)
    return cities
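
As an aside, if we did bring Spark into the picture as suggested above, the map-reduce step might look like the following minimal PySpark sketch (our own reconstruction, not code from the original notebook); it maps each row to a (URL, city list) pair and concatenates the lists for matching URLs.

#a hypothetical PySpark version of the map-reduce step described above
from pyspark import SparkContext

sc = SparkContext(appName='gdelt-cities')
rows = sc.textFile('20170630.export.csv').map(lambda line: line.split('\t'))
#map each record to (url, [actor cities]) and concatenate lists with the same URL
pairs = rows.map(lambda r: (r[57], [c for c in (r[43], r[50]) if c != '']))
url_to_cities = pairs.reduceByKey(lambda a, b: a + b)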

The function returns a list of strings, where each string is a city and its geolocation. We used the matplotlib Basemap tools to render the most interesting sets of cities, where we define “interesting” to mean those URL sets with more than 5 cities. In this case we have plotted only a few hundred of the city clusters.
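
The exact plotting code is not reproduced here, but a minimal Basemap sketch along the following lines will draw one such city set; it assumes the string format produced by get_cities_for_url() above (“name lat=xx lon=yy”).

#a minimal sketch of rendering one city set with matplotlib Basemap
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def parse_city(s):
    #split "name lat=xx lon=yy" into its parts
    name, rest = s.split(' lat=')
    lat, lon = rest.split(' lon=')
    return name, float(lat), float(lon)

def plot_city_set(cities):
    m = Basemap(projection='robin', lon_0=0)
    m.drawcoastlines()
    pts = [parse_city(c) for c in cities]
    for _, lat, lon in pts:
        x, y = m(lon, lat)
        m.plot(x, y, 'ro', markersize=4)
    #link every pair of cities in the set with a great circle
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            m.drawgreatcircle(pts[i][2], pts[i][1], pts[j][2], pts[j][1], linewidth=1)
    plt.show()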

Figure 2.   Cities linked by shared stories.

While this is colorful, it is not very informative. We can tell there were some events in Australia that involved Europe and the US (this turned out to be a scandal involving an important person in the Catholic Church), and there were a number of stories that involved North Korea, the US and China. Something happened in the Northwest that involved cities on the east coast of the US. In general, the information content is low.

To create a different view, let’s transform the collection into one that reflects the set of cities associated with each URL. To accomplish that we call the get_cities_for_url() function above and create a new document that has only three attributes: an ID, the URL and the list of cities associated with the URL. The new collection is called “cities”, but it is really more like a dictionary that associates each URL with a list of cities.  (The Jupyter Notebook used to build this collection and illustrate the example below is available here.)
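
A minimal sketch of that transformation (our own reconstruction, not the notebook code; the handle name cities_coll is ours) might look like the following.

#build the "cities" collection: one document per URL with its list of cities
try:
    cities_coll = client.CreateCollection(db['_self'], {'id': 'cities'}, options)
except:
    cities_coll = next((coll for coll in client.ReadCollections(db['_self'])
                        if coll['id'] == 'cities'))

for i, url in enumerate(url_list):
    client.CreateDocument(cities_coll['_self'],
        {'id': str(i), 'url': url, 'cities': get_cities_for_url(url)})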

One thing we can do is to search for special words or phrases in the URLs and list the cities that are associated with them. To do this we can make use of another feature of DocumentDB and create a user defined function to do the string search. This is extremely easy and can be done from the Azure Portal. The functions are all written in JavaScript. Go to the Data Explorer tab for the “cities” collection and select the “New User Defined Function” tab. In our case we create a function “findsubstring” that searches for a substring “sub” in a longer string “s”.

function findsubstring(s, sub){
    var t = s.indexOf(sub)
    return t
}

We can now call this function in a database query as follows.

def checkInDB(name):
    #here 'collection' is the handle for the "cities" collection described above
    db_query = {'query':"select r.url, r.cities from r where udf.findsubstring(r.url,'{0}')>0".format(name) }
    options = {}
    options['enableCrossPartitionQuery'] = True
    options['maxItemCount'] = 10000
    q = client.QueryDocuments(collection['_self'], db_query, options)
    results = list(q)
    return results

Picking a name from the news, “Ivanka”, we can see what stories and cities mentioned “Ivanka” on that day. The call checkInDB(“Ivanka”) returns

[{'cities': ['Oregon, United States', 
'White House, District of Columbia, United States',
'Buhari, Kano, Nigeria',
'Canyonville Christian Academy, Oregon, United States ', 
'Chibok, Borno, Nigeria', 
'Virginia, United States'],
'url': 'http://www.ghanaweb.com/GhanaHomePage/world/Trump-and-Ivanka-host-two-Chibok-girls-at-the-White-House-553963'}]

which is a story about a reception for two girls who were rescued from Boko Haram in Africa. The list of associated cities includes Canyonville Christian Academy, a private boarding school that has offered to continue their education.

A More Interactive Abstract View of the Data.

Another interesting approach to looking at the data is to consider stories that link together smaller cities in some way. We are going to look for “local” story connections rather than big news items. More specifically, we will look at cities that appear in at least 3 but no more than 5 stories in the day’s news. This leaves out big cities like Washington DC and the big global headlines. Next we will build a graph that we can use to explore the stories and cities that share these more regional stories. There are 245 cities that appear in our restricted set.

In order to make our graph we need a dictionary that maps each city to the list of stories that mention it. Call that dictionary “stories” (a sketch of one way to build it appears after the edge computation below). Our visualization has nodes that are cities, and we connect two cities with an edge if they appear in a story together. Let “linkcities” be the list of 245 cities described above. We can then compute the edges as follows.

edges = []
for i in range(0, len(linkcities)):
    for j in range(0, len(linkcities)):
        if i < j:
            iset = set(stories[linkcities[i]])
            jset = set(stories[linkcities[j]])
            kset = iset.intersection(jset)
            if len(kset) > 0:
                edges.append((i, j))
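
For completeness, here is a minimal sketch (our reconstruction, not the notebook code) of how the “stories” dictionary and the linkcities list might be built from the “cities” collection; cities_coll is the collection handle from the earlier sketch.

#map each city string to the URLs that mention it, then keep the "regional" cities
from collections import defaultdict

stories = defaultdict(list)
q = client.QueryDocuments(cities_coll['_self'],
                          {'query': 'select r.url, r.cities from r'},
                          {'enableCrossPartitionQuery': True})
for doc in q:
    for city in doc['cities']:
        stories[city].append(doc['url'])

#cities that appear in at least 3 but no more than 5 stories
linkcities = [c for c, urls in stories.items() if 3 <= len(urls) <= 5]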

The complete code for this example is given in this Jupyter notebook. We use the Plotly package to build an interactive graph. The nodes are distributed approximately over the surface of a sphere in 3D. You can rotate the sphere and mouse over nodes; doing so shows the URLs for the stories associated with that town. The sphere appears in the image below.
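
The notebook has the real code; the following minimal Plotly sketch (our assumption of one possible layout, not the notebook’s) distributes the nodes on a sphere with a Fibonacci spiral, draws the edges computed above, and attaches hover text (here just the city names) to each node.

#a minimal Plotly sketch: city nodes on a sphere, linked by the edges above
import numpy as np
import plotly.graph_objs as go
from plotly.offline import plot

n = len(linkcities)
#Fibonacci-spiral points, roughly uniform on the unit sphere
theta = np.arccos(1 - 2 * (np.arange(n) + 0.5) / n)
phi = np.pi * (1 + 5 ** 0.5) * np.arange(n)
xs, ys, zs = np.sin(theta) * np.cos(phi), np.sin(theta) * np.sin(phi), np.cos(theta)

edge_x, edge_y, edge_z = [], [], []
for i, j in edges:
    edge_x += [xs[i], xs[j], None]   #None breaks the line between segments
    edge_y += [ys[i], ys[j], None]
    edge_z += [zs[i], zs[j], None]

lines = go.Scatter3d(x=edge_x, y=edge_y, z=edge_z, mode='lines', hoverinfo='none')
nodes = go.Scatter3d(x=xs, y=ys, z=zs, mode='markers',
                     text=linkcities, hoverinfo='text')
plot(go.Figure(data=[lines, nodes]), filename='city-graph.html')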

Figure 3.  Interactive city graph

A better way to see this is to interact with it. You can do this by going to the bottom panel in this file [https://SciEngCloud.github.io/city-graph.html].

Conclusion

Our experience with CosmosDB has been limited to using the DocumentDB API, but we were impressed with what we saw. In particular, the ease with which one can globally distribute the data was impressive, as was the selection of consistency protocols. As we said, we were unable to force the database to show temporary inconsistency even with the weakest protocol choice and very distant copies of the database. This should not be considered a fault; our tests were simply too easy.  The examples above do not begin to stretch all the capabilities of CosmosDB, but we hope they provide a reasonable introduction for the Python programmer.

One final comment. It seems CosmosDB was built on top of the Azure microservice container system. It would be very interesting to see more details of the implementation.


Containers for HPC. A look at Singularity.

Vanessa Sochat and I recently completed a little study of Singularity, the container system that brings the advantages of Docker-style encapsulation to HPC.  If you are interested, this is now a supplementary chapter to the book “Cloud Computing for Science and Engineering” that is being published by MIT Press.  You can see the chapter here: https://cloud4scieng.org/singularity-a-container-system-for-hpc-applications/  The chapter has a high level overview and a brief tutorial showing how to build a Singularity container for an MPI application and how to run it on a small cluster on AWS.

Manifold Learning and Deep Autoencoders in Science

One way to think about machine learning is to view it as building a model of a system based on samples of data that are artifacts of that system.  This view does not resonate very well when the problem at hand is identifying photos containing sail boats, but it is apt when ML is applied to scientific data.

The data from scientific experiments often takes the form of vectors in a very high dimensional space, and we are looking for an underlying organization of that data that reflects properties of our system.   Manifold learning is based on the assumption that the system you are trying to model generates data that lies on or near a lower dimensional surface in the higher dimensional coordinate space of the data.   Picture the surface of a sphere or a curve in 3-D.   If this manifold assumption about the data is true, it may be possible to “unfold” the surface so that a projection or other linear analysis makes the data easier to understand.

Autoencoders are deep neural networks that can be used to discover properties of the manifold of data that characterizes the system you are trying to model.  Illustrated below, autoencoders have an encoder phase and a decoder phase, with a space of much lower dimension than the input in the middle.  The network is trained to be (approximately) the identity function, and by studying the properties of the lower dimensional space in the middle we get a better view of the data manifold.

Figure: an autoencoder, with an encoder phase, a low dimensional middle layer, and a decoder phase.

Recently, a lovely blog article A look at deep learning for science by Prabhat gave us an excellent overview of some uses of deep learning technology in science applications, and several of these were applications of autoencoders.  Partly inspired by that article and by our own experiments with some very interesting images of neuron cells collected by Maryana Alegro at UCSF, we put together a little tutorial on autoencoders in science.   In the article we discuss two types of autoencoders: denoising and variational.  The variational autoencoder we examine is applied to the cell images in an effort to create a model that can be used both to classify samples and to generate new samples by following a path along the surface of the manifold.
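
To make the basic (non-variational) idea concrete, here is a minimal sketch of a dense autoencoder in Keras; the layer sizes and the Keras framework itself are our choices for illustration, not necessarily what was used in the tutorial.

#a minimal dense autoencoder, trained to reproduce its input
import numpy as np
from tensorflow.keras import layers, models

input_dim = 784     #e.g. flattened 28x28 images (assumed for illustration)
latent_dim = 16     #the low dimensional space in the middle of the network

encoder = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(input_dim,)),
    layers.Dense(latent_dim, activation='relu'),
])
decoder = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(latent_dim,)),
    layers.Dense(input_dim, activation='sigmoid'),
])

inputs = layers.Input(shape=(input_dim,))
autoencoder = models.Model(inputs, decoder(encoder(inputs)))
autoencoder.compile(optimizer='adam', loss='mse')

x = np.random.rand(1000, input_dim).astype('float32')   #placeholder data
autoencoder.fit(x, x, epochs=5, batch_size=64)           #the input is also the target

#the encoder alone gives the low dimensional view of the data manifold
z = encoder.predict(x)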

We have published this little study as a supplement to our chapter on machine learning in the book “Cloud Computing for Science and Engineering”.   The link to the article is here.