# A “Chatbot” for Scientific Research: Part 2 – AI, Knowledge Graphs and BERT.

## Abstract.

In 2018 I published a blog about building a cloud-resident “Research Assistant” (RA) chatbot that would be the companion of each scientist. The RA would be responsible for managing scientific data, notes and publication drafts. It could create intelligent summaries and search for important related scientific articles. That post demonstrated a simple prototype that accepted spoken English input and gave simple dialog responses to search for available, relevant research. But it did not address the important issues of data management and textual analysis required to make the RA real. In a short, invited “vision talk” I gave at the eScience 2019 conference I tried to address the technology that, in 2030, we would need to solve these problems. This article does not describe an implementation. Rather, it is a survey of the missing pieces I alluded to in the talk in terms of the current, related literature.

## Introduction

2017 was the year of the smart on-line bot and smart speaker. These are cloud-based services that use natural language interfaces for both input and output to query knowledge graphs and search the web. The smart speakers, equipped with microphones, listen for trigger phrases like “Hey Siri”, “OK Google” or “Alexa”, record a query in English, extract the intent and reply within a second. They can deliver weather reports, do web searches, keep your shopping list and keep track of your online shopping. The impact of this bot technology will hit scientific research when the AI software improves to the point that every scientist, graduate student and corporate executive has a personal cloud-based research assistant. Raj Reddy calls these Cognition Amplifiers and Guardian Angels. We call ours a research assistant.

Resembling a smart speaker or desktop/phone app, the research assistant is responsible for the following tasks:

1. Cataloging research data, papers and articles associated with its owner’s projects.
2. Monitoring the research literature for papers exploring the same concepts seen in the owner’s work.
3. Automatically sifting through open source archives like GitHub that may be of potential use in current projects.
4. Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

Understanding the implications of these four capabilities will be the central theme of this post.

In 2017 we published a short article about how we could build a chatbot for research. In that paper we presented a short overview of chatbot software circa 2017 and demonstrated a very simple toy meta-search engine that accepted spoken statements about research interests and responded with matching documents from Bing, Wikipedia and ArXiv. To illustrate this, consider the sentence “Research papers by Michael Eichmair about the gannon-lee singularity are of interest.” This required our Bot, called the Research Assistant, to understand that the main topic of the sentence was the gannon-lee singularity (an obscure reference to a paper from the 1970s that I happen to know about) and the fact that we want related papers by Michael Eichmair. The result obtained by our Bot is shown in Figure 1.

Figure 1.   The results (shortened) from our original 2017 Science Bot to the Eichmair question.

In 2019 the same results can be obtained by directly inserting this sentence into Google or Bing. We suspect one reason for this is the use of vastly improved language models based on Transformers (which we will briefly describe below). Not only is our bot obsolete; we will argue in this article that it completely misses the boat on what is needed to make something truly useful. This report will not present any new research results. Instead it will try to outline the types of tasks required to make the research assistant capable of demonstrating the capabilities listed above. We will also try to give a survey of the best published work leading in these directions. (This report is an expansion of a paper that was originally written for an invited “vision” talk entitled “eScience 2050: A Look Back” for the eScience 2019 conference held in San Diego, Sept. 2019.)

## Knowledge Graphs

If we look at the first two items in the RA capabilities list above, we see that they go well beyond simple meta-search. These tasks imply that the research assistant will need to keep an organized research archive of data, notes and papers and have the ability to extract knowledge from the literature. We will assume that the individual items the RA manages will be cloud-resident objects described by a searchable, heterogeneous database of metadata. One such database structure that can be used for this purpose is a Knowledge Graph (KG). KGs are graphs where the nodes are entities and the links between nodes are relations. Often these node-edge-node triples are represented using the Resource Description Framework (RDF), in which each triple consists of a subject, a relationship and an object. Each element of the triple has a unique identifier. The triple itself also has an identifier, so that it can in turn serve as the subject or object of other triples.
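To make the triple idea concrete, here is a minimal sketch of a triple store in plain Python. The entities, relations and identifiers are made up for illustration; real RDF stores use URIs for each element, but the structure is the same: every fact is a (subject, relation, object) tuple, and the tuple itself has an identifier so it can appear as the subject of another triple.

```python
# A toy triple store: each fact is a (subject, relation, object) tuple,
# keyed by its own identifier so a triple can itself be a subject or object.
triples = {}

def add_triple(tid, subject, relation, obj):
    triples[tid] = (subject, relation, obj)

add_triple("t1", "Hubble Telescope", "instance of", "space telescope")
add_triple("t2", "Hubble Telescope", "carries instrument", "Wide Field Camera 3")
# Using a triple as a subject: provenance about statement t2.
add_triple("t3", "t2", "stated in", "Wikidata")

# Surface everything linked to the "Hubble Telescope" node.
neighbors = [(r, o) for (s, r, o) in triples.values() if s == "Hubble Telescope"]
print(neighbors)
```

Walking from a node to its neighbors this way is exactly what lets a KG-backed search surface related items, as we will see below with Google’s Knowledge Graph.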

Having a KG that is based on scientific ontological terms and facts, and that can be augmented with the content of the individual scientist, would be the proper foundation for our RA. To help explain this we need to take a diversion into existing triple-store KGs to see if there is one we can build upon.

There are several dozen implementations of RDF triple stores and many are open source. In addition, there are a number of commercial products available, including

• Ontotext, which produces GraphDB, a commercial RDF knowledge graph used by customers in publishing (BBC and Elsevier), pharmaceuticals (AstraZeneca) and libraries (Mellon-funded projects for the British Museum and the US National Gallery of Art).
• Grakn Labs in the UK has a knowledge graph, Grakn, with specialized versions such as BioGrakn for life-science apps.
• Cambridge Semantics has a product called AnzoGraphDB which has numerous customers in the pharmaceutical domain.
• And, of course, Oracle has a version of its database called “Spatial and Graph” that supports very large triple stores.

If you search for Knowledge Graph on the web or in Wikipedia you will learn that the best-known KG is the one introduced by Google in 2012, known simply as “Knowledge Graph”. In fact, it is very large (over 70 billion nodes) and is consulted in a large fraction of searches. Having the KG available means that a search can quickly surface many related items by looking at nearby nodes linked to the target of the search. This is illustrated in Figure 2 for the result of a search for “differential equation”, which displays an information panel to the right of the search results.

Figure 2. Google information panel that appears on the right side of the page. In this case the search was for “differential equation”. (This image is shortened as indicated by …).

Google’s Knowledge Graph is not as good for science topics as the example in Figure 2 suggests. In fact, it is extremely good with pop culture, but for science applications like our RA, Google’s KG often just takes information from Wikipedia. In its earliest form Google’s KG was based on another KG known as Freebase. In 2014 Google began the process of shutting down Freebase and moving content to a KG associated with Wikipedia called Wikidata. However, the Freebase archive is still on-line and has some excellent science content.

Launched in 2012 with grants from the Allen Institute, Google and the Gordon and Betty Moore Foundation, Wikidata is used in 58.4% of all English Wikipedia articles. Items in Wikidata each have an identifier (the letter Q and a number), a brief description and a list of alias names. (For example, the item for Earth (Q2) has alternative names: Blue Planet, Terra Mater, Terra, Planet Earth, Tellus, Sol III, Gaia, The world, Globe, The Blue Gem, and more.) Each item has a list of affiliated “statements”, which are the “subject-relation-object” triples that are the heart of the KG. Relations are predicates and are identified with a P and a number. For example, Earth is an “instance of” (P31) “inner planet” (Q3504248). Figure 3 shows an illustration of the item “Hubble Telescope”. There are currently 68 million items in Wikidata and, like Wikipedia, it can be edited by anyone.

Having a KG is not very useful unless you have a way to search it. In the case of Wikidata (and other RDF KGs) the language for making queries is called SPARQL. Combined with Wikidata, SPARQL queries are a very powerful way to search the KG. To give a trivial example of what a SPARQL query looks like, let’s search for all the scientific instruments carried on the Hubble Telescope.

Figure 3.   Wikidata object Q2513, Hubble Telescope. This is an edited version of the full entry which has dozens of property statements.

To write the query we need to know that Hubble has id wd:Q2513 and that the predicate “carries scientific instrument” is wdt:P1202. The query and results are shown below. To read the query, note there are two unbound variables, ?Inst and ?InstLabel. The only significant part of the request is a match for tuples of the form (Hubble Telescope, carries scientific instrument, ?Inst).

SELECT ?Inst ?InstLabel WHERE {
  wd:Q2513 wdt:P1202 ?Inst.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 100

The table below shows the output.

This example does not do justice to the power of the search capabilities.   A look at the example in the Wikidata Query Service will illustrate that point.
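For readers who want to script such queries, here is a sketch of how the Hubble query above could be parameterized from Python. The helper function name is my own invention; the commented-out lines show how the query string could be sent to the public Wikidata endpoint, which is left disabled so the sketch runs offline.

```python
# Build the SPARQL query above for any (entity, property) pair so the same
# pattern can be reused, e.g., for other telescopes or other predicates.

def carries_instrument_query(entity_id: str, property_id: str = "P1202") -> str:
    return f"""SELECT ?Inst ?InstLabel WHERE {{
  wd:{entity_id} wdt:{property_id} ?Inst.
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
}} LIMIT 100"""

query = carries_instrument_query("Q2513")   # Q2513 = Hubble Telescope
print(query)

# To actually run it against https://query.wikidata.org/sparql :
# import requests
# r = requests.get("https://query.wikidata.org/sparql",
#                  params={"query": query, "format": "json"})
# for row in r.json()["results"]["bindings"]:
#     print(row["InstLabel"]["value"])
```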

One of the more impressive KGs for science is the Springer Nature SciGraph, which has over 2 billion triples related to scientific topics. While it contains the full Springer content, it goes well beyond that to include patents and grant awards. Zhang et al. [zhang] have demonstrated the use of knowledge graphs for recommendations in the NASA Science Knowledge Graph (SKG).

Building specialized KGs for science domains has been going on for a while. In 2009, the Hanalyzer (short for high-throughput analyzer) system used natural language processing to automatically extract a semantic network from all PubMed papers relevant to a specific scientist.

### Where’s the Data?

This brings us to the question: is Wikidata a place to store experimental data sets? The usual approach to data set description is via Schema.org. However, recent work by one of the Wikidata founders, Denny Vrandecic, presented at the workshop Advanced Knowledge Technologies for Science in a FAIR World (AKTS) under the title Describing datasets in Wikidata, showed how this can be done when schema.org may not be sufficient. At the same workshop, Daniel Garijo and Pedro Szekely described a way they extended Wikidata to support external collections in a presentation entitled WDPlus: Leveraging Wikidata to Link and Extend Tabular Data. We shall argue below that this is an important possible component of the research assistant.

### The Semantic Scholar Literature Graph

There is a very different approach to the problem of storing information about research papers than Wikidata’s. The Allen Institute for Artificial Intelligence (AI2) has built Semantic Scholar, a graph of the scientific literature whose structure is tightly focused on research papers, their authors and the concepts that link them together. More specifically, the Semantic Scholar Literature Graph, as described by Waleed Ammar et al., has the following node types:

• Author – a person record
• Paper – a paper has a title, venue, year, etc.
• Entities – unique scientific concepts like “deep learning” or “natural language processing”.
• Mentions – references to entities from text

The nodes are linked by edges including author-to-paper, paper-to-paper citations, and mention edges that tie references in the text to entities. Additional edges link mentions that occur in the same sentence, and entity-to-entity edges link entities that are somehow related. Figure 4 illustrates the graph.

Figure 4.  An approximate subgraph of the Literature Graph for a paper in Semantic Scholar.

Many of the entity nodes are associated with items in Wikimedia.

Another related project from AI2 is the GrapAL [grapAL] query system for the knowledge graph. The query system can be accessed on-line. The types of queries that can be expressed are powerful: for example, finding the papers that mention certain pairs of entities, or all authors of papers that mention a certain entity. We shall return to this capability below.

## Building the Research Assistant

If we consider the first of the tasks on our requirements list for the RA’s functionality,

1. Cataloging research data, papers and articles associated with its owner’s projects

we see that this challenge may be well met by having the RA possess a copy of Wikidata together with the extensions described by Denny Vrandecic discussed above. If not, then Garijo and Szekely’s WDPlus Wikidata extension may be a perfect solution.

Turning now to the second task:

2. Monitoring the research literature for papers exploring the same concepts seen in the owner’s work

we see the nature of the challenge is very different, but progress has been made on this task. Xiaoyi et al. have shown it is possible to use a combination of neural networks and rule-based reasoning to identify semantic entities and even implicitly cited datasets in earth science papers.

Given a set of research notes, grant proposals or draft research papers, we need a way to identify the concepts in the user’s documents and then insert them into a version of the Semantic Scholar Literature Graph. To do this we can use a language model to scan the documents looking for interesting literature terms. The state of the art for language parsing has made great strides over the last few years, and we will look at one model, Bidirectional Encoder Representations from Transformers (BERT).

### Using BERT to extract knowledge from documents

Most older language analysis models were built from deep LSTM networks (which we discussed in our book on cloud computing). These models were unidirectional in that they processed text from right to left or left to right in order to train the network. Devlin et al. published the BERT paper in 2018 and revised it in 2019. BERT is unique in several respects. First, it is designed so that it can be “pre-trained” on plain text to build a special encoder. Then, for various language analysis tasks, such as question answering, paraphrasing and language inference, an additional layer is added so that the encoder plus the new layer can be tuned to address the task. (This is a type of transfer learning we have described before.) What makes this possible is the way BERT uses an encoder that captures a whole sentence at a time. The training is done by masking out a small number of words (15%) in the input and then using a loss function that measures how well the network predicts the correct masked words. Figure 5 below illustrates this. The core of the encoder is based on transformers, which have been shown to be powerful ways to capture context. (See the harvardNLP Annotated Transformer for a detailed walk-through of building transformers.)

Figure 5. BERT training of the encoder based on masking random words for the loss function. This figure is taken from “BERT – State of the Art Language Model for NLP” by Rani Horev in Lyrn.
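The masking step itself is simple, and a toy version may make the training objective clearer. This sketch only builds the masked input and records the prediction targets; the transformer encoder and the loss computation over those targets are, of course, the hard part that BERT supplies.

```python
import random

random.seed(0)  # for reproducibility of the illustration

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Replace ~mask_rate of tokens with [MASK]; return masked input and
    a dict of {position: original token} that the loss would score."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "multicore processors allow multiple threads of execution to run in parallel".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```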

Another good blog explaining BERT and the transformers is by Ranko Mosic. The AllenNLP group has an excellent demo using the masked language model and this is illustrated in Figure 6. This shows the result of using a sentence “Multicore processors allow multiple threads of execution to run in parallel on the various cores.” with processors, execution and parallel masked. You can note that it did a rather good job (code is a reasonable substitute for execution here.)

Figure 6. The AI2 masked language model demo using the sentence “Multicore processors allow multiple threads of execution to run in parallel on the various cores.” with processors, execution and parallel masked.

Another application of a BERT-based language model is semantic role labeling. This is good for analyzing sentences and identifying a subject, verb and object. For our purposes this is important. We would like to extract from the scientist’s documents key scientific terms and the implied relations between them. With these we can query the literature graph for matches, or we can use them to extend the scientist’s private version of the literature graph or knowledge graph.

For example, a researcher working on papers related to programming of parallel computers may have key phrases that include multicore programming, data parallel, multithreaded programs, synchronization, map reduce, BSP, etc. The types of triples we may discover by mining the documents may include

(map reduce, used in, data parallel programming)

(multicore processors, speedup, multithreaded execution)

(synchronization problems, encountered in, multithreaded programs)

(locking mechanisms, solve, synchronization problems)

(bulk synchronous parallel, solve, synchronization problems)

(BSP, alias, bulk synchronous parallel)

(map reduce, type of, parallel algorithm)

The first and third elements of the triples correspond to entities that are associated with mentions in the document. The verbs are potential labels for entity-entity edges in the graph.
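To illustrate how such mined triples could extend a private literature graph, here is a small sketch: entities become nodes, verbs become edge labels, and we can then ask GrapAL-style questions such as “which entities are connected to X?”. The data is just the triples listed above.

```python
from collections import defaultdict

# Mined (entity, verb, entity) triples from the example above.
triples = [
    ("map reduce", "used in", "data parallel programming"),
    ("synchronization problems", "encountered in", "multithreaded programs"),
    ("locking mechanisms", "solve", "synchronization problems"),
    ("bulk synchronous parallel", "solve", "synchronization problems"),
    ("BSP", "alias", "bulk synchronous parallel"),
    ("map reduce", "type of", "parallel algorithm"),
]

# Build an adjacency list; keep inverse edges so links are navigable both ways.
edges = defaultdict(list)
for subj, rel, obj in triples:
    edges[subj].append((rel, obj))
    edges[obj].append((rel + " (inverse)", subj))

# Which entities are connected to "synchronization problems"?
related = sorted(obj for rel, obj in edges["synchronization problems"])
print(related)
```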

To demonstrate the capability of the AI2 implementation of semantic role labeling, we downloaded the language model and used it in a Jupyter notebook. We tested it with a sentence related to general relativity:

A gravitational singularity is a place where gravity has caused an event horizon which created a black hole in space, but a naked singularity is a gravitational singularity that does not have an event horizon and, therefore naked singularities do not create a black hole.

Running this through the “predictor” function of the bert-base-srl-2019.06.17 model gives the output in Figure 7.

Figure 7. Output of the “predictor” function of the bert-base-srl-2019.06.17 AI2 model. The code to download the model for Python is given in the document associated with the demo. The function pullTripples is a post-processor which removes annotations not essential for this illustration and formats the output.

As can be seen in the figure, the model identified the key noun phrases (naked singularity, gravitational singularity, event horizon, black hole and gravity) as well as a sequence of reasonable triples. It should be possible to use the GrapAL query system to find associated entities in the literature graph. Indeed, a simple search in Semantic Scholar for these terms will find dozens of related papers. From these results, one can build a personal literature graph for each of the owner’s documents with links to the discovered material.

## The Really Hard Problems

The final two requirements for the research assistant pose some really hard problems.

3. Automatically sifting through open source archives like GitHub that may be of potential use in current projects.
4. Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

GitHub already has a very good search interface that can be used to discover resources related to specific general topics. For example, searching for “multicore programming” retrieves an excellent collection of archives that address the relevant topics of parallelism and synchronization.

The GitHub machine learning group (yes, every organization these days has an ML or AI group) has done some nice work on using LSTM networks to translate English language text such as “Read csv file into Pandas dataframe” into the corresponding Python code. This is done by building a good vector space embedding of the English statement and a trained LSTM that creates English summaries of code fragments. By associating the summaries with the original English question, they can map the question to the associated code. The GitHub team is also collaborating with Microsoft Research Cambridge, where a team is working on Program Understanding. While all of this is still very early work, it appears to be very promising.
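The retrieval step in that pipeline can be sketched independently of the learned embeddings: embed the English query and each code summary in a common vector space, then return the code whose summary is the nearest neighbor. As a stand-in for the LSTM embedding, this sketch uses a crude bag-of-words vector; the (summary, code) pairs are invented for illustration, but the cosine nearest-neighbor retrieval is the same idea.

```python
import math
from collections import Counter

def embed(text):
    # Crude stand-in for a learned sentence embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical (English summary -> code fragment) pairs mined from a repository.
corpus = {
    "read csv file into pandas dataframe": "df = pd.read_csv(path)",
    "plot a histogram of a column": "df['x'].hist()",
    "merge two dataframes on a key": "pd.merge(a, b, on='key')",
}

query = embed("read a csv file into a dataframe")
best = max(corpus, key=lambda s: cosine(query, embed(s)))
print(corpus[best])   # the retrieved code fragment
```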

### Automatically “understanding” mathematical analysis

The fourth property in our RA list reaches way beyond current capabilities. The work from the GitHub team described above can make translating English program requirements into code very similar to natural language translation, but anything involving “understanding” is, for now, out of reach. However, there have been some interesting early attempts to bridge language and models of scientific theory. Eureqa (now part of DataRobot) does automatic AI-based time series analysis, and DataRobot is also a tool for automatically building ML models given only data. Michael Schmidt and Hod Lipson consider the problem of deriving theoretical mathematical models directly from experimental data (see Figure 8).

Figure 8. From Michael Schmidt and Hod Lipson, Distilling Free-Form Natural Laws from Experimental Data (Science, vol. 324, 3 April 2009).

Automatic theorem checking research has been going on for years, but these systems require formal statements of the theorem to be checked and are usually designed for human-machine collaboration. If it were possible to create a system that could take a journal paper and automatically extract a formal expression of the mathematical content in a form that a checker could input, then we would be getting close to the goal.

The most impressive work on using advanced deep learning technology to “comprehend” scientific text comes again from the AI2 team. Their system Aristo is “an intelligent system that reads, learns, and reasons about science”. Aristo recently got an “A” on the N.Y. Regents 8th grade science exams. This exam consists of multiple-choice questions such as the following:

Which object in our solar system reflects light and is a satellite that orbits around one planet? (A) Moon, (B) Earth, (C) Mercury, (D) Sun.

Aristo works by bringing a number of component solvers to bear on the problem. Information retrieval and statistics form an important layer. Pointwise mutual information is used to measure the likelihood of each question-answer pair against information retrieved from the text corpus. A quantitative reasoning solver is used to address questions that involve basic quantitative statements. As shown in Figure 9, a tuple inference solver builds graphs that connect tuples from a scientific knowledge base to the terms in the question and the answers. Dalvi, Tandon and Clark have constructed an excellent knowledge base of science-related triples called the Aristo Tuple KB.

The graph with the most promising connection to one of the answers is the winner.

Figure 9. From Clark, et.al, From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project. https://allenai.org/content/docs/Aristo_Milestone.pdf Aristo Tuple Inference Solver. Key terms in the question and answer candidates are linked to triples from the knowledge base.
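The pointwise mutual information scoring mentioned above is easy to illustrate. PMI(q, a) = log( p(q, a) / (p(q) p(a)) ), where the probabilities are estimated from co-occurrence counts in a text corpus. The tiny corpus below is made up for illustration (Aristo uses a large retrieved corpus and much more careful term handling), but the scoring is the same idea: answers that co-occur with the question terms more often than chance score higher.

```python
import math

# A made-up mini-corpus; each string plays the role of one retrieved sentence.
corpus = [
    "the moon is a satellite that orbits the earth and reflects sunlight",
    "the sun emits light",
    "mercury orbits the sun",
    "the moon reflects light from the sun",
]

def pmi(term_a, term_b):
    """PMI estimated from within-sentence co-occurrence counts."""
    n = len(corpus)
    pa = sum(term_a in s.split() for s in corpus) / n
    pb = sum(term_b in s.split() for s in corpus) / n
    pab = sum(term_a in s.split() and term_b in s.split() for s in corpus) / n
    return math.log(pab / (pa * pb)) if pab > 0 else float("-inf")

# Score candidate answers against the question term "orbits".
scores = {ans: pmi("orbits", ans) for ans in ["moon", "earth", "mercury", "sun"]}
print(scores)
```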

While the Aristo work is a real milestone, it has far to go. In particular, it does not yet have the ability to bring the technical diagrams and equations in the text into its deductive (or abductive) analysis. I expect AI2 is working on this now. The bigger challenge, being able to classify documents by the content of the mathematical arguments used, is very hard when reasoning is spread over many pages. There is some interesting automatic document summarization work, but it is not up to this challenge.

## Final Thoughts

This discussion is far too incomplete to warrant a “conclusions” section. The ability of the research assistant to take an idea and run with it is central to what we need. The idea may be a theory expressed in a draft technical paper or research proposal. Finding all the related publications is certainly a start, but first the RA must be able to abstract the important original ideas and not just the keywords and phrases. It may be that the key idea is more of a metaphor for a larger truth that manifests itself in research in various disciplines. But this is probably more than any RA can grasp.

There is going to be amazing progress over the next 30 years.   This is obvious when one looks at the state of computing 30 years ago.   Much of what we have today was then only a dream.

This post contains a look at many active research projects, and I am sure I am missing some very important ones.   Please contact me if I have mischaracterized any of them or if I have missed something really important.

## References

Most of the citations to the literature in this blog are linked in-line. Here are two for which I found it easier to provide explicit references.

[grapAL] Christine Betts, Joanna Power, Waleed Ammar, “GrapAL: Connecting the Dots in Scientific Literature”, arXiv:1902.05170v2 [cs.DB], 19 May 2019.

[zhang] Jia Zhang, Maryam Pourreza, Rahul Ramachandran, Tsengdar J. Lee, Patrick Gatlin, Manil Maskey, and Amanda Marie Weigel, “Facilitating Data-Centric Recommendation in Knowledge Graph”, in Proceedings of the 4th IEEE International Conference on Collaboration and Internet Computing (CIC), Philadelphia, PA, USA, Oct. 18-20, 2018, pp. 207-216.

# A Very Gentle Introduction to Probabilistic Programming Languages

## Abstract.

Probabilistic programming languages (PPLs) allow us to model the observed behavior of probabilistic systems in terms of their underlying latent variables. Using these models, the PPL provides tools to make inferences concerning the latent variables that give rise to specific observed behaviors. In this short report, we look at two such programming languages: Gen, a language based on Julia from a team at MIT, and PyProb, which is based on Python and Torch from the Probabilistic Programming Group at the University of Oxford. These are not the only PPLs, nor are they the first, but they illustrate the concepts nicely and they are easy to describe. To fully understand the concepts behind these systems requires a deep mathematical exploration of Bayesian statistics, and we won’t go there in this report. We will use a bit of math, but the beauty of these languages is that you can get results with only a light overview of the concepts.

## Introduction

In science we build theories that tell us how nature works.   We then construct experiments that allow us to test our theories.   Often the information we want to learn from the experiments is not directly observable from the results and we must infer it from what we measure.    For example, consider the problem of inferring the masses of subatomic particles based on the results of collider experiments,   or inferring the distribution of dark matter from the gravitational lensing effects on nearby galaxies, or finding share values that optimize financial portfolios subject to market risks, or unravelling complex models of gene expression that manifest as disease.

Often our theoretical models lead us to build simulation systems which generate values we can compare to the experimental observations. The simulation systems are often programs that draw possible values for unknowns, call them x, from random number generators, and these simulations use those values to generate outcomes y. In other words, given values for x, our simulation is a “generative” function F which produces values y = F(x). If our experiments give us values y’, we can think of the inference task as solving the inverse problem x = F⁻¹(y’), i.e. finding values for the hidden variables x that give rise to the observed outcomes y’. The more mathematical way to say this is that our simulation gives us a probability distribution of values of y given the distribution associated with the random draws for x, which we write as p(y | x). What we are interested in is the “posterior” probability p(x | y’), which is the distribution of x given the evidence y’. In other words, we want samples for values of x that generate values close to our experimental values y’. These probabilities are related by Bayes Theorem as

p(x | y’) = p(y’ | x) p(x) / p(y’)

Without going into more of the probability theory associated with this equation, suffice it to say that the right-hand side can be very difficult to compute when F is associated with a simulation code. To get a feel for how we can approach this problem, consider the function F defined by our program as a generative process: each time we run the program it makes a series of decisions based on the random x values it draws and then generates a value for y. What we will do is methodically trace the program, logging the values of x and the resulting ys. To get a good feel for the behavior of the program, we will do this a million times.
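The crudest version of this “run it many times” idea is rejection sampling: draw x from its prior, run the simulator, and keep only the draws whose output lands near the observed y’. The kept x values then approximate the posterior p(x | y’). The simulator F below is a made-up stand-in (the latent x plus Gaussian observation noise); real PPLs use far smarter search than this, as discussed below.

```python
import random

random.seed(42)

def F(x):
    # Stand-in simulator: output is the latent x plus observation noise.
    return x + random.gauss(0.0, 0.5)

y_prime = 3.0          # the observed value
kept = []
for _ in range(100_000):
    x = random.uniform(0.0, 10.0)     # prior draw for the latent variable
    if abs(F(x) - y_prime) < 0.1:     # keep runs whose output is near y'
        kept.append(x)

posterior_mean = sum(kept) / len(kept)
print(len(kept), round(posterior_mean, 2))
```

With enough accepted samples the mean of `kept` concentrates near the true latent value, but the acceptance rate collapses as the model grows, which is why practical inference needs the trace-based algorithms described next.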

Begin by labeling each point in the program where a random value is drawn. Suppose we now trace the flow of the program so that each time a new random value is drawn we record the program point and the value drawn. As shown in Figure 1, we define a trace of the program to be the sequence [(a1, x1), (a2, x2), …, (an, xn), y] of program address points and random values we encounter along the way.

Figure 1. Illustration of tracing random number draws from a simulation program. A trace is composed of a list of address, value tuples in the order they are encountered. ( If there are loops in the program we add an instance count to the tuple.)

If we could trace all the paths through the program and compute the probabilities of their traversal, we could begin to approximate the joint distribution p(x, y) = p(y | x) p(x), but given that the x’s are drawn from continuous distributions this may be computationally infeasible. If we want to find those traces that lead to values of y near to y’, we need to use search algorithms that allow us to modify the x’s to construct the right traces. We will say a bit more about these algorithms later, but this is enough to introduce some of the programming language ideas.
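The trace machinery itself can be sketched in a few lines of Python. Every random draw goes through a wrapper that records an (address, value) pair, so one run of the program yields exactly the [(a1, x1), (a2, x2), …, y] sequence described above. The model here is an invented two-draw example, not any particular PPL’s API.

```python
import random

random.seed(0)

def traced_uniform(trace, address, lo, hi):
    """Draw a uniform value and record (address, value) in the trace."""
    value = random.uniform(lo, hi)
    trace.append((address, value))
    return value

def model(trace):
    # Two labeled random draws, then a deterministic output y.
    slope = traced_uniform(trace, "a1", -1.0, 1.0)
    intercept = traced_uniform(trace, "a2", -5.0, 5.0)
    return slope * 10.0 + intercept

trace = []
y = model(trace)
print(trace, y)
```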

To illustrate our two probabilistic programming languages, we will use an example from the book “Bayesian Methods for Hackers” by Cameron Davidson-Pilon. (There are some excellent on-line resources for the book. These include Jupyter notebooks for each chapter that have been done with two other PPLs: PyMC3 and Tensorflow Probability.) The example comes from chapter 1. It concerns the logs of text messages from a user. More specifically, it is the number of text messages sent per day over a period of 74 days. Figure 2 shows a bar graph of the daily message traffic. Looking at the data, Davidson-Pilon made a conjecture that the traffic changes in some way at some point, so that the second half of the time period has a qualitative difference from the first half. Data like this is usually Poisson distributed. If so, there is an average event rate λ such that the probability of k events in a single time slot is given by

p(k | λ) = λ^k e^(−λ) / k!

If there really are two separate distributions, then let us say the event rate is λ1 for the first half and λ2 for the second half, with a day τ such that for all days before that date the first rate applies and the second rate applies after that. (This is all very well explained in the Davidson-Pilon book and you should look at the solution there that uses the PPL PyMC3. The solutions here are modeled on that one.)

Figure 2. From Chapter 1 of “Bayesian Methods for Hackers” by Cameron Davidson-Pilon.
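Before looking at the Gen version, it may help to see the generative side of this switchpoint model as plain Python: messages on day i are Poisson(λ1) before day τ and Poisson(λ2) after. The rates and τ below are made up for illustration; inference, of course, runs the other way, recovering them from the observed counts.

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's algorithm for Poisson sampling (fine for small lam).
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

lambda1, lambda2, tau = 18.0, 24.0, 45   # hypothetical latent values
ys = [poisson(lambda1 if day < tau else lambda2) for day in range(1, 75)]

first_half = sum(ys[:tau - 1]) / (tau - 1)      # mean rate before the switch
second_half = sum(ys[tau - 1:]) / (74 - tau + 1)  # mean rate after the switch
print(round(first_half, 1), round(second_half, 1))
```

Running this produces 74 daily counts with a visible jump in the average rate at day τ, which is exactly the qualitative change conjectured from Figure 2.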

## Gen

Gen is a language built on top of Julia by Marco Cusumano-Towner, Feras A. Saad, Alexander K. Lew and Vikash K. Mansinghka at MIT and described in their recent POPL paper [1]. In addition, they have a complete on-line resource where you can download the package and read tutorials.

We gave a brief introduction to Julia in a previous article, but it should not be hard to understand the following even if you have never used Julia.   To cast this computation into Gen we need to build a model that captures the discussion above.   Shown below we call it myModel.

The first thing you notice about this code is the pair of special annotations @gen and @trace. These tell the Gen system that this is a generative model and that it will be compiled so that we can gather the execution traces discussed above. We explicitly identify the random variables we want traced with the @trace annotation. The argument to the function is a vector xs of time intervals from 1 to 74. We create it when we read the data from Figure 2 from a csv file (which is shown in detail in the full Jupyter notebook for this example). Specifically, xs = [1.0, 2.0, 3.0 …, 74.0] and we set a vector ys so that ys[i] is the number of text messages on day i.

If our model is driven by a Poisson process generating the y values, then the math says we should assume that the time interval between events is exponentially distributed. Gen does not have an exponential distribution function, but it does have a Gamma distribution, and gamma(1, alpha) = exponential(1.0/alpha). The statement

lambda1 = @trace(gamma(1, alpha), :lambda1)

tells Gen to pull lambda1 values from the exponential with mean alpha, and we have initialized alpha to be the mean of the ys values (which we had previously computed to be 19.74…). Finally, note we have used Julia’s symbol notation :variable-name to label this draw :lambda1. This is effectively the address in the code of the random number draw that we will use to build traces.
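The gamma(1, alpha) = exponential-with-mean-alpha identity is easy to check numerically with Python’s standard library (a quick sanity check, not Gen code; Python’s gammavariate takes shape and scale):

```python
import random

alpha = 19.74
rng = random.Random(7)
n = 50000
# gamma with shape 1 and scale alpha ...
g_mean = sum(rng.gammavariate(1.0, alpha) for _ in range(n)) / n
# ... matches an exponential with rate 1/alpha, i.e. mean alpha
e_mean = sum(rng.expovariate(1.0 / alpha) for _ in range(n)) / n
```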

We draw tau from a uniform distribution (and trace and label it) and then for each x[i] <= tau we assign the variable lambda to lambda1 and for each x[i] > tau we assign lambda to lambda2.   We use that value of lambda to draw a variable from the Poisson distribution and label that draw with a string “y-i”.

We can now generate full traces of our model using the Gen function simulate() and pull values from the traces with the get_choices() function as shown below.

The values for the random variable draws are from our unconstrained model, i.e. they reflect the joint probability p(x,y) and not the desired posterior probability p(x | y’) that we seek. To reach that goal we need to run our model so that we can constrain the y values to y’ and search for those traces that lead the model in that direction. For that we will use a variation of a Markov Chain Monte Carlo (MCMC) method called Metropolis-Hastings (MH). There is a great deal of on-line literature about MH so we won’t go into it here, but the basic idea is simple. We start with a trace and then make some random modifications to the variable draws. If those modifications improve the result, we keep them; if not, we reject them and try again. This is a great oversimplification, but Gen and the other PPLs provide library functions that allow us to easily use MH (and other methods). The code below shows how we can invoke this to make inferences.
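The flavor of the MH loop can be conveyed by a self-contained Python sketch of the same changepoint model (our own simplified version, not the Gen library’s implementation: exponential priors with mean alpha on the two rates, a uniform prior on tau, and a naive random-walk proposal):

```python
import math, random

def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_joint(l1, l2, tau, ys, alpha):
    """Unnormalized log p(l1, l2, tau, ys): exponential(mean alpha) priors
    on the rates, uniform prior on tau, Poisson likelihood for the counts."""
    lp = -(l1 + l2) / alpha
    for day, y in enumerate(ys, start=1):
        lp += log_poisson(y, l1 if day <= tau else l2)
    return lp

def metropolis_hastings(ys, iters, seed=0):
    rng = random.Random(seed)
    alpha = sum(ys) / len(ys)
    l1 = l2 = alpha
    tau = len(ys) // 2
    lp = log_joint(l1, l2, tau, ys, alpha)
    for _ in range(iters):
        # propose random modifications of the current draws
        p1 = abs(l1 + rng.gauss(0.0, 1.0)) or 1e-6
        p2 = abs(l2 + rng.gauss(0.0, 1.0)) or 1e-6
        pt = min(len(ys) - 1, max(1, tau + rng.choice([-1, 1])))
        plp = log_joint(p1, p2, pt, ys, alpha)
        # accept with probability min(1, p(proposal)/p(current))
        if math.log(rng.random()) < plp - lp:
            l1, l2, tau, lp = p1, p2, pt, plp
    return l1, l2, tau
```

On real data this walk needs many iterations and some tuning of the proposal widths; on sharply separated synthetic data it locates the changepoint quickly.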

The inference program creates a map from the labels for the y values to the actual constraints from our data.   It then generates an initial trace and iteratively applies the MH algorithm to improve the trace. It then returns the choices for our three variables from the final trace. Running the algorithm for a large number of iterations yields the result below.

This result is just one sample from the posterior probabilities.   If we repeat this 100 times we can get a good look at the distribution of values.   Histograms of these values are shown below in Figure 3.

Figure 3.   Histograms of the tau and lambda values.  While difficult to read, the values are clustered near 44, 18, 24 respectively.

If we compare these results to the Davidson-Pilon book results, which used the PyMC3 (and Tensorflow Probability) PPLs, we see they are almost identical with the exception of the values of tau near 70 and 5. We expect these extreme values represent traces where the original hypothesis of two separate lambdas was not well supported.

There is a great deal about Gen we have not covered here including combinators which allow us to compose generative function models together.   In addition, we have not used one of the important inference mechanisms called importance sampling.   We shall return to that topic later.

## PyProb

Tuan Anh Le, Atılım Güneş Baydin and Frank Wood first published an article about PyProb in 2017 [3], and another very important paper was released in 2019 entitled “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale” [4], which we will describe in greater detail later. PyProb is built on top of the deep learning toolkit PyTorch, which was developed and released by Facebook research.

Many concepts of PyProb are very similar to Gen, but PyProb is Python based so it looks a bit different. Our generative model in this case is an instance of a Python class as shown below. The main requirement is that the class subclass Model and have a method called forward() that describes how to generate our traces. Instead of the trace annotation used in Gen, we use the PyProb sample and observe functions. The random variables in PyProb are all Torch tensors, so we need to apply the method numpy() to extract values. The functions Normal, Exponential and Uniform are all imported from PyProb. Other than that, our generator looks identical to the Gen example.

Also note we have used the names mu1 and mu2 instead of lambda1 and lambda2 (for no good reason). Running the MH algorithm on this model is almost identical to doing it in Gen.

Again, this is just a sample from the posterior. You will notice that the posterior result function also tells us what percent of the traces were accepted by the MH algorithm. PyProb has its own histogram methods and the results are shown in Figure 4 below. The legend in the figure is difficult to read: it shows that the tau values cluster near 44, with a few traces falling between 5 and 10. The mu1 values are near 17 and the mu2 values are near 23. In other words, this agrees with our Gen results and the PyMC3 results in the Davidson-Pilon book.

Figure 4. Histogram of tau, mu1 and mu2 values.

### Building a PyProb Inference LSTM network.

There are several additional features of PyProb that are worth describing. While several of these are also part of Gen, they seem to be better developed in PyProb. More specifically, PyProb has been designed so that our generative model can be derived from an existing scientific simulation code, and it has an additional inference method, called Inference Compilation, in which a deep recurrent neural network is constructed and trained so that it can give us a very good approximation of our posterior distribution. In fact, the neural network is a Long Short-Term Memory (LSTM) network that is trained using traces from our model or simulation code. The training, which can take a long time, produces a “distribution” q(x | y) that approximates our desired p(x | y). More of the details are given in the paper “Inference Compilation and Universal Probabilistic Programming” by Anh Le, Güneş Baydin and Wood [3]. Once trained, as sketched in Figure 5, when the network is fed our target constraints y’ and trace addresses, the network will generate the sequence of components needed to make q(x | y = y’).

Figure 5. Recurrent NNet compiled and trained from model. (see [3, 4])

Building and training the network is almost automatic. We had one problem: the compiler does not support the exponential distribution yet, so we replaced it with a normal distribution. Creating and training the RNN took one function call, as shown below.

Once trained (which took a long time), using it was also easy. In this case we use the importance sampling algorithm which is described in reference [3] and elsewhere.
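Since importance sampling does the heavy lifting here, a minimal self-normalized version on a toy problem may help (plain Python, with the prior as the proposal rather than a trained network; for x ~ N(0,1) and y|x ~ N(x,1), the exact posterior given y is N(y/2, 1/2)):

```python
import math, random

def importance_posterior_mean(y_obs, n=200000, seed=0):
    """Estimate E[x | y = y_obs] for x ~ N(0,1), y|x ~ N(x,1).
    Draws come from the proposal (here the prior); each draw is weighted
    by its likelihood, and the posterior mean is the weighted average."""
    rng = random.Random(seed)
    total_w = total_wx = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                # draw from the proposal
        w = math.exp(-0.5 * (y_obs - x) ** 2)  # unnormalized likelihood weight
        total_w += w
        total_wx += w * x
    return total_wx / total_w

est = importance_posterior_mean(2.0)  # exact posterior mean is 2.0 / 2 = 1.0
```

Inference compilation replaces the prior with a learned proposal q(x | y) that puts its samples where the posterior mass is, which is why far fewer draws are wasted.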

Figure 6 illustrates the histograms of values drawn from the posterior.

Figure 6.   Using the trained network with our data. As can be seen, the variance of the results is very small compared to the MH algorithm.

The fact that the training and evaluation took so much longer with our trivial example is not important, because of the scalability of importance sampling using the compiled LSTM network. In the excellent paper “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale” [4], Güneş Baydin et al. describe the use of PyProb with a very large simulation code that models LHC experiments involving the decay of the tau lepton. They used 1024 nodes of the Cori supercomputer at LBNL to train and run their IC system. To do this required using PyProb’s ability to link a PyProb model to a C++ program. Using the IC LSTM network, they were able to achieve a speed-up of over 200x over a baseline MCMC algorithm. The paper describes the details of the implementation and testing.

## Conclusion

The goal of this paper was to introduce the basic ideas behind Probabilistic Programming Languages by way of two relatively new PPLs, Gen and PyProb. The example we used was trivial, but it illustrated the concepts and showed how the basic ideas are expressed (in very similar terms) in both languages. Both languages are relatively new and their implementations are not yet fully mature. However, we are certain that probabilistic programming will become a standard tool of data science in the future. We have put the source Jupyter Notebooks for both examples on GitHub. Follow the installation notes for Gen and PyProb on their respective webpages and these should work fine.

https://github.com/dbgannon/probablistic-programming

The traditional way computer science is taught involves the study of algorithms, based on cold, hard logic which, when turned into software, runs in a deterministic path from input to output. The idea of running a program backward from output to input does not make sense: you can’t “unsort” a list of numbers. The problem is even more complicated if our program is a scientific simulation or a data science application involving machine learning. In these cases, we learn to think about the results of a computation as representatives of internally generated probability distributions.

Some of the most interesting recent applications of AI to science have been the result of work on generative neural networks. These systems are trained to mimic the statistical distribution of scientific data sets. They can allow us to build “fake” human faces, or perfect but artificial spiral galaxies, or mimic the results of laboratory experiments. They can be extremely useful but, in the case of science, they tell us little about the underlying laws of nature. PPLs allow us to begin to rescue the underlying science in the generative computation.

## References

Some of these are links. Two can be found on arXiv and the Gen paper can be found in the ACM archive.

# eScience 2050: A Look Back

Abstract— This was originally written as an invited “vision” presentation for the eScience 2019 conference, but I decided to present something more focused for that event. Never wanting to let words go to waste, I decided to put a revised version here. It was written as a look back at the period of eScience from 2019 to 2050. Of course, as it is being published in 2019, it is clearly a work of science fiction, but it is based on technology trends that seem relatively clear. Specifically, I consider the impact of four themes on eScience: the explosion of AI as an eScience enabler, quantum computing as a service in the cloud, DNA data storage in the cloud, and neuromorphic computing.

# I.  Introduction

Predictions of the future are often so colored by the present that they miss the boat entirely.   The future, in these visions, tends to look like today but a bit nicer. Looking at images of the future from the 1950s we see pictures of personal helicopters (Figure 1) that clearly missed the mark.   But in some cases, the ideas are not so far off, as in the case of the woman ordering shirts “online” as in Figure 2.

Fig 1. Future personal transportation. Mechanics Illustrated, Jan 1951

Fig 2. Shopping “online” vision.

To look at the evolution of eScience in the period from 2019 to 2050, it can help to look at the state of the art in computing in 1960 and ask how much of today’s technology we can see emerging there. My first computer program was written when I was in high school. It was a Fortran version of “hello world” that consisted of one line of Fortran per punched card, as shown in Figure 3, and the entire program consisted of a deck of 12 cards.

Fig 3.   The Fortran statement CALL RCLASS(AAA,21,NNC, PX3,PX4)

If asked to predict the future then, I might have said that software and data in the years ahead would be stored on a punched paper tape over ten miles long.

Fifty years ago, the high points of computing looked like:

• The programming language FORTRAN IV was created in 1961.
• IBM introduced its System/360. A monster machine.
• Gordon Moore makes an observation about scale in 1965 (that became Moore’s law).
• IBM created the first floppy disk in 1967.
• The SABRE database system, developed by IBM to help American Airlines manage its reservations data.
• In 1970 Edgar Codd came up with the relational database idea. It was not implemented until 1974.
• The Internet: a UCLA student tried to send “login,” the first message over ARPANET, at 10:30 P.M. on October 29, 1969. (The system transmitted “l” and then “o” …. and then crashed.)

What we can extrapolate from this is that mainframe IBM computers will continue to dominate, databases are going to emerge as a powerful tool and there will be a future for computer-to-computer networking, but PCs and web search were not remotely visible on the horizon. The profound implication of Moore’s law would not be evident until the first microprocessors appear in 1979 and the personal computer revolution of the 1980s followed. Those same microprocessors led to the idea that dozens of them could be packaged into a “parallel computer”. This resulted in a period of intense research in universities and startups to build scalable parallel systems. While the first massively parallel computer was the Illiac IV which was deployed at NASA Ames in 1974, it was the work on clusters of microprocessors in the 1980s [1] that led to the supercomputers of 2019. Looking at computing in 1970, few would have guessed this profound transition in computing from mainframes like the System/360 to the massively parallel systems of 2019.

# II. eScience from 2019 Forward

### A.   The Emergence and Evolution of the Cloud

The World Wide Web appeared in 1991 and the first search engines appeared in the mid 90s. As these systems evolved, so did the need to store large segments of the web on vast server farms. Ecommerce also drove the need for large data centers. These data centers quickly evolved to offering data storage and virtual machines as a service to external users.   These clouds, as they became known, were designed to serve thousands to millions of concurrent clients in interactive sessions.

For eScience, clouds provided an important alternative to traditional batch-oriented supercomputers by supporting interactive data exploration and collaboration at scale. The cloud data centers evolved in several important ways. Tools like Google’s Kubernetes were widely deployed to support computation in the form of swarms of microservices built from software containers. Deploying Kubernetes still required allocating resources and configuring systems, and microservices are continuously running processes. Serverless computing is a newer cloud capability that avoids the need to configure VMs to support long-running processes when they are not needed.

Serverless computing is defined in terms of stateless functions that respond to events such as signals from remote instruments or changes in the state of a data archive. For example, an eScience workflow can be automatically triggered when instrument data arrives in the data storage archive. The analysis in the workflow can trigger other serverless functions to complete additional tasks.
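That chain of triggers can be sketched in a few lines of Python (a shape-only illustration: the event names and the in-process router below are invented stand-ins for a real cloud platform’s trigger machinery):

```python
HANDLERS = {}

def on(event_name):
    """Register a stateless function as the handler for one event type."""
    def register(fn):
        HANDLERS[event_name] = fn
        return fn
    return register

def emit(event_name, payload):
    """Stand-in for the cloud's event router."""
    return HANDLERS[event_name](payload)

@on("instrument-data-arrived")
def clean(payload):
    # drop missing samples, then hand off to the next stage via an event
    samples = [v for v in payload["samples"] if v is not None]
    return emit("data-cleaned", {"samples": samples})

@on("data-cleaned")
def analyze(payload):
    samples = payload["samples"]
    return {"mean": sum(samples) / len(samples)}

result = emit("instrument-data-arrived", {"samples": [1.0, None, 2.0, 3.0]})
```

Each function holds no state between invocations; all context travels in the event payload, which is what lets the platform run them on demand.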

The cloud of 2019 is defined in terms of services and not raw hardware and data storage. These services include:

• Streaming data analytics tools that allow users to monitor hundreds of live streams from remote instruments. Edge computing services evolved as a mechanism to off-load some of the analytics to small devices interposed between the instruments and the cloud.
• Planet scale databases allow a user to store and search data collections that are automatically replicated across multiple continents with guaranteed consistency.
• Services to support applications of machine learning became widely available across all the commercial public clouds. These services include speech recognition and generation, automatic language translation, computer vision tasks such as image recognition and automatic captioning.

The architecture of the cloud data centers continued to evolve toward ideas common in modern supercomputers. The first major change was the introduction of software defined networking throughout the data centers. Amazon began introducing GPU accelerators in 2012. This was followed by Google’s introduction of TPU chips to support the computational needs of deep learning algorithms coded in the Tensorflow system. The TPU (figure 4) is a device designed to accelerate matrix multiplication. It is based on systolic computation ideas of the 1980s and has been applied primarily to Google machine learning projects.

Fig 4.   Google TPU v3 architecture. See N. Jouppi, C. Young, N. Patil,  D. Patterson [2, 3].

The most radical evolutionary step has been taken by Microsoft in the Azure cloud. Project Brainwave represents an introduction of a mesh-connected layer of FPGA devices that span the entire data center (figure 5). These devices can be programmed to cooperate on the parallel execution of large computational tasks, including serving deep learning models.

Fig 5. Microsoft FPGA based Project Brainwave [4, 5]

### B.   Major Themes Going Forward from 2019

The following pages will describe the four major themes that are going to dominate the ten years forward from 2019.   They are

• The explosion of AI as an eScience enabler.
• Quantum computing as a service in the cloud.
• DNA data storage in the cloud.
• Neuromorphic computing.

## III. AI and eScience

Machine Learning went through a revolution in the years from 2008 to 2019 due to intensive work at universities and in companies. The cloud providers, including Google, Microsoft and others, were driven by the need to improve their search engines. Along the way they amassed vast stores of text and images. Using this data and the new deep learning technologies, they were able to build the very powerful language translation systems and image recognition and captioning services that were mentioned above. By 2017, applications to eScience were beginning to emerge.

### A.   Generative Neural Networks and Scientific Discovery

One of the more fascinating developments that arose from deep learning research was the rise of a special class of neural networks called generative models. The important property of these networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. These are most often seen in public as pernicious tools to create fake images and videos of important people saying things they never said.

What these networks are doing is creating a statistical distribution of data that, when sampled, produces data with properties that match nicely with the input data. When a scientist creates a simulation based on assumed parameters of nature, the simulation is evaluated against the statistical properties of the experimental observations. Generative models can be used to validate the simulation output in case the data is sparse. One of the most commonly used generative models is the Generative Adversarial Network (GAN), in which two networks are pitted against each other. As shown in Figure 6, a discriminator is trained to recognize the data that is from the real data set and a generator is designed to fool the discriminator. When converged, the generator mimics the data distribution of the real data. There are several interesting examples of how generative models have been used in science. We list several in [6], but two that stand out are an application in astronomy and one in drug design.

Fig 6. Generative Adversarial Network (GAN).

M. Mustafa and colleagues demonstrate how a slightly-modified standard Generative Adversarial Network (GAN) can be used to generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations [7]. The results illustrate how the generated images match their validation tests, but what is more important, the resulting images also pass a variety of statistical tests ranging from tests of the distribution of intensities to power spectrum analysis.

However, generative methods do not, by themselves, help with the problem of inferring the posterior distribution of the inputs to scientific simulations conditioned on the observed results, but there is value in creating ‘life-like’ samples. F. Carminati, G. Khattak and S. Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation. A well-tuned GAN can generate the test cases that fit the right statistical model at the rate needed for deployment [8].

In the area of drug design, Shahar Harel and Kira Radinsky have used a generative model to suggest chemical compounds that may be used as candidates for study [9]. They start with a prototype compound known to have some desirable properties. This is expressed as a sequence and fed to a convolutional network layer that allows local structures to emerge as shorter vectors that are concatenated. A final all-to-all layer is used to generate a sequence of mean and variance vectors for the prototype. This is fed to a “diversity layer” which adds randomness, as shown in Figure 7.

Fig 7. Shahar Harel and Kira Radinsky’s multi-layer generative network for drug design.

The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

### B.   Probabilistic Programming and Bayesian Inference

For very large experiments such as those conducted in the Large Hadron Collider, the scientists are interested in testing their models of the universe based on the record of particle collisions that were generated. It is often the case that the scientists have a detailed simulation of the experimental system based on the current theoretical models of the physics involved. These models are driven by setting parameters that correspond to assumptions about nature. When they run the experiment, they see new behaviors in the data, such as new particles emerging from collisions, and they would like to see which parameters, when fed to the simulation, can produce the output that corresponds to the experimental results. In other words, the simulation is a function taking parameters x to outputs y, and the interesting question is: given the outputs, what are the corresponding values of x? This is an inverse problem that is notoriously difficult to solve for large scientific problems. In terms of Bayesian statistics, they are interested in the posterior distribution P(x | y), where y is the experimental statistics.
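In standard notation, the quantity of interest and the reason it is hard are both visible in Bayes’ rule: the likelihood p(y | x) is only available by running the simulator, and the evidence term is an integral over all possible parameter settings:

```latex
p(x \mid y) \;=\; \frac{p(y \mid x)\, p(x)}{p(y)},
\qquad
p(y) \;=\; \int p(y \mid x)\, p(x)\, dx
```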

In 2018 and later, researchers began to study programming languages designed to express and solve problems like these. Pyro [10], Tensorflow Probability [11] and the Julia-based Gen [12] are a few of the Probabilistic Programming Languages (PPLs) introduced by 2019. One of the ideas in these languages is to build statistical models of behavior where program variables represent probability distributions and then, by looking at traces of the random choices associated with these variables during execution, one can make inferences about the random behaviors that influence outcomes.

These languages have been put to use in industries such as finance, and their role in eScience is now becoming apparent. A. Güneş Baydin et al. showed that PPLs can be used in very large Bayesian analyses of data from an LHC use case. Their approach allows a PPL to couple directly to existing scientific simulators through a cross-platform probabilistic execution protocol [13].

Probabilistic programming with PPLs will go on to become a standard tool used by eScientists.

### C.   AI-based Research Assistant

2018 was the year of the smart on-line bot and smart speaker. These were cloud based services that used natural language interfaces for both input and output. The smart speakers, equipped with microphones listen for trigger phrases like “Hello Siri” or “hello Google” or “Alexa” and recorded a query in English, extracted the intent and replied within a second. They could deliver weather reports, do web searches, keep your shopping list and keep track of your online shopping.

The impact of this bot technology will hit eScience when the AI software improves to the point that every scientist, graduate student and corporate executive has a personal cloud-based research assistant. Raj Reddy calls these Cognition Amplifiers and Guardian Angels. Resembling a smart speaker or desktop/phone app, the research assistant is responsible for the following tasks:

• Cataloging research data, papers and articles associated with its owner’s projects. The assistant will monitor the research literature looking for papers exploring the same concepts seen in the owner’s work.
• Coordinating meetings among collaborators and establishing coordination among other research assistants.
• Automatically sifting through other data from collaborators and open source archives that may be of potential use in current projects.
• Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

From the perspective of 2019 eScience, the claim that the Research Assistant of the future will be able to derive simulations and plan experiments may seem a bit “over the top”. However, progress on automatic code completion and generation was making great strides in 2019. Transformers [16] provide a second generation of sequence-to-sequence AI tools that outpaced the recurrent neural networks previously used.   Advances in programming tools based on similar technology show that AI can be used to generate code completions for C# method calls based on deep semantic mining of online archives of use cases.

We built a toy demo of the Research Assistant in [14] that could take natural language (spoken or typed) input and find related research articles from Bing, Wikipedia and ArXiv (see Figure 6.)

Fig 6. Demo cloud-based “research assistant” [14]

This demo prototype research assistant was built by composing a few cloud tools as shown in Figure 7. A much more sophisticated version can be built using a modern chatbot tool like RASA, but that will not address everything that is needed to reach the final goal.

Fig 7. The toy research assistant demo is hosted in an “elegant” cardboard container from Google’s voice kit running a Python script on a Raspberry Pi. It invokes Google’s speech-to-text, a text analysis service running on Algorithmia and uses Amazon’s Lex to generate voice. Depending on context, calls are made to Bing, Wikipedia or ArXiv.

While this demo was a toy, there has been serious work over the last few years to make progress on this topic. In 2018, Google introduced its Dataset Search service to provide easy data set discovery. The work of the Allen Institute for AI stands out [15]. Their Semantic Sanity project is a sophisticated research search engine that allows you to tell it basic topics that interest you and it will monitor ArXiv looking for important related contributions.   Aristo is “an intelligent system that reads, learns, and reasons about science”. It can reason about diagrams, math and understand and answer elementary school science questions.

One important problem that must be solved is extracting non-trivial mathematical concepts from research papers and notes so that similar papers can be found. In addition to Semantic Sanity, there is some other early work related to this problem. For example, researchers are using unsupervised neural networks to sift through PubChem and NLP techniques to identify potential components for materials synthesis [22].

## IV. The Rise of Quantum in the Cloud

Long viewed as a subject of purely theoretical interest, quantum computing emerged in 2018 as a service in the cloud. By 2019, Rigetti, IBM, D-Wave and Alibaba had live quantum computing services available.   Google and Microsoft followed soon after that.   The approaches taken to building a quantum computer differed in many respects.

The basic unit of quantum computing is the qubit, which obeys some important and non-intuitive laws of physics. Multiple qubits can be put into an “entangled” state in which an observation about one can affect the state of the others even when they are physically separated. One can apply various operators to qubits and pairs of qubits to form “logical” circuits, and these circuits are the stuff of quantum algorithms. A fact of life about quantum states is that once you measure them, they are reduced to ordinary bits. However, the probability that a bit is reduced to a zero or a one is determined by the unmeasured quantum state of the system. A good quantum algorithm is one in which the quantum circuit produces outputs with a probability distribution that makes solving a specific problem very fast.
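A two-qubit entangled (Bell) pair is small enough to simulate exactly; here is a plain-Python state-vector sketch (our own illustration of the underlying linear algebra, not code from qiskit or Q#):

```python
import math

# Two-qubit state as four amplitudes ordered [|00>, |01>, |10>, |11>].

def hadamard_on_first(state):
    """H on the first qubit mixes the |0x> and |1x> amplitudes."""
    s = 1.0 / math.sqrt(2.0)
    a00, a01, a10, a11 = state
    return [s * (a00 + a10), s * (a01 + a11), s * (a00 - a10), s * (a01 - a11)]

def cnot(state):
    """CNOT with the first qubit as control flips the second: swaps |10> and |11>."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

# start in |00>, apply H then CNOT to entangle the pair
state = cnot(hadamard_on_first([1.0, 0.0, 0.0, 0.0]))
# measurement probabilities are the squared amplitudes:
# half the time both qubits read 0, half the time both read 1, never mixed
probs = [a * a for a in state]
```

This is exactly why measuring one qubit of the pair tells you the other’s value: the only outcomes with nonzero probability are 00 and 11.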

The first important quantum computations involved solving problems in quantum chemistry. In fact, this was the role that Richard Feynman had suggested for a quantum computer when he first came up with the idea in the 1980s. The other area of eScience that quantum computers excelled at was certain classes of optimization problems.

Early quantum computers are programmed by literally composing quantum gates, as illustrated in Figure 8.

Fig 8. A simple quantum circuit to create an entangled pair of two qubits and measure the result.

Programming tools to build such circuits include IBM’s IBM-Q qiskit and Microsoft’s Q# language and compiler. The IBM hardware is based on what they call “noisy intermediate-scale quantum computers (NISQ)”. As shown in Figure 9, the individual qubits are non-linear oscillators. Tuned superconducting resonator channels are used for readout. Also, the qubits are tied together by additional superconducting channels.

Fig 9. An early 5-qubit IBM-Q computational unit. Source: IBM-Q website.

One problem with early quantum systems is that the qubits are very susceptible to noise and degrade quickly.   In other words, the error rate of quantum algorithms can be very high. The deeper the quantum circuit (the number of layers of “gates”), the more noise. A way around this is to add error-correcting redundancy into the circuit, but this means more qubits are required.   Quantum volume refers to a measure of the number of qubits needed for the error-corrected circuit times the depth of the corrected circuit. Unfortunately, early quantum computers had very low bounds on achievable quantum volume.

Microsoft took a different approach, building qubits based on the topological properties of braided quasiparticle paths. These are much more noise resistant and hence capable of supporting deeper circuits with less error correction.

## IV.  DNA-Based Storage

There were two problems with 2019-era storage technologies. First, it was not very stable: data degrades and storage devices fail. Second, there was not enough storage capacity to capture the data being generated. About 2.5 quintillion (2.5e+18) bytes of data were generated each day in 2018. Even if you cast out 99.9% of this as being of no long-term value, the remainder still leaves 2.5e+15 bytes per day. The amount of data stored in the cloud in 2018 was about 1.0e+18 bytes (an exabyte). In other words, the cloud held only 400 days’ worth of the important data. Obviously, another order of magnitude or more of the data needed to be discarded to give the data centers time to expand to contain this growth. A new storage technology was needed.
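The back-of-envelope arithmetic above can be checked in a few lines:

```python
bytes_per_day = 2.5e18        # ~2.5 quintillion bytes generated daily (2018 estimate)
kept = bytes_per_day * 0.001  # keep only the 0.1% judged to have long-term value
cloud_capacity = 1.0e18       # ~1 exabyte stored in the cloud in 2018
days = cloud_capacity / kept  # how long until the retained data fills the cloud
print(round(days))  # → 400
```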

DNA storage was first proposed in the 1960s, but not much happened with the idea until 1988, when researchers working with Harvard stored an image in the DNA of E. coli. In 2017 researchers at Columbia University published a method that allowed the storage of 215 petabytes (2.15e+17 bytes) of data per gram of DNA. And DNA is very stable over long periods of time. While this was promising, there was still a big problem: encoding and decoding the data remained an expensive manual process, and it was not practical to have lab scientists running around the back rooms of data centers managing wet processes.

In 2019, researchers at the University of Washington and Microsoft demonstrated an automated lab that could encode and decode data to DNA without human hands [19]. The system works by converting the ones and zeros of digital data into the As, Ts, Cs and Gs that make up the building blocks of DNA. These encoded sequences are then fed into synthesizing systems. Their next breakthrough was to reduce the entire “lab” to a chip that uses microfluidics to route droplets of water around a grid to enact the needed chemistry steps. They also produced a sophisticated software stack called Puddle that allows scientists to program the chip with conventional high-level languages [20].
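The idea of converting ones and zeros into As, Ts, Cs and Gs can be illustrated with a toy codec. Note this is only the naive two-bits-per-base mapping; real DNA storage systems add run-length constraints and error-correcting codes, which this sketch deliberately omits.

```python
# Toy mapping: 00→A, 01→C, 10→G, 11→T (illustrative only; real encoders
# avoid long homopolymer runs and add error correction).
TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
FROM_BASE = {b: v for v, b in TO_BASE.items()}

def encode(data: bytes) -> str:
    """Convert each byte into four bases, two bits per base."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)

def decode(strand: str) -> bytes:
    """Reverse the mapping: four bases back into one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for b in strand[i:i + 4]:
            byte = (byte << 2) | FROM_BASE[b]
        out.append(byte)
    return bytes(out)

print(encode(b"\x1b"))  # → ACGT  (0x1b = 00 01 10 11)
```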

Other research demonstrated ways in which DNA-encoded data could be searched and structured much like a relational database. As the costs came down, this became the standard cloud storage technology.

## VI. Neuromorphic Computing

It was long a dream of computer designers to build systems that mimic the connectome of the biological brain. A major research initiative in Europe, the Human Brain Project, looked at the possibility of simulating a brain on traditional supercomputers. As that proved unrealistic, they turned to the study of special hardware devices that can simulate the behavior of neurons. Technology for building artificial neurons progressed in university and industry labs; the Stanford Neurogrid, for example, can simulate six billion synapses. In 2017 Intel introduced the Loihi neuromorphic research test chip. The device is a many-core mesh of 128 neuromorphic cores, each of which contains 1,024 primitive spiking neural units. The neural units are current-based-synapse, leaky integrate-and-fire neurons [21].
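A leaky integrate-and-fire neuron, the primitive unit mentioned above, is simple to sketch: the membrane potential leaks toward a resting value, input current pushes it up, and crossing a threshold emits a spike and resets it. The parameter values below are arbitrary illustrations, not Loihi's actual constants.

```python
def lif_step(v, i_in, dt=1.0, tau=10.0, v_rest=0.0, v_thresh=1.0):
    """One discrete update of a leaky integrate-and-fire neuron.

    v leaks toward v_rest with time constant tau while input current
    i_in drives it upward; crossing v_thresh produces a spike and a reset.
    """
    v = v + dt * ((v_rest - v) / tau + i_in)
    if v >= v_thresh:
        return v_rest, True   # spike, then reset to rest
    return v, False

# Drive the neuron with a constant current and count the spikes.
v, spikes = 0.0, 0
for _ in range(50):
    v, fired = lif_step(v, i_in=0.2)
    spikes += fired
print(spikes)
```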

In addition to the Loihi chip, Intel has released Pohoiki Beach, a system comprised of 64 Loihi chips and a substantial software stack to allow application development. Because overall power consumption is 100 times lower than GPU-based neural networks, Intel’s application targets are autonomous vehicles and robots. While full realization of true biological brain-like functionality may not come until the second half of the 21st century, it is nonetheless an exciting step forward.

## VII.   Conclusions

eScientists in 2019 found themselves at an important inflection point in the technology they could deploy in doing science. The cloud has evolved into a massive on-line, on-demand, heterogeneous supercomputer. It not only supports traditional digital simulation; it will also support hybrid quantum-digital computation. It will soon allow applications to interact with robotic sensor nets controlled by neuromorphic fabrics. Storage of research data will be limitless and backed up by DNA-based archives.

One of the most remarkable features of computing in the 21st century has been the evolution of software. Programming tools have evolved into very deep stacks that exploit AI methods, enabling a scientific programmer to accomplish more with a few lines of Julia in a Jupyter notebook than was remotely possible programming the computers of 1980. New probabilistic programming languages that combine large-scale simulation with deep neural networks show promise of making Bayesian inference an easy-to-use tool in eScience.

The role of AI was not limited to the programming and execution of eScience experiments. Users of AI research assistants in 2050 can look back at the “smart speakers” of 2019 and view them as laughably primitive, just as users of the 2019 World Wide Web looked back at the 1970s era when computers were driven by programs written on decks of punched cards.

References

1. G. Fox, R. Williams, P. Messina, “Parallel Computing Works!”, Morgan Kaufmann, 1994.
2. N. Jouppi, C. Young, N. Patil, D. Patterson, “A Domain-Specific Architecture for Deep Neural Networks”, Communications of the ACM, September 2018, Vol. 61 No. 9, Pages 50-59
3. https://cloud.google.com/tpu/docs/images/tpu-sys-arch3.png
4. Chung, et al. “Serving DNNs in Real Time at Datacenter Scale with Project Brainwave.” IEEE Micro 38 (2018): 8-20.
5. Fowers et al., “A Configurable Cloud-Scale DNN Processor for Real-Time AI,” 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018, pp. 1-14
6. D. Gannon, “Science Applications of Generative Neural Networks”, https://esciencegroup.com/2018/10/11/science-applications-of-generative-neural-networks/, 2018
7. Mustafa, et. al. “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018)
8. Carminati, G. Khattak, S. Vallecorsa, “3D convolutional GAN for fast Simulation”, https://www.ixpug.org/images/docs/IXPUG_Annual_Spring_Conference_2018/11-VALLECORSA-Machine-learning.pdf
9. Harel and K. Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” http://kiraradinsky.com/files/acs-accelerating-prototype.pdf
10. Bingham, et al, “Pyro: Deep universal probabilistic programming.” Journal of Machine Learning Research (2018).
11. Tensorflow Probability, https://medium.com/tensorflow/an-introduction-to-probabilistic-programming-now-available-in-tensorflow-probability-6dcc003ca29e
12. Gen, https://www.infoq.com/news/2019/07/mit-gen-probabilistic-programs
13. Güneş Baydin, et al, “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale”, arXiv:1907.03382v1
14. Gannon, “Building a ‘ChatBot’ for Scientific Research”, https://esciencegroup.com/2018/08/09/building-a-chatbot-for-scientific-research
15. Allen Institute for AI, Semantic Sanity and Aristo, https://allenai.org/demos
16. Vaswani et al., “Attention Is All You Need”, arXiv:1706.03762v5
17. Chong, Hybrid Quantum-Classical Computing https://www.sigarch.org/hybrid-quantum-classical-computing/
18. Shaydulin, et al., “A Hybrid Approach for Solving Optimization Problems on Small Quantum Computers,” in Computer, vol. 52, no. 6, pp. 18-26, June 2019.
19. Takahashi, B. Nguyen, K. Strauss, L. Ceze, “Demonstration of End-to-End Automation of DNA Data Storage”, Nature Scientific Reports, vol. 9, article 4998, 2019
20. Willsey, et al., “Puddle: A Dynamic, Error-Correcting, Full-Stack Microfluidics Platform”, ASPLOS’19, April 13–17, 2019.
21. Intel Loihi Neuromorphic Chip, https://en.wikichip.org/wiki/intel/loihi
22. E. Kim, et al., “Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning”, Chem. Mater. 2017, 29, 21, 9436–9444.

# Introduction

Wikipedia has a pretty good definition of workflow: an orchestrated and repeatable pattern of activity, enabled by the systematic organization of resources into processes that transform materials, provide services, or process information. Doing science involves the workflow of repeating and documenting experiments and the related data analysis; consequently, managing workflow is critically important to the scientific endeavor. Dozens of projects have built tools to simplify the process of specifying and managing workflow in science. We described some of these in our 2007 book “Workflows for e-Science”; another Wikipedia article lists over a dozen scientific workflow systems, and this page lists over 200. Many of these systems were so narrowly focused on a single scientific domain or set of applications that they never saw broad adoption. However, there are a few standouts, and some of them have demonstrated they can manage serious scientific workflow.

Pegasus from ISI is the most well-known and widely used framework for managing science workflow. To use it in the cloud, you deploy a Condor cluster on a set of virtual machines; once that is in place, Pegasus provides the tools you need to manage large workflow computations. Makeflow from the University of Notre Dame is another example. Based on a generalization of the execution model of the old Unix Make system, Makeflow can use Condor, but it also has a native distributed platform called WorkQueue. Makeflow is commonly used on large HPC supercomputers, and its developers also claim an implementation on AWS Lambda.

# Workflow in the Cloud.

Doing scientific computing in the cloud is different from computing in a traditional scientific data center built around access to a large supercomputer. A primary difference is that the cloud supports models of computing not associated with traditional batch computing frameworks. The cloud is designed to support continuously running services; these may be simple web services or complex systems composed of hundreds of microservices. The cloud is designed to scale to your needs (and budget) on demand. The largest commercial public clouds from Amazon, Google and Microsoft offer far more than compute cycles: services to build streaming applications, computer vision and speech AI services, scalable data analytics and database services, management of edge devices, robotics and now attached quantum processors. While there are tools to support batch computing (Microsoft Azure even has an attached Cray), the cloud is also an excellent host for interactive computational experimentation.

Christina Hoffa et al., in “On the Use of Cloud Computing for Scientific Workflows”, describe some early 2008 experiments using cloud technology for scientific workflow. The cloud of 2019 presents many possibilities they did not have access to. Two of these are “cloud native” microservice frameworks such as Kubernetes, and serverless computing models.

Kubernetes has been used for several workflow systems. An interesting example is Reana from CERN, a research data analysis platform that runs on your desktop or on a cloud Kubernetes cluster. Reana supports several workflow languages, but the one most frequently used is CWL, the Common Workflow Language, which is rapidly becoming an industry standard. CWL is used in a number of other cloud workflow tools, including Arvados from Veritas Genetics, a version of the popular Apache Airflow workflow tool, and several other systems with implementations “in progress”. Argo is another workflow tool designed to work with Kubernetes.

## Workflow Using Serverless Computing

Serverless computing is a natural fit for workflow management. It allows applications to run on demand without regard to compute resource reservation or management. Serverless computations are triggered by events; typical event types include:

• Messages arriving on Message Queues
• Changes in Databases
• Changes in Document Stores
• Service APIs being invoked
• Device sensor data sending new updates

Each of these event types is central to workflow automation. AWS Lambda was the first production implementation of a serverless platform, but not the last: implementations from Microsoft, IBM and Google are now available, and the open-source OpenWhisk implementation is available for OpenStack and other platforms.

Serverless computing is built around the concept of “function as a service”, where the basic unit of deployment is not a VM or container but the code of a function. When the function is deployed, it is tied to a specific event category such as one of those listed above. These functions are intended to be relatively lightweight: not a massive memory footprint, and a short execution time. The semantics of function execution dictate that the functions are stateless, which allows many instances of the same function to respond to events at the same time without conflict. The function instances respond to an event, execute and terminate. While the function itself is stateless, it can affect the state of other objects during its brief execution: it can write files, modify databases, and invoke other functions.

### Workflow State Management

Most scientific workflows can be described as a directed acyclic graph (DAG) in which the nodes are the computational steps. An arc in the graph represents the completion of one task signaling another that it may start. For example, the first task writes a file to a storage container, and that triggers an event which fires the subsequent task waiting for the data in that file. If the graph takes the shape of a tree, where one node creates events that trigger one or more other nodes, the translation to serverless is straightforward: each node of the graph can be compiled into one function. (We show an example of this case in the next section.)

One of the challenges of using serverless computing for workflow is state management. If the “in degree” of a node is greater than one, then more than one event is required to trigger the node. Suppose there are two events that must happen before a node is triggered: if the function is stateless, it cannot remember that one of the conditions has already been met. The problem is that the graph itself has state, defined by which nodes have been enabled for execution. For example, Figure 1 is a CWL-like specification of such a case: NodeC cannot run until NodeA and NodeB both complete.

Figure 1. A CWL-like specification of a three step workflow where the third step requires that both the first step and second step are complete.   The first and second step can run simultaneously or in any order.
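A minimal CWL-style sketch of the dependency pattern Figure 1 describes might look like the following, with a third step consuming the outputs of the first two. All step, tool and file names here are hypothetical illustrations, not the actual specification from the figure.

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  input_file: File
steps:
  nodeA:
    run: appA.cwl          # command-line tool wrappers; names are illustrative
    in: {infile: input_file}
    out: [outA]
  nodeB:
    run: appB.cwl          # nodeA and nodeB have no mutual dependency,
    in: {infile: input_file}  # so they may run in either order or in parallel
    out: [outB]
  nodeC:
    run: appC.cwl          # cannot start until both nodeA and nodeB complete
    in: {fileA: nodeA/outA, fileB: nodeB/outB}
    out: [result]
outputs:
  final:
    type: File
    outputSource: nodeC/result
```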

One common solution to this problem is to assume there is a persistent, stateful workflow manager that holds the graph and keeps track of its global state. However, it is possible to manage the workflow with a single stateless function. To see how this can be done, notice that in the above specification each step node requires the existence of one or more input files, and the computation at that node produces one or more output files. As shown in Figure 2 below, a stateless workflow function listens for changes to the file system (or a database).

Figure 2. A workflow manager/listener function responds to events in the file system created by the execution of the applications. As shown in the next section, if the app is small it may be embedded in the manager; otherwise it can be containerized and run elsewhere in the cloud.

Here we assume that the application invocations, which are shown as command-line calls in Figure 1, are either executed in the Lambda function itself or by invoking the application wrapped as a Docker container running as a microservice in Kubernetes. When an application terminates, it deposits its output file in the file system, which triggers an event for the workflow manager/listener function.

The workflow manager/listener must then decide which step nodes are affected by this event and verify that all the conditions for each such node are satisfied. For example, if a node requires two files, it must check that both are there before invoking the associated application. There is no persistent state in the manager/listener; the state is all in the file system. To run multiple workflow instances concurrently, each event and file must carry an instance ID in its metadata.
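The manager/listener logic can be sketched in a few lines. This is a hypothetical illustration of the scheme described above (the step table, file names and `run_step` callback are all invented for the example): on every file event, the stateless function rescans the store and fires any step whose inputs are all present and whose output does not yet exist.

```python
# Hypothetical step table mirroring the three-node graph of Figure 1:
# each step names the input files it needs and the output file it produces.
STEPS = {
    "nodeA": {"needs": {"input.txt"}, "makes": "outA.txt"},
    "nodeB": {"needs": {"input.txt"}, "makes": "outB.txt"},
    "nodeC": {"needs": {"outA.txt", "outB.txt"}, "makes": "result.txt"},
}

def on_file_event(files_present, run_step):
    """Stateless manager: all workflow state lives in the file store.

    Invoke every step whose inputs all exist and whose output does not yet.
    """
    for name, step in STEPS.items():
        if step["needs"] <= files_present and step["makes"] not in files_present:
            run_step(name)
```

With only the initial input present, a call fires nodeA and nodeB; once both of their outputs appear, the next event fires nodeC, exactly the behavior the stateful manager would have provided.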

## A Simple Example of Workflow using AWS Lambda

In the following paragraphs we describe a very simple implementation of a document classification workflow using AWS Lambda. The workflow is a simple tree with two levels, and our goal here is to demonstrate the levels of concurrency possible with a serverless approach. More specifically, we demonstrate a system that looks for scientific documents in an AWS S3 bucket and classifies them by research topic. The results are stored in an Amazon DynamoDB table. Each document consists of a list of “scientific phrases”. An example document is below.

“homology is part of algebraic topology”,
“The theory of evolution animal species genes mutations”,
“the gannon-lee singularity tells us something about black holes”,
“supercomputers are used to do very large simulations”,
“clifford algebras and semigroup are not fields and rings”,
“galaxies have millions of stars and some are quasars”,
“deep learning is used to classify documents and images”,
“surgery on compact manifolds of dimension 2 yields all possible embedded surfaces”

(In the experiments the documents are the titles of papers drawn from ArXiv.)

The first step of the workflow classifies each statement according to basic academic field: Physics, Math, Biology and Computer Science. (Obviously there are more fields than this, but these covered most of the collection we used to train the classifiers.) Once a sentence is classified by topic, it is passed to the second stage of the workflow, where it is classified by subcategory. For example, if a sentence is determined to belong to Biology, the subcategories recognized include Neuro, Cell Behavior, Genomics, Evolution, Subcellular, Health-Tissues&Organs and Molecular Networks. Physics subareas are Astro, General Relativity and Quantum Gravity, Condensed Matter, High Energy, Mathematical Physics, Quantum Mechanics and Educational Physics. Math is very hard, so the categories are simple: algebra, topology, analysis and other. The output in the DynamoDB table for this list of statements is shown in Figure 3 below.
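The two-stage routing can be sketched with toy stand-ins for the trained classifiers. The keyword tables and subcategory labels below are invented for illustration; the real workflow uses trained ML models at both stages.

```python
# Toy keyword classifiers standing in for the trained models (illustrative only).
MAIN = {"galaxies": "Physics", "genes": "Biology",
        "topology": "Math", "deep learning": "Computer Science"}
SUB = {"Physics": {"galaxies": "Astro"},
       "Biology": {"genes": "Genomics"},
       "Math": {"topology": "topology"},
       "Computer Science": {"deep learning": "AI"}}

def classify(statement):
    """Stage one picks the major field; stage two picks the subcategory."""
    for kw, topic in MAIN.items():
        if kw in statement:
            sub = next((s for k, s in SUB[topic].items() if k in statement), "other")
            return topic, sub
    return "other", "other"

print(classify("galaxies have millions of stars and some are quasars"))
```

In the real workflow each stage is a separate Lambda function, so the second-stage classifiers for different fields run concurrently.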

Figure 3. Output from the sample document in an AWS DynamoDB table. The “cmain predict” column is the output of the first workflow phase and the “cpredicted” column is the output of the second phase of the workflow.

## The AWS Lambda details.

A Lambda function is a very simple program that responds to an event, does some processing and exits. There is a complete command-line interface to Lambda, but AWS also has a very nice portal interface for building Lambda functions in a variety of standard languages. I found the web portal far superior to the command line because it gives you great debugging feedback and easy access to the function logs that are automatically generated each time your Lambda function is invoked.

The example below is a very simple Python Lambda function that waits for a document to arrive in an S3 storage bucket. When a document is placed in the bucket, the “lambda_handler” function is automatically invoked. In this simple example the function does three things. It grabs the name of the new S3 object and the bucket name. It then opens and reads the document (as text); if the document is not text, the system throws an error and the evidence is logged in the AWS CloudWatch log file for this function. In the last step, it saves the result in a DynamoDB table called “blanklambda”.
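A sketch of such a handler follows. This is a reconstruction under the description above, not the exact source: the event parsing follows the standard S3 trigger event shape, and the table schema (a partition key plus a text attribute) is an assumption.

```python
import json

def extract_s3_record(event):
    """Pull the bucket name and object key out of a standard S3 trigger event."""
    rec = event["Records"][0]["s3"]
    return rec["bucket"]["name"], rec["object"]["key"]

def lambda_handler(event, context):
    # boto3 is provided by the Lambda runtime; importing it here lets the
    # helper above be exercised without AWS credentials.
    import boto3
    bucket, key = extract_s3_record(event)
    s3 = boto3.client("s3")
    # Read the new object as text; a non-text object raises here and the
    # error lands in the CloudWatch log for this function.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # Save the result in the DynamoDB table (attribute names are assumptions).
    table = boto3.resource("dynamodb").Table("blanklambda")
    table.put_item(Item={"doc": key, "text": body})
    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```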

To make this work you need to assign the correct permission policies to the Lambda function. In this case we need access to S3, to DynamoDB, and the basic Lambda execution role, which includes permission to create the execution logs.

To tell the system which bucket to monitor, you must go to the S3 bucket you want to monitor and, under the property called “Events”, add a reference to your new Lambda function.

In our workflow example we used five Lambda functions: a main topic classifier, and a classifier Lambda function for each of the subtopics. It is trivial to make one Lambda function create an event that will trigger another. We send each document as a JSON string encoding our list of statements.

The call function is:

```python
import json
import boto3

Lam = boto3.client("lambda", region_name="us-east-1")
resp = Lam.invoke(FunctionName="bio-function",
                  InvocationType="RequestResponse",
                  Payload=json.dumps(statementlist))  # statementlist: the list of statements
```


The “bio-function” Lambda code receives the payload as a text string and converts it back to a list.

The workflow is pictured in Figure 4 below. When a document containing a list of statements lands in S3, it invokes an instance of the workflow. The main classifier Lambda function invokes one instance each of the sub-classifiers, and those four all run concurrently. As illustrated in Figure 5, when a batch of 20 document files lands in S3, as many as 100 Lambda instances are running in parallel.

Figure 4. When a document list file lands in a specified S3 bucket, it triggers the main Lambda function, which determines the general topic of each individual statement in the document list. The entire document is sent to each of the subtopic specialist Lambda functions, which classify the statements into subtopics and place the results in the DynamoDB table.

Figure 5. As multiple documents land in S3, new instances of the workflow run in parallel. A batch of 20 documents arriving all at once will generate 5×20 = 100 concurrent Lambda invocations.

### AWS Lambda’s Greatest Limitation

Lambda functions are intended to be small and to execute quickly. In fact, the default execution time limit is 3 seconds, but that is easily increased. What cannot be increased beyond a fixed limit is the size of the deployment package: roughly 250 MBytes. To import Python libraries that are not included in the basic package, you must add “layers”, which are completely analogous to the layers in the Docker unified file system. In the case of Python, a layer is a zipped file containing the libraries you need. Unfortunately, there is only one standard library layer available for Python 3, and that includes NumPy and SciPy. However, for our classification algorithms we need much more. A grossly simplified (and far less accurate) version of our classifier requires only scikit-learn, but that is still a large package.

It takes a bit of work to build a custom Lambda layer. To create our own version of a scikit-learn layer, we turned to an excellent blog, “How to create an AWS Lambda Python Layer” by Lucas. We were able to do this and build a layer that also included NumPy. Unfortunately, we could not also include the model data, so we had the Lambda instance dynamically load that data from S3. Because our model is simplified, the load takes less than 1 second.

We tested this with 20 document files, each containing 5 documents uniformly distributed over the 4 major topics. To understand the performance, we captured time stamps at the start and end of each instance of the main Lambda invocation. Another interesting feature of AWS Lambda execution is that the amount of CPU resource assigned to each instance of a function invocation is governed by the amount of memory you allocate to it. We tried our experiments with two different memory configurations: 380MB and 1GB.

Figure 6. Each horizontal line represents the execution of one Lambda workflow for one document. Lines are ordered by start time from bottom to top. The figure on the top shows the results when the Lambda memory configuration was set to 380MB; the figure on the bottom shows the results with 1GB of memory allocated (and hence more CPU resource).

The results are shown in Figure 6. There are two interesting points to note. First, the Lambda functions do not all start at the same time: the system seems to see that there are many events generated by S3, and after a brief start it launches an additional set of workflow instances. We don’t know the exact scheduling mechanism used. The other thing to notice is that the increase in memory (and CPU resource) made a profound difference. While the total duration of each execution varied by up to 50% between invocations, the mean execution time for the large-memory case was less than half that of the small-memory case. In the 1GB case the system ran all 20 documents with a speed-up of 16 over the single-execution case.

Having to resort to the simplified (and far less accurate) small classifier was a disappointment. An alternative, which in many ways is more appropriate for scientific workflows, is to use the Lambda functions as the coordination and step-trigger mechanism and have them invoke remote services to do the real computation. We ran instances of our full model as Docker containers on the AWS Lightsail platform and replaced the local invocations of the small classifier model with remote invocations of the full model, as illustrated in Figure 7.

Figure 7.   Invoking the remote classifiers as web services.

Results in this configuration were heavily dependent on the scale-out of each classifier service. Figure 8 illustrates the behavior.

Figure 8. Running 20 documents simultaneously pushed to S3, using remote web-service invocations for the classification step. Once again these are ordered from bottom to top by Lambda start time.

As can be seen in the diagram, the fastest time using a remote service for the full classifier was nearly three times faster than the small classifier running in the Lambda function. However, the requests to the service caused a bottleneck which slowed down the others. Our version of the service ran on two servers with a very primitive random scheduler for load balancing. A better design would involve many more servers dynamically allocated to meet the demand.

# Conclusion

Serverless computing is a technology that can simplify the design of some scientific workflow problems. If you can define the workflow completely in terms of the occurrence of specific triggers, and the graph of the execution is a simple tree, it is easy to set up a collection of functions that will be triggered by the appropriate events. If the events don’t happen, the workflow is not run, and you do not have to pay for servers to host it. (The example from the previous section was designed and debugged on AWS over a few weeks and run many times; the total bill was about $5.)

There are two downsides to using serverless as part of a workflow execution engine:

1. The complexity of handling a workflow consisting of a complex non-tree DAG is large. A possible solution to this was described in Figure 2 and the accompanying text. The task of compiling a general CWL specification into such a workflow manager/listener is non-trivial because CWL specifications can be very complex.
2. A topic only briefly discussed here is how to get a Lambda function to execute the actual application code at each step. In most workflow systems the work steps are command-line-invoked applications that take input files and produce output files. To invoke these from a serverless Lambda function in the cloud, each worker application must be rendered as a service that can be invoked from the Lambda function. This can be done by containerizing the applications and deploying them in pods on Kubernetes or another microservice architecture, as we illustrated in Figure 2. In other scenarios the worker applications may be submitted to a scheduler and run on a batch system.

While we only looked at AWS Lambda here, we will next consider Azure Functions. More on that later. The source code for the Lambda functions used here can be found at this GitHub repo.

Note: there are various tricks for getting Lambda to use larger ML libraries, but we didn’t go down that road.
One good example is described in a blog by Sergei on Medium. Another look at doing ML on Lambda is described by Alek Glikson in this 2018 article.

# Quantum Computing in the Cloud Part 2. A look at Quantum AI Algorithms

In Part 1 of this article we looked at how quantum computers are now beginning to appear as cloud-hosted attached processors, and we also looked at how one can go about programming them. The most important question that Part 1 failed to address is why quantum computers are interesting. The most well-known uses of a quantum computer are building interesting quantum chemistry models and factoring large numbers to break cryptographic codes. The chemistry application was the original suggestion from Richard Feynman in 1981 for building a quantum computer. The factoring application is important, but it will only remain interesting until the cryptographers find better algorithms.

Another application area that has received a lot of attention lately is applying quantum computing to machine learning (ML). This is not surprising. (Everything in AI is hot now, so quantum AI must be incandescent.) Much of the published work on quantum ML is very theoretical, and this article will try to provide pointers to some of the most interesting examples, but we will not go deep into the theory. However, there are a few algorithms that run on the existing machines, and we will illustrate one that has been provided by the IBM community. This algorithm builds a support vector machine for binary classification, and it runs on the IBM-Q system. More on that in the second half of this article.

The following is a list of the papers and links that formed the source for the brief summary that follows. They are listed in chronological order. Several of these are excellent surveys with lots of technical details.

1. S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning, https://arxiv.org/abs/1307.0411 (2013)
2. S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum principal component analysis, https://arxiv.org/abs/1307.0401 (2013)
3. N. Wiebe, A. Kapoor and K. M. Svore, Quantum Algorithms for Nearest-Neighbor Methods for Supervised and Unsupervised Learning, https://arxiv.org/abs/1401.2142 (2014)
4. M. Schuld, I. Sinayskiy and F. Petruccione, An introduction to quantum machine learning, https://arxiv.org/abs/1409.3097v1 (2014)
5. N. Wiebe, A. Kapoor, and K. M. Svore, Quantum Deep Learning, https://arxiv.org/abs/1412.3489v2 (2015)
6. G. Verdon, M. Broughton and J. Biamonte, A quantum algorithm to train neural networks using low-depth circuits, https://arxiv.org/pdf/1712.05304.pdf
7. J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, S. Lloyd, Quantum Machine Learning, https://arxiv.org/pdf/1611.09347v2.pdf (2018)
8. S. Lloyd and C. Weedbrook, Quantum generative adversarial learning, https://arxiv.org/pdf/1804.09139v1.pdf (2018)
9. X. Gao, Z.-Y. Zhang, L.-M. Duan, A quantum machine learning algorithm based on generative models, http://advances.sciencemag.org/content/4/12/eaat9004.full (2018)
10. Dawid Kopczyk, “Quantum machine learning for data scientists” (2018)

## A Tour of Quantum ML Algorithms

The normal breakdown of machine learning algorithms divides them into two categories: supervised and unsupervised algorithms. Supervised algorithms include those where a quantity of data and the “answers” are known, and we seek to build a model that will allow us to consider previously unseen data. These include simple regression, support vector machines (which we will consider in detail later), neural networks, restricted Boltzmann machines and deep learning classifiers. Unsupervised algorithms include clustering, dimensionality-reduction methods such as principal component analysis, and generative adversarial networks.
Because most of the work on quantum ML is theoretical, the measure of success is based on complexity theory, and the holy grail for quantum algorithms is a proof of exponential speed-up of the quantum algorithm over the best sequential counterpart. For classical ML algorithms that run in polynomial time (as a function of input data size), an exponential speed-up would be a quantum algorithm that runs in polylogarithmic time or less. But if the input data must be loaded into the quantum computer, then any sequential read of the data requires linear time and we cannot boast true exponential speed-up. One way around this is to assume the data has been preloaded into a "quantum random access memory" (qRAM). An alternative is to consider problems where the data is generated by an internal algorithm or a natural process.

A commentary in Nature by Scott Aaronson entitled "Read the fine print" (2015) points out several difficulties with many quantum algorithms. These include the transformation of a data vector of dimension N into log(N) qubits, the requirement that the matrices involved be well conditioned, and the fact that the output vectors may be difficult to read if they contain components that differ greatly in scale. In general, the "quantum supremacy" of a quantum algorithm may be theoretically true, but it may be lost to fast classical algorithms in real applications if nasty constant factors dominate.

### Looking at a Few Algorithms

A frequent assumption about the data is that it is composed of vectors of dimension N = 2^n, with real or complex values. Let v = (v1, v2, … , vN) be such a vector; then we assume we can embed it into n qubits as

|v> = 1/||v|| ( v1|0…00> + v2|0…01> + … + vN|1…11> )

so that the components of v become the amplitudes of the 2^n basis vectors. If this can be done, it has some nice properties that can be used in some important quantum subroutines in the quantum ML algorithms.

### K-Means Algorithm

K-means is one of the standard classical ML unsupervised algorithms. It uses the vector-to-quantum-state map described above.
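As an illustration of this amplitude-encoding idea (my own sketch, not code from the referenced papers), the following NumPy snippet normalizes a length-2^n vector so its components can serve as the amplitudes of an n-qubit state:

```python
import numpy as np

def amplitude_encode(v):
    """Map a length-2^n vector to the amplitude vector of an n-qubit state."""
    v = np.asarray(v, dtype=float)
    n = int(np.log2(v.size))
    assert v.size == 2 ** n, "vector length must be a power of 2"
    amplitudes = v / np.linalg.norm(v)   # unit-norm state vector
    return n, amplitudes

n, psi = amplitude_encode([3.0, 4.0, 0.0, 0.0])
print(n)                 # 2 qubits suffice for a length-4 vector
print(np.sum(psi ** 2))  # squared amplitudes sum to 1
```

Note that the encoding stores N numbers in only n = log2(N) qubits, which is the source of the claimed exponential compression, and also the source of the data-loading caveat above.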
If we define a second vector w in a similar way, and then define the two states

|θ> = 1/√2 ( |0>|v> + |1>|w> )
|φ> = 1/√Z ( ||v|| |0> – ||w|| |1> ),  where Z = ||v||^2 + ||w||^2,

a bit of algebra will verify that

||v – w||^2 = 2Z |<θ | φ>|^2.

This implies we can compute Euclidean distance if we can compute |<θ | φ>|^2. It turns out there is a simple quantum circuit, called the swap test, that can help us estimate this value. Given this distance algorithm and another standard subroutine called Grover's algorithm (which allows us to pick the best from a set of alternatives) we can build a K-means algorithm. The quantum advantage of this algorithm only shows up when N is large, because the complexity of the classical algorithm with M data points is O(MN) while the quantum algorithm is O(M log(N)).

### Principal Component Analysis

PCA is based on finding the eigenvectors of the covariance matrix of a set of input vectors. The quantum version of this algorithm uses an important subroutine called quantum phase estimation, which is a method to find the eigenvalues of a unitary matrix. For unitary matrices the eigenvalues are always of the form

λ = e^(2πiθ)

and the algorithm computes the phase θ. For PCA in general, we start with a set of vectors {vi, i = 1..M}. The jth component of vi can be thought of as the jth feature of the vector. To simplify the discussion, assume the mean of each feature over the M vectors is 0; in other words, replace each vi by vi – (1/M) Σ(k=1..M) vk. We are looking for the eigenvalues and eigenvectors of the matrix C = V^t · V, where V is the M×N matrix whose rows are the vectors {vi, i = 1..M}. Another way to describe C is that it is N×N with

Ci,j = Σ(k=1..M) vk,i · vk,j

so C is M times the covariance matrix of our set of normalized vectors. The eigenvectors of C form a new basis that diagonalizes C, with the eigenvalues on the diagonal. In the quantum algorithm we use the map of our vector data into quantum states as above,

|vi> = 1/||vi|| Σ(j=1..N) vi,j |j>,

where we have also normalized each vector to unit length.
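The K-means distance trick is easy to check numerically. The sketch below (my own illustration, not from the referenced papers) evaluates the overlap classically, using the fact that contracting |θ> against |φ> over the ancilla qubit leaves the data-space vector (v – w)/√(2Z):

```python
import numpy as np

def distance_sq_via_overlap(v, w):
    """Classically evaluate 2*Z*|<theta|phi>|^2 for the K-means distance trick."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    nv, nw = np.linalg.norm(v), np.linalg.norm(w)
    Z = nv ** 2 + nw ** 2
    # <phi|theta> is a partial inner product over the ancilla qubit; it leaves
    # the data-space vector (||v|| v-hat - ||w|| w-hat) / sqrt(2Z) = (v - w)/sqrt(2Z)
    overlap_vec = (nv * (v / nv) - nw * (w / nw)) / np.sqrt(2 * Z)
    return 2 * Z * np.sum(overlap_vec ** 2)

v, w = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(distance_sq_via_overlap(v, w))   # 25.0
print(np.sum((v - w) ** 2))            # 25.0, the ordinary Euclidean value
```

On a quantum computer the overlap itself would be estimated by the swap test rather than computed directly.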
Now we build the Hermitian operator

ρ = 1/M Σ(i=1..M) |vi><vi|.

Expanding out the tensor product from the embedding representation, we see this is precisely the quantum version of our covariance matrix. We can now turn this into a problem of finding the eigenvalues of a unitary matrix with the formula

U = e^(iρt).

The algorithm can now use phase estimation and other tricks to get to the eigenvectors and eigenvalues. For the serious details see references [7] and [10] above.

### Neural Networks: RBMs and GANs

Papers [5] and [7] discuss approaches to quantum versions of restricted Boltzmann machines (RBMs). Unlike the standard feed-forward deep neural network used for classification tasks, RBMs are a type of generative neural network. This means they can be used to build a model of the statistical distribution of observed data, and sampling from this distribution can give you new examples of "similar" data objects: for example, a picture of a human face that is like others but different, or a potential chemical compound for drug-discovery research.

The simplest form of an RBM is a two-layer network: one layer is the visible nodes (v1, v2, …, vm) and the other is the hidden layer of nodes (h1, h2, …, hn). Each visible node is connected to each hidden node, as shown in the figure below. We assume all the data are binary vectors of length m. We use each data sample to predict values for the hidden states using a sigmoid function, a set of weights wi,j connecting visible node i to hidden node j, and offsets ai and bj. What results is a conditional probability for h given v. Then, given a sample of hidden states, we project them back to the visible layer by a similar conditional probability function. The joint probability for v and h is given by

P(v, h) = e^(–E(v, h)) / Z

where the energy function is given by

E(v, h) = – Σi ai vi – Σj bj hj – Σi,j vi wi,j hj

and Z is the normalizing sum over all configurations. This distribution is the Boltzmann or Gibbs distribution. There are several quantum approaches. One is to use simulated annealing, which can be done on a machine like the D-Wave system.
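To make the conditional probabilities concrete, here is a tiny classical Gibbs-sampling sketch of an RBM (an illustration with made-up weights, not code from the quantum papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM: m = 3 visible units, n = 2 hidden units (weights are arbitrary)
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.4],
              [-0.3,  0.2]])
a = np.zeros(3)   # visible offsets
b = np.zeros(2)   # hidden offsets

def sample_hidden(v):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij), then sample each h_j."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.size) < p).astype(float), p

def sample_visible(h):
    """P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j), then sample each v_i."""
    p = sigmoid(a + W @ h)
    return (rng.random(p.size) < p).astype(float), p

v = np.array([1.0, 0.0, 1.0])   # one binary data vector
h, p_h = sample_hidden(v)       # hidden sample and its conditional probabilities
v2, p_v = sample_visible(h)     # stochastic reconstruction of the visible layer
print(p_h)                      # each entry strictly between 0 and 1
```

Alternating these two sampling steps is classical Gibbs sampling; the quantum proposals in [5] aim to draw such samples without the long mixing times.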
The other approach is to exploit a technique known as Gibbs sampling. A quantum computer can draw unbiased samples from the Gibbs distribution, thereby allowing the probabilities P(v,h) to be computed by quantum sampling. It seems this can be done with a relatively small number of qubits, but the reader should consult [5] to get the full picture.

Paper [9] considers the problem of creating quantum generative systems by starting with factor graphs and then constructing a quantum state that represents the distribution associated with the graph. The authors then show that this is conceptually more powerful than the factor-graph approach. RBMs are one of the original forms of neural networks. (One survey article calls them the "Model T" of networks.) They are not used very often now because other generative methods seem to work better. One such method is the Generative Adversarial Network (GAN), in which two networks compete. We discussed GANs in a previous post. One network, called the generator, produces sample data whose job is to fool the other network, the discriminator. The discriminator's job is to tell the real data samples from the fakes. When the system converges, the discriminator is correct half the time and wrong half the time. In paper [8] the authors generalize this to the quantum case by proposing three variations: both the generator and discriminator are quantum systems; the discriminator is quantum and the generator is classical; and the discriminator is classical and the generator is quantum. They argue that for high-dimensional data the quantum-quantum case may give an exponential speed-up.

## An Example: Support Vector Machines

To illustrate a simple quantum machine learning algorithm that runs on the IBM system we turn to a "classic" machine learning method known as the Support Vector Machine (SVM).
In general, an SVM can be used as a binary classifier of measurements of experiments where each experiment is represented by its features rendered as a vector, or point, in R^n. For example, these might be measurements of cell images where we want to identify which cells are cancerous, or astronomical measurements of stars where we seek to determine which have planets. To understand the quantum version of the SVM it is necessary to review the derivation of the classical case. In the simplest cases we can find a hyperplane of dimension n – 1 that separates the points into two classes, with the class on one side of the plane having the property we want and the class on the other side lacking it. These points, together with their labels, constitute the training set for the algorithm. In the case n = 2 the plane is a line, and the separator may look like Figure 1a. The hyperplane can be described in terms of a normal vector w and an offset b. To determine whether a point X in R^n is on one side of the plane or the other, we evaluate

f(X) = w · X + b.

If f(X) > 0 then the point is on one side, and if f(X) < 0 the point is on the other side.

Figure 1. a) On the left, a linear separator between the red and the blue points. b) On the right, an example without a linear separator.

The name "support vector" comes from the optimal properties of the separating hyperplane. The hyperplane is selected so that it maximizes its distance from the training examples. The distance of a point to the hyperplane is measured along a vector from the plane to the point parallel to the plane's normal vector. The points whose distance to the plane is minimal are called the support vectors. Because the plane maximizes its distance from both classes, those minima must be equal. A bit of algebra will show that the normal distance between the nearest points on one side of the plane and the nearest points on the other side is 2/||w||.
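The decision function is nothing more than a dot product and a sign check. As a concrete (entirely made-up) instance of f(X) = w · X + b in R^2:

```python
import numpy as np

# Hypothetical separating hyperplane in R^2: normal vector w and offset b
w = np.array([1.0, -1.0])
b = 0.5

def f(X):
    """Positive on one side of the hyperplane, negative on the other."""
    return np.dot(w, X) + b

print(f(np.array([2.0, 1.0])))   # 1.5 -> positive class
print(f(np.array([0.0, 2.0])))   # -1.5 -> negative class
```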
(Hence to maximize the distance we want to minimize ||w||^2/2.) Let {xi, yi}, i = 1..N, be the training points, where yi = –1 if xi is in the negative class (making f(xi) < 0) and yi = 1 if xi is in the positive class (f(xi) > 0). To compute the values for w and b, we first note that because we use only the sign of f(X), we can scale both w and b by the same factor without changing the classification. Hence we can assume

yi (w · xi + b) ≥ 1 for all i,

where equality holds only for the support vector points. To solve this problem of maximizing the minimum distance we invoke a technique from math: we introduce new variables αi ≥ 0 (known as Lagrange multipliers) and compute

max(α) min(w,b) L(w, b, α),  where L(w, b, α) = ||w||^2/2 – Σi αi [ yi (w · xi + b) – 1 ].

As stated above, we need to minimize ||w||/2, so we can minimize its square. Looking at the min term, we take the derivatives of L with respect to w and b and find the minimum occurs when

w = Σi αi yi xi  and  Σi αi yi = 0.

Substituting these back into L, we see we must compute

max(α) [ Σi αi – 1/2 Σi,j αi αj yi yj (xi · xj) ].

A look at L shows that when xi is not a support vector, the maximum cannot happen unless αi = 0. Numerically solving for the remaining αi's, we can compute b from the fact that for each of the K support vectors xk we have

yk (w · xk + b) = 1.

The matrix

Ki,j = xi · xj

is called the kernel matrix. The function f(X) now takes the form

f(X) = Σi αi yi (xi · X) + b.

### The Nonlinear Case and the Kernel Trick

There are obviously many cases where a simple planar separator cannot be found. For example, Figure 1b contains two classes of points where the best separator is a circle. However, that does not mean there is no interesting combination of features that can be used to separate the classes with a plane, if viewed in the right way. Let ϕ: R^n -> M be a mapping of our feature space into a space M of higher dimension. The goal is to find a function that spreads the data out so that we can apply the linear SVM case. All we require of M is that it be a Hilbert space where we can compute inner products in the usual way. If we define the function K(X, Y) = (ϕ(X) · ϕ(Y)), then our function f(X) above becomes

f(X) = Σi αi yi K(xi, X) + b

where the αi's and b are computed in the space M.
To illustrate this idea, let ϕ: R^2 -> R^3 be given by the mapping ϕ(X) = (X[0], X[1], 3 – 0.35·||X||^2) and look at the data from Figure 1b, but now in 3D (Figure 2). As can be seen, the mapping moves the data points onto a paraboloid where the points inside the circle lie in the positive half-space and the rest lie below. It is now relatively easy to find a planar separator in R^3.

Figure 2. Kernel mapping R^2 to R^3 providing a planar separator for the data in Figure 1b.

When does this kernel-function trick work? It turns out that it almost always does, provided you can find the right mapping function. In fact, the mapping function ϕ is not the important part; it is the function K(X, Y). As long as K is symmetric and positive semi-definite, there is a function ϕ such that K(X, Y) = (ϕ(X) · ϕ(Y)) for every X and Y. But from the function f(X) above we see that we only need K and the inner product in M. As we shall see below, M may be derived from quantum states.

## Quantum Support Vector Machines

We will look at the results of using a quantum SVM algorithm that runs on the IBM quantum hardware. The complete mathematical details are in a paper by Vojtech Havlicek, et al., entitled "Supervised learning with quantum enhanced feature spaces" (https://arxiv.org/abs/1804.11326). They describe two algorithms. One is called a variational method; the other is a direct estimation of a quantum version of a kernel function K(x, z). Because the latter method is a tiny bit easier to explain, we will follow that approach. There are two parts to this. We will work with examples with data in R^2.

1. First we construct a function φ from R^2 into the space of positive semi-definite density matrices with unit trace. We need this function to be hard to compute classically so we can preserve the "quantum advantage". We can then create a 2-qubit state from a point x in R^2 as

|Φ(x)> = Uφ(x) (H⊗H) Uφ(x) (H⊗H) |00>.

The transform Uφ(x) is the key to embedding our training and test data into 2 qubits.
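We can check this particular map numerically. The sketch below (my own illustration, using the formula above) shows that the third coordinate of ϕ decreases monotonically with the radius, so any circular class boundary of radius r0 becomes a horizontal plane z = 3 – 0.35·r0^2 in R^3:

```python
import numpy as np

def phi(X):
    """The R^2 -> R^3 feature map phi(X) = (X[0], X[1], 3 - 0.35*||X||^2)."""
    X = np.asarray(X, float)
    return np.array([X[0], X[1], 3.0 - 0.35 * np.dot(X, X)])

# Points inside a circle of radius r0 map above the plane z = 3 - 0.35*r0^2,
# and points outside map below it, so the circle becomes linearly separable.
r0 = 2.0
z_cut = 3.0 - 0.35 * r0 ** 2
inner = [phi(p) for p in [(0.5, 0.5), (1.0, -1.0), (0.0, 1.5)]]   # radius < 2
outer = [phi(p) for p in [(3.0, 0.0), (2.5, 2.5), (0.0, -4.0)]]   # radius > 2
print(all(p[2] > z_cut for p in inner))   # True
print(all(p[2] < z_cut for p in outer))   # True
```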
The exact definition involves selecting a set of nonlinear functions ϕS(x): R^2 -> R, where S ranges over the subsets of the set {1, 2}. Given these functions, the unitary Uφ(x) is defined as

Uφ(x) = exp( i ΣS ϕS(x) Π(i∈S) Zi )

where Zi is the Pauli Z operator on the ith qubit.

2. To create our kernel function we look at the transition amplitudes

<Φ(x) | Φ(z)>.

This is almost the kernel, but not quite. What we define as K(x, z) is related to the fidelities between x and z, |<Φ(x) | Φ(z)>|^2. To get that from the transition amplitudes above we measure R times; if R is sufficiently large, this gives us a good estimate. Once we have computed Ki,j = K(xi, xj) for the training set, we can use the "classical" computer to calculate the support vector coefficients αi and the offset b. To make a prediction f(Z), we then use the quantum computer to calculate K(xi, Z) for each of the support vectors and plug those values into the formula above. The exact mathematical details for deriving φ are, of course, far more complicated, and the reader should look at the full paper.

This example quantum support vector machine is available on-line for you to try out on IBM's system. The code is simple because the details of the algorithm are buried in the qiskit.aqua libraries. To run the algorithm, we create a very small sample dataset from their library "ad_hoc_data" and extract training and testing files. There are two ways to run the algorithm. One way is a three-line invocation of the prepackaged algorithm. Below is the "programmer's" version, which shows a few more details. It shows that we are using the qasm_simulator with 2 qubits, where qubit 0 is connected to qubit 1 on the simulated hardware (this reflects the actual hardware). We create an instance and train it with 1024 "shots" (the R above). We can next print the kernel matrix and the results of the test. Using the programming method, we can directly invoke the predict method on our trained model. We can use this to show a map of the regions of the quantum support "projected" onto the 2-D plane of the sample data.
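The effect of the shot count R on the kernel estimate is easy to simulate classically. The sketch below (my own illustration; the "true" fidelity value is made up) draws R Bernoulli samples with success probability equal to the fidelity and compares the empirical estimate to the true value:

```python
import numpy as np

rng = np.random.default_rng(42)

def estimate_fidelity(true_fidelity, shots):
    """Estimate |<Phi(x)|Phi(z)>|^2 from `shots` repeated measurements."""
    # Each shot "succeeds" with probability equal to the fidelity,
    # mimicking how often the measurement returns the reference outcome.
    hits = rng.random(shots) < true_fidelity
    return hits.mean()

true_k = 0.73   # a made-up kernel entry
for R in (100, 1024, 100000):
    print(R, estimate_fidelity(true_k, R))   # estimates tighten as R grows
```

The statistical error shrinks like 1/√R, which is why the algorithm uses 1024 or more shots per kernel entry.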
We do this by running the predict function on 10,000 points in the plane, plotting the result as a map, and then adding the training points. The resulting image is below. As you can see, the dark blue areas capture the orange data points and the lighter orange areas capture the light blue data points.

It is interesting to compare this result to running the same problem with the scikit-learn SVM library. Using the library is very simple. Converting the data set from the quantum algorithm into vectors X and Y that we can give to the scikit-learn library, we train a classifier whose kernel function is one of the standard kernels: RBF. Plotting the projection of the support surface onto the 2-D plane, we see the image below. The match to the training data is perfect, but in this case the accuracy is only 0.6.

We tried two additional tests. The first was a simple linear partition along the line y = 0.6x – 0.2, with 20 points above the line and 20 below. In this case the quantum computation did not do as well at first, but after several attempts with different data sets it achieved a score of 0.95, and the scikit-learn RBF kernel also scored 0.95. The figure below illustrates the regions captured by both algorithms. We also used another example from their data collection, consisting of breast cancer cases. The data is relatively high dimensional, so they project it onto the 2 principal axes. In this case the quantum algorithm and the scikit-learn RBF kernel both achieved an accuracy of 0.9. After numerous experiments we found that the quantum algorithm was unstable, in that several cases caused it to fail spectacularly (accuracy = 0.4). However, when it worked, it worked very well. The experiments above were with very tiny data sets, selected so the results could be easily visualized. A real test would require data sets 100 to 1000 times larger.
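For reference, the classical side of this comparison takes only a few lines of scikit-learn. This is a generic sketch with synthetic circular data standing in for the datasets used above, not the exact experiment:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the circular data of Figure 1b:
# class 1 inside radius 1, class 0 outside.
X = rng.uniform(-2, 2, size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)

clf = SVC(kernel="rbf")          # the standard RBF kernel
clf.fit(X[:150], y[:150])        # train on the first 150 points
score = clf.score(X[150:], y[150:])
print(score)                     # held-out accuracy, typically high on this data
```

The RBF kernel handles the circular boundary without any explicit feature map, which is exactly the kernel trick at work.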
# Conclusions

Quantum computers are now appearing as cloud-based resources, and when used with algorithms that let the quantum subsystem and a classical computer work together, real breakthroughs may be possible. In the area of AI and machine learning the current work is primarily very theoretical, and we hope we have given the interested reader pointers to some of the recent papers. We took a deep dive into one classical algorithm, support vector machines, and illustrated it with code that runs on the IBM-Q system. Our results were impressive in terms of accuracy, but we did not see a speed-up over the classical algorithm because of the small size and dimensionality of the data sets. The current crop of algorithms will need improvements if they are to show substantial speed-up on real-world problems. In particular, the mapping from real data to quantum states will need to improve. But this will be an area where we can expect to see substantial investment and research energy over the next few years.

# Quantum Computing and the Cloud

Over the last five years cloud computing systems have evolved to be home to more than racks of servers. The need for specialized resources for various classes of customers has driven vendors to add GPU server configurations as a relatively standard offering. The rise of deep learning has seen the addition of special hardware to accelerate both the training and inference phases of machine learning. These include the Google Tensor Processing hardware in the Google cloud and the FPGA arrays on Microsoft's Azure. Quantum computing has now moved from theory to reality: Rigetti, IBM, D-Wave and Alibaba all now have live quantum computing services in the cloud, and Google and Microsoft will follow soon. In the paragraphs that follow we will dive a bit deeper into several of these services, with illustrations of how they can be programmed and used.
In this article we will look at two different systems: the IBM-Q quantum computer and its software stack qiskit, and the Microsoft Q# quantum software platform. We could also have discussed Rigetti, which is similar to IBM; D-Wave is sufficiently different from the others that we will consider it elsewhere. We don't have enough information about Alibaba's quantum project to discuss it.

# The Most Basic Math

As our emphasis here will be on showing you what quantum programs look like, we will not go into the physics and quantum theory behind them, but it helps to have some background. This will be the shallowest of introductions. (You will learn just enough to impress people at a party, as long as that party is not a gathering of scientists.) If you already know this stuff, or you only want to see what the code looks like, skip this section completely. There are many good books that give excellent introductions to quantum computing. (My favorite is "Quantum Computing: A Gentle Introduction" by Rieffel and Polak.)

We must begin with qubits: the basic unit of quantum information. The standard misconception is that a qubit is a probabilistic version of a binary digit (bit). In fact, it is a two-dimensional object described as a complex linear combination of two basis vectors |0> and |1>, defined as

|0> = (1, 0)^t,  |1> = (0, 1)^t.

Using this basis, any qubit |Ψ> can be described as a linear combination (called a superposition) of these basis vectors,

|Ψ> = α|0> + β|1>,

where α and β are complex numbers whose square norms add up to 1.
Because α and β are complex, the real dimension of the space is 4, but the vector is then projected to complex projective space, so that two qubit representatives |Ψ> and |Γ> are the same qubit if there is a complex number c such that

|Ψ> = c |Γ>.

A slightly less algebraic representation is to see the qubit |Ψ> projected onto a sphere, where |Ψ> is defined by two angles θ and φ with

|Ψ> = cos(θ/2)|0> + e^(iφ) sin(θ/2)|1>.

Qubits are strange things: we cannot know the parameters α and β unless we attempt to measure the qubit. Measurement can be thought of as projecting the qubit onto a special set of basis vectors, and each measuring device has its own set of basis vectors. Here we will assume all our measurements are with respect to the standard basis |0> and |1>. Measuring a qubit changes it and results in a classical bit, 0 or 1. The probability that we get 0 is ||α||^2 and the probability that we get 1 is ||β||^2. After a qubit has been measured it is projected onto one of the basis vectors |0> or |1>.

## Basic Qubit Operators

In addition to measurement, there are a number of operators that can transform qubits without doing too much damage to them. (As we shall see, in the real world qubits are fragile. If you apply too many transformations, you can cause a qubit to decohere, and it is no longer usable. How quickly this happens depends on the physical mechanism used to render the qubit.) The basic transformations on a single qubit can all be represented by 2×2 unitary matrices, but it is easier to just describe what they do to the basis vectors and extend that in the obvious way to linear combinations.

• X is the "not" transform. It takes |0> to |1> and |1> to |0>. By linearity, X applied to any qubit is

X(α|0> + β|1>) = β|0> + α|1>.

• H, the Hadamard transform, takes |0> to 1/√2(|0> + |1>) and |1> to 1/√2(|0> – |1>). This transform is used often in quantum programs to take an initial state such as |0> to a known mixed state (called a superposition).
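These single-qubit operators are just small matrices, which we can check directly (a plain NumPy sketch, independent of any quantum SDK):

```python
import numpy as np

ket0 = np.array([1.0, 0.0])   # |0>
ket1 = np.array([0.0, 1.0])   # |1>

X = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # the "not" gate
H = np.array([[1.0,  1.0],
              [1.0, -1.0]]) / np.sqrt(2)     # the Hadamard gate

print(X @ ket0)        # [0. 1.]  -> |1>
print(H @ ket0)        # [0.70710678 0.70710678] -> the equal superposition
print(H @ (H @ ket0))  # back to |0>: H is its own inverse
```

Both matrices are unitary, so they preserve the norm of the state, as every valid qubit transformation must.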
## Multiple Qubits

Quantum computing gets interesting when multiple qubits interact. The way we describe the state space of two qubits is with the tensor product. For two qubits we can describe the state of the system in terms of the basis that is just the tensor product of the 1-qubit basis:

|00>, |01>, |10>, |11>

so that any state of the 2-qubit system can be described as a linear combination of the form

α00|00> + α01|01> + α10|10> + α11|11>.

For three qubits the eight basis elements are |000> through |111>, and the dimension of this 3-qubit space is 2^3 = 8. There are deep mathematical reasons for why the tensor product is the correct formulation, rather than the direct sum of vector spaces as in classical mechanics, but the result is profound: the dimension of an n-qubit system is 2^n. When n = 50, the amount of standard computer memory required to store a single state is 2^50 complex coefficients, or about 16 petabytes (assuming two 8-byte floating-point values per complex coefficient). Hence there are limits to the size of the quantum systems we can simulate on a classical computer. We now have functional quantum computers with 20 qubits and we can simulate 32 qubits, but it may take 50 qubits or more to establish the more promising advantages of quantum computing.

For two qubits there is a standard operation that is often used:

• Cnot, the controlled not. This operation is so called because it can be thought of as using the first (leftmost) qubit in a pair to change the value of the second. More specifically, if the first bit of the pair is 0 then the result is the identity: Cnot(|0x>) = |0x>. If the first bit is 1 then the second bit is flipped (not-ed): Cnot(|10>) = |11>, Cnot(|11>) = |10>.

One of the most interesting things we can do with these operations is to apply the H operator to the first qubit and then Cnot to the result. In tensor-product terms, applying H to the first qubit but not the other is applying the operator H⊗Id to the pair, where Id is the identity operator.
We can now compute Cnot(H⊗Id)(|0>|0>) as

Cnot(H⊗Id)|00> = Cnot( 1/√2(|00> + |10>) ) = 1/√2(|00> + |11>).

The result B = 1/√2(|00> + |11>) is called a Bell state, and it has some remarkable properties. First, it is not the product of two 1-qubit vectors. (Some easy algebra can prove this claim.) Consequently B is a qubit pair that is not simply the co-occurrence of two independent entities: the pair is said to be entangled. Information we can derive from the first qubit can tell us about the second. If we measure the first qubit of the pair we get 0 or 1 with equal likelihood. But if we get 0, then M(B) is transformed to |00>; if we get 1, M(B) becomes |11>. If it is |00>, measuring the second bit will give 0 with 100% certainty; if it is |11>, we will get 1 for the second bit. This is true even if the two qubits are physically separated after they have been entangled, and the fact that measurement of one qubit determines the result of measuring the second leads to amusing arguments about action at a distance and quantum teleportation.

We now have enough of the math required to understand the programs that follow.

# IBM-Q

The IBM system is real and deployed on the IBM cloud. The core computational components are made up of superconducting Josephson junctions, capacitors, coupling resonators, and readout resonators. As shown in Figure 1, the individual qubits are non-linear oscillators. Tuned superconducting resonator channels are used for readout, and the qubits are tied together by additional superconducting channels. IBM has several deployments, each named after one of their research and development labs. They include:

• IBM-Q Tokyo: 20 qubits, available for IBM clients.
• IBM-Q Melbourne: 14 qubits, available for public use.
• IBM-Q Tenerife: 5 qubits, available for public use.
• IBM-Q Yorktown: 5 qubits, available for public use.

Figure 1. A 5-qubit IBM-Q computational unit. Source: IBM-Q website.

In addition, they have a very large simulator in the cloud, the IBM-Q QASM_simulator, with 32-qubit capability.
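This small calculation is easy to verify numerically. The sketch below (a plain NumPy check, independent of any quantum SDK) builds H⊗Id and Cnot as matrices, applies them to |00>, and samples measurements from the resulting state:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Id = np.eye(2)
# Cnot in the basis |00>, |01>, |10>, |11>:
# flips the second bit when the first bit is 1
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

ket00 = np.array([1.0, 0.0, 0.0, 0.0])          # |00>
bell = CNOT @ (np.kron(H, Id) @ ket00)
print(bell)   # amplitudes 1/sqrt(2) on |00> and |11>, zero elsewhere

# Sampling measurements in the standard basis shows only correlated outcomes
rng = np.random.default_rng(1)
outcomes = [str(x) for x in
            rng.choice(["00", "01", "10", "11"], size=1000, p=bell ** 2)]
print(sorted(set(outcomes)))   # ['00', '11'] -- never '01' or '10'
```

The squared amplitudes act as the measurement probabilities, so both bits always agree, exactly the entanglement behavior described above.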
There is much more to the architecture of a complete quantum system, and there are two big challenges. First, the qubit devices must be cryogenically cooled: how do you connect a system running at 15 millikelvin to a room-temperature computing environment? Second, how do you minimize noise to reduce errors? As shown in Figure 2, it takes several thermal layers and superconducting connections to make it happen.

Figure 2. System architecture of the IBM-Q quantum computer. Source: IBM Research.

Signing up to use the IBM-Q public systems is extremely easy: go to the qx community page, sign up, and you can get an access code. The best way to use it is with Python in Jupyter, so download the most recent Anaconda distribution. The IBM-Q qiskit software stack is how we program the system. Ali Javadi-Abhari and Jay M. Gambetta have an excellent series of articles on Medium describing it. The current state of the art in the hardware is what they call "noisy intermediate-scale quantum computers" (NISQ). The software stack is designed to allow researchers to explore several levels of NISQ computing. There are four components.

• Qiskit Terra. This is the core software platform containing all the Python APIs for describing quantum circuits and the interface to submit them to the hardware and simulators. There are many qubit operators in Terra; see this notebook for a good sample.
• Qiskit Aqua. This is the high-level application layer. It contains templates for building advanced applications in areas including chemistry, optimization and AI. (We will discuss this in more depth in part 2 of this article.)
• Qiskit Aer. Qiskit has several different simulators that run in the cloud or locally on your laptop. The first is the unitary_simulator: the standard operations on a quantum circuit are all unitary operations, and the unitary_simulator computes the result of the computation and displays it as a complex unitary matrix.
The statevector_simulator allows you to initialize a multi-qubit system to an arbitrary unit combination of the basis vectors, and it does the simulation in terms of state vectors. The qasm_simulator provides a detailed device-level simulator that takes into account the fact that the hardware is noisy. We will illustrate this simulator below.

## A Quantum "Hello World" Example

Qiskit is a programming system based on compiling Python down to the basic assembly language of the IBM-Q hardware (or to the format needed by one of the simulators). For a simple "hello world" example we will use the demonstration of entanglement described in the mathematical introduction above. We start with two qubits in the |0> state, apply the H transform to the first, and then apply the controlled not (Cnot) operation to the pair. We next measure the first qubit and then the second. If they have become properly entangled, then if the first is measured as a 1 the second will be a 1, and if the first is a 0 the second will be measured as a 0. This code is based on the sample by Jay Gambetta and Ismael Faro.

Using Jupyter, we start by loading the libraries we will use and then load our account information (this was established earlier on the local machine using the IBMQ.save_account('…. key …') operation). With the account information we can inquire about the backend quantum systems we are allowed to use. There are three: two are hardware and one is the cloud-based qasm_simulator. In line [4] we ask for the least busy of the machines, and it is the 16-qubit system.

The next step is to define the program. We declare an array of 2 qubit registers and 2 classical 1-bit registers. As shown below, we create a circuit consisting of these two resources. We apply the H operator to the first qubit and the Cnot to the first and the second. Finally, we measure both. We have not executed the circuit yet, but we can draw it.
There is a standard way quantum circuits are drawn, similar to a musical score. The registers (both quantum and classical) are drawn as horizontal lines, and operators are placed on the lines in temporal order from left to right. As you can see, the H gate is simply represented as a box. The controlled not consists of a dot on the "control" qubit and a circle on the "controlled" qubit. Recall that the value of the control determines what happens to the second qubit. Of course, we can't know the result until we do the measurement, and that is represented with a little dial and a line to the output classical bit.

We next execute our circuit on the ibmq_16_melbourne hardware. We will run it 1000 times so we can get some interesting statistics. The execute command is a type of "future" call: it returns a placeholder for the result, the actual job goes into a queue, and we can monitor its status. When it completes we can get the data. The data is always returned from the experiment as a Python dictionary with keys labeled 0x0 for the basis vector |00>, 0x1 for |01>, 0x2 for |10> and 0x3 for |11>. As we can see from the results below, the hardware is indeed noisy: the two bits are correlated only 92.1% of the time. We can also plot a histogram of the results. Given that our two qubit operations should have created the Bell state 1/√2(|00> + |11>), there should be no |01> or |10> components. But the systems are noisy, and the measurements of qubits produce results defined by probability distributions, so the outcome should not surprise us. Even the initial state of a qubit may be |0> with only 99.9% accuracy; that means it is occasionally |1>.

The IBM-Q qasm_simulator can provide a model of the execution at the device level. To use it we extract data about the device we are going to simulate. We can get very low-level details about the device and its measured noise characteristics.
We can also get the coupling_map, which tells us how the individual qubit cells are connected on the chip. Using this device data we can invoke the simulator. As can be seen below, we get results that are very similar to the actual experiment. If we had used one of the other simulators, we would see only the theoretically perfect results.

# Microsoft Q# Quantum Programming Toolkit

Perhaps the greatest challenge when building a quantum computer is designing it so that it is stable in the presence of noise. Qubits that are too fragile will experience decoherence if they are subject to prolonged episodes of noise while they are undergoing the transformations required by a quantum algorithm. The depth of a quantum algorithm is the count of the longest path of operations in the circuit. Based on the intrinsic error characteristics of the devices and the noise, there may be a limit of a few tens of thousands of operations on the circuit depth before decoherence is likely. For some algorithms, such as those involving iteration, this limit may make them unusable. One way to solve this is to introduce error correction through massive redundancy.

Microsoft has been taking a different approach to building a qubit, one that, if successful, will yield a much more robust system without as much need for error correction. Called a topological qubit, it is based on different physics. Topology is the branch of mathematics concerned with the properties of objects that are not changed when they are perturbed or distorted. For example, a torus is a two-dimensional object that cannot be deformed into a sphere without ripping the surface of the torus. But the torus can be deformed into various other shapes, such as the surface of a coffee mug, with no such tearing. Or consider points on a 1-D line as beads on a string. We cannot change their order unless we can move them from the one-dimensional line to a second dimension and back to the line.
This braided structure is a topological constraint that is a global property and therefore very robust. If you can make a qubit from this property, it would be impervious to minor noise. In condensed matter physics, the 2016 Nobel prize was awarded to David Thouless, Duncan Haldane and Michael Kosterlitz for their work in understanding the strange behavior of matter when restricted to thin films. Their discovery demonstrated that the behavior had to do with the topology of 2-D surfaces. A similar discovery had to do with chains of atoms on a thin (1-D) superconducting wire. The properties of the pair of objects at the ends of the chain were tied to the whole of the chain and not subject to minor local perturbations. Microsoft uses a similar idea to construct their “topological qubits”, made by splitting electrons to form “Majorana fermion quasi-particles”. Situated at the opposite ends of topological insulators, they are highly noise resistant. This implies that one does not need the massive redundancy in the number of qubits required for error correction in many other qubit models. Of course, the above description does not tell us much about the exact nature of their process, but several interesting theory papers exist.

## The Q# programming environment.

The first iteration of a quantum computing software platform from Microsoft was based on the F# functional programming language in 2014 and called LIQUi|> (see Wecker and Svore). The current version is based on C# and is nicely embedded in Visual Studio and Visual Studio Code. You can download it here for Windows 10, Mac and Linux. The installation is straightforward. There is also a Python binding, but we will look at the Visual Studio version here. The Q# programming language is designed as a hybrid between quantum operations on qubits and classical procedural programming designed to operate on a digital computer that contains the quantum device as a co-processor.
Q# extends C# by introducing a number of new standard types, including Qubit, Result (the result of a qubit measurement) and unit (indicating an operator returns no result), as well as several additional operators. A complete description of the operational semantics is given at the Q# link above. For our purposes here, any Java or C# programmer should be able to follow the code. There are two standard libraries and a set of research libraries. The standard libraries include:

• Prelude: the collection of logic, libraries and runtime specific to a particular quantum computer architecture.
• Canon: the hardware-independent library of primitive operators that can be used as part of quantum algorithm design.

There is also a set of excellent standard libraries that cover important topics like amplitude amplification, quantum Fourier transforms, iterative phase estimation and other topics more advanced than this article can cover, as well as a set of research libraries with a focus on applications in chemistry and quantum chemistry.

## Hello World in Microsoft’s Q#

We begin by creating a new C# project with the “file->new->project” menu. If Q# has been correctly installed, you can select “Q# application” from the list of C# configurations and fill in the name for the project. In our case we are using “BellTest”, and the system now shows the following. This is almost exactly what you will find in the introduction when you download the kit. The file called operations.qs contains the main part of our algorithm. We will have two operations: one for initializing a quantum variable and one for the bulk of our algorithm.

The operation Set takes a qubit and a desired value (0 or 1) and sets the qubit to have that value. It does this as follows. First we measure the qubit. Recall that measurement (in the standard basis) returns a 0 or a 1 and projects the qubit to either |0> or |1>. If the measurement is 0 and |1> is desired, we use the Not gate (X) to flip it.
If a 1 is measured and 1 is desired, no change is made, and so on.

The operation BellTest takes an iteration count and an initial value we will use for qubit 1. The code is essentially identical to the example we described for Qiskit. We start with an array of 2 qubits. We set qubit 0 to |0> and qubit 1 to |initial>. Next we apply H to qubit 0 and Cnot to the pair as before. We measure qubit 0 and compare it to the measurement of qubit 1. If they agree, we increment a counter. If the measurement is 1, we also count that. This process is repeated count times, and the operation returns our counter values.

The main program calls BellTest two different ways: once with the initial value of qubit 1 set to 0 and once with it set to 1. In the case that both qubits are initialized to 0, we saw from the mathematical introduction that the state after the Cnot should be the Bell state 1/√2(|00> + |11>). Consequently, after the measurements both qubits should always agree: if one is 0 the other is 0, and if one is 1 the other is 1. However, if qubit 1 is initialized to 1, the situation is different. Evaluating the state mathematically, the state after the Cnot is 1/√2(|01> + |10>). Hence, we should see that the two qubits never agree when measured.

The main program that drives this experiment runs BellTest 1000 times with each configuration of qubit 1. The results, shown below, agree with the theoretical result. The only probabilistic effect is the count of the number of zeros/ones measured. Unlike the qasm_simulator, no noise was introduced into the initialization, because the Q# simulator was not modeling a specific hardware configuration.

# Conclusion

The near-term future of quantum computers will be as co-processors tightly integrated as cloud services. While IBM is now ready to start selling their systems into private clouds, others like Google and Microsoft will probably stay with a cloud service offering. This short paper was intended to illustrate two approaches to programming quantum computers.
It is certainly not sufficient to begin any serious quantum algorithm development. Fortunately, there is a ton of great tutorial material out there. There is also a great deal of exciting research that remains to be done. Here are a few topics.

1. Quantum compiler optimization. Given the problem of qubit decoherence over time, it is essential that quantum algorithms terminate in as few steps as possible. This is classic compiler optimization.
2. Efficient error correction. If you have a great quantum algorithm that can solve an important problem with 100 qubits, but the error correction requires 1000 qubits, the algorithm may not be runnable on near-term machines.
3. Breakthrough algorithmic demonstrations. “Quantum supremacy” refers to concrete demonstrations of significant problem solutions on a quantum computer that cannot be duplicated on a classical machine. In some cases this is argued to be algorithms that are “exponentially” faster than the best classical algorithm. However, good quantum algorithms may lead to the discovery of quantum-inspired classical algorithms; for example, here is one for recommendation systems.

Part 2 of this report will address the application of quantum computing to AI and machine learning. This is both a controversial and fascinating topic.

# Julia Distributed Computing in the Cloud

### Abstract.

This brief note is intended to illustrate why the programming language Julia is so interesting to a growing number of computational and data scientists. Julia is designed to deliver high performance on modern hardware while retaining the interactive capabilities that make it well suited for Jupyter-style scientific exploration. This paper illustrates, with some very unscientific examples, how Julia can be deployed with Docker and Kubernetes in the cloud.

# Introduction

In all our previous posts we have used Python to build applications that interact with cloud services. We used Python because everybody knows it.
However, as many scientists have now discovered, for scientific applications the programming language Julia is a better alternative. This note is not a Julia tutorial; rather, it is intended to illustrate why Julia is so interesting. Julia was launched in 2012 by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman and has gained a growing collection of users in the science community. There is a great deal of Julia material online and a few books (and some of it is up to date). In a previous blog post we looked at Python Dask for distributed computing in the cloud. In this article we focus on how Julia can be used for those same parallel and distributed computing tasks in the cloud. Before we get to the cloud, some motivational background is in order.

# 1. Julia is Fast!

There is no reason a language for scientific computing must be as dull as Fortran or C. Languages with lots of features like Java, Python and Scala are a pleasure to use, but they are slow. Julia is dynamic (meaning it can run in a read-eval-print loop in Jupyter and other interactive environments), and it has a type system that supports parametric polymorphism and multiple dispatch. It is also garbage collected and extremely extensible. It has a powerful package system, macro facilities and a growing collection of libraries. It can call C and Python directly when needed. And it generates fast code. Julia uses a just-in-time compiler that optimizes your program depending on how you use it. To illustrate this point, consider a simple function of the form

function addem(x)
    # do some simple math with x
    x += x*x
    return x
end

Because we have not specified the exact type of x, this defines a generic function: it will work with arguments of any type for which the math operations used have meaning. But when we invoke it with a specific type, such as int or float, a version of the function is compiled on the fly that is specialized to that type.
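Python has no automatic type specialization, but the contrast can be made concrete with the standard functools.singledispatch decorator, where the per-type versions must be registered by hand. This block is a contrast, not an equivalent of Julia's JIT:

```python
from functools import singledispatch

@singledispatch
def addem(x):
    # generic fallback, used for ints and anything else
    x += x * x
    return x

@addem.register
def _(x: float):
    # a float-specific version; Julia would generate this automatically
    return x + x * x

print(addem(3), addem(3.0))
```

In Julia the compiler creates and caches such specializations on the first call with each argument type, with no registration step.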
This specialization takes a bit of time, but when we invoke the function a second time with that argument type, we use the specialized version. As illustrated in Figure 1, we called the function twice with integer arguments. The second call is substantially faster than the first. Calling it with a floating point argument is slow until the specialized version for floating point variables is created; then the function runs fast.

Figure 1. Using a timing macro to measure the speed of function calls. In step 23 the generic version is called with an integer. The second call uses the version optimized for integers. In steps 25 and 26 the specialized version for floating point numbers is generated and run.

The Julia team has put together a benchmark set of small examples written in several languages. We extracted the benchmarks for C, Python and Julia and ran them. The resulting execution times are shown below. As you can see, the Julia code generation is arguably as good as C (compiled with optimized gcc). Relative to Python it is as much as 50 times faster. Figure 2 makes this more explicit. The only benchmark where Python is within a factor of two is matrix multiply, and that is because Python is using the optimized numpy.linalg libraries.

| Benchmark | python | gcc -O | julia |
|---|---|---|---|
| recursion_fibonacci | 2.977 | 0.000 | 0.037 |
| parse_integers | 2.006 | 0.117 | 0.197 |
| userfunc_mandelbrot | 6.294 | 0.068 | 0.065 |
| recursion_quicksort | 12.284 | 0.363 | 0.364 |
| iteration_pi_sum | 558.068 | 22.604 | 22.802 |
| matrix_statistics | 82.952 | 11.200 | 10.431 |
| matrix_multiply | 70.496 | 41.561 | 41.322 |
| print_to_file | 72.481 | 17.466 | 8.100 |

Figure 2. Speed-up of Julia and C relative to Python.

# 2. Julia Math and Libraries

Doing basic linear algebra and matrix computation in Julia is trivial. The following operations each take less than one second in Jupyter using the package LinearAlgebra.jl.
M = rand(1000, 1000); # generates a 1000 by 1000 matrix of random floats
N = M^-1   # compute the inverse of M
C = M*M'   # M' is the transpose, so the product is symmetric
Q = svd(C) # computes the singular value decomposition of C
Q.S        # the singular values

Sparse matrices are also supported, along with a large number of other matrix creation and manipulation operations.

### Differential Equations

The differential equations package built by a team led by Christopher Rackauckas is one of the most impressive. To illustrate it, we consider an example from their excellent tutorial. The Lorenz attractor is a fascinating example of a chaotic solution to a system of differential equations that model convection. The system involves the evolution in 3D of a system that is characterized by three parameters. As shown below, the equations are expressed exactly as you would describe them mathematically (dx is dx/dt), and the three parameters are sigma, rho and beta. The initial point is (1,0,0) and the system is integrated over [0,100]. The package automatically picks an appropriate solver, and the output is plotted as shown in Figure 3. Running this on a Mac mini took about 5 seconds. We used another package, “Plots”, to render the image.

g = @ode_def LorenzExample begin
    dx = σ*(y-x)
    dy = x*(ρ-z) - y
    dz = x*y - β*z
end σ ρ β
u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(g,u0,tspan,p)
sol = solve(prob)
plot(sol)

Figure 3. Plot of the Lorenz attractor solution. (Yes, Greek Unicode character names are valid.)

Note: Thomas Breloff provides a more explicit integration with a GIF animation in this Plots tutorial. As with the above example, Breloff’s solution works well in Jupyter.

# 3. Julia Distributed Computing

Julia has several packages that support parallel computing. These include a library for CUDA programming, OpenMP, Spark and MPI.jl for MPI programming.
MPI is clearly the most reliably scalable parallel computing model for tasks involving 10,000 or more cores. It is based on low-latency, two-sided communication in which coordinated, synchronized send-receive pairs and collective operations are executed in parallel across large clusters equipped with specialized networking. While Julia can be used with MPI, the natural model of communication in Julia is one-sided, based on threads, tasks, futures and channels. We will use the package Distributed.jl to demonstrate this.

The Julia distributed computing model is based on distributing tasks and arrays to a pool of workers. Workers are Julia processes running either on the current host or on other hosts in your cluster. In the paragraphs that follow we show how to launch local workers, then workers on other VMs in the cloud, and finally workers as Docker containers running in a Kubernetes cluster.

Let’s begin with a trivial example. Create a function that flips a coin “n” times and counts the number of heads. Here we generate random Boolean values, convert them to 0 or 1 and add them up. Now let’s add some workers. First we include the “Distributed” package and count the current number of workers. Next, using the “addprocs()” function, we add two new worker processes on the same server running this Jupyter notebook. Workers that our notebook process knows about are identified with an integer, and we can list them. (The notebook process, called here the master, is not a worker.)

The worker processes are running Julia, but they don’t know about the things we have defined in the master process. Julia has a very impressive macro facility that is used extensively in the distributed computing package to distribute code objects to the workers and to launch remote computations. When we create a function in the master that we want executed in the workers, we must make sure it also gets defined in each worker.
We use the “@everywhere” macro to make sure things we define locally are also defined in each worker. We must even tell the workers that we are using the Distributed package. In the code below we created a new version of our count_heads function and distributed it. Julia uses the concept of futures to launch computation on the workers. The “@spawnat” macro takes two parameters: the ID of a worker and a computation to be launched. What is returned is a future: a placeholder for the final result. With “fetch(a)” we can grab the result of the future computation and return it to our calling environment. (Some library functions like “printf()”, when executed on the workers, are automatically mapped back to the master.)

In the following example we create two local workers and define a function for them to execute. Workers each have a unique integer ID, and we print them. Then we use @spawnat to launch the function on each worker in turn. We can easily measure the cost of this remote function evaluation with the @time macro: we enclose the call and fetch in a block and time the block. (Notice we have increased the number of coins from 10^4 to 10^9.) If we want both workers working together, we can compose a list of spawned invocations with a construct of the form

[@spawnat i expr(i) for i in range]

This returns a list (technically, in Julia, an array) of futures. We can then grab the result of each future in turn. The result is shown below. This was run on a dual-core desktop machine, so parallelism is limited, but in this case the parallel version is 1.85 times faster than the sequential calls. The reason that it is not exactly two times faster is partly the overhead in sequentially launching and resolving the futures, but it is also due to communication delays and other OS scheduling delays. A more compact and elegant way to write this parallel program is to use the Julia distributed parallel map function pmap(f, args).
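Both patterns, spawning individual futures and mapping a function over arguments, have rough Python analogues in the standard concurrent.futures module. The sketch below is purely illustrative: Julia workers are separate processes, while this uses a thread pool to stay self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_heads(n):
    """Flip a coin n times and count the heads."""
    rng = random.Random()
    return sum(rng.randint(0, 1) for _ in range(n))

with ThreadPoolExecutor(max_workers=2) as pool:
    # like @spawnat: submit() returns a future, and result() plays the role of fetch()
    futures = [pool.submit(count_heads, 100_000) for _ in range(2)]
    total = sum(f.result() for f in futures)

    # like pmap(f, args): map the function over a list of argument values
    per_task = list(pool.map(count_heads, [100_000, 100_000]))

print(total, per_task)
```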
pmap takes a function, applies it to each element of a set of arguments, and uses the available workers to do the work. The results are returned in an array. In this case count_heads did not need an argument, so we constructed an anonymous function with one parameter to provide to the pmap function. In this execution we were only concerned with dividing the work into two parts and then letting the system schedule and execute them with the available worker resources. We chose 2 parts because we know there are two workers. However, we could have divided the work into 10 parts and applied 10 argument values, and the task would still have been accomplished using the available workers.

## Julia distributed across multiple host machines.

To create a worker instance on another host, Julia uses secure shell (ssh) tunnels to talk to it. Hence you need five things: the IP address of the host, the port that secure shell uses, the identity of the “user” on that host, the private ssh key for that user, and the location of the Julia command on that host. The ssh key pair must be password-less. In this and the next example we simplify the task of deploying Julia on the remote host by deploying our Julia package as a Docker container launched on that host. To make the ssh connection work we have mapped the ssh port 22 on the Docker container to port 3456 on the host. (We describe the container construction and how it is launched in the next section.)

In the previous section we provided “addprocs()” with a single integer representing the number of workers we wanted launched locally. As shown below, the remote version requires a bit more: we supply an array of tuples to addprocs(), where each tuple provides a contact point and the number of workers we want there. In this example we spawn 2 workers on one remote node and one worker on the other. We also provide the local location of the private key (here called pubkey) in the sshflags argument.
We also want each worker to have the Distributed.jl package and another package called “PyCall”, which enables calling Python library functions. We demonstrate the Python call with a call to socket.gethostname() in each worker. Notice that the remote function invocation returns the “print” output to the master process. The strange hostnames that are printed are the synthetic host names from the Docker containers. This example did not address performance. We treat that subject briefly at the end of the next section.

### Channels

In addition to spawning remote tasks, Julia supports a remote channel mechanism. This allows you to declare a channel that can carry messages of a given type and hold them for remote listeners to pick up. In the example below we declare a remote channel that carries string messages and a function defined for the listeners to use. The worker uses the “take!()” function to remove an item from the channel, and the master uses “put!()” to fill the channel. The message “-1” tells the listener to stop listening. Using the channel mechanism, one can have a group of Julia workers respond to a queue of incoming messages. In the GitHub site for this post we have put an example where the workers take the channel messages and put the results into an Azure table.

## Julia Distributed Computing on Kubernetes

Depending upon your favorite flavor of cloud, there are usually three or four ways to spin up a cluster of nodes to run a distributed computation. The most basic is to launch a group of virtual machines where each has the needed resources for the problem you want to solve. For Julia, the best thing to do is launch instances of a “data science VM” that is available on AWS or Azure. There are two things your VM needs: an installation of Julia version 1.0.0 or later and the ability to ssh to it without using a password. Here is how to do this on Azure.
First, run “ssh-keygen”; it will step you through the process of generating a key pair and give you the option of having no password. Then, from the Azure portal, select “create a resource” and search for the Linux data science VM. As you go through the installation process, when it asks for a key, paste in the public key you just generated. When the VM is up you can ssh to it and verify that it has the needed version of Julia installed by typing “julia” at the command line. If it can’t find Julia, you will need to download and install it. While you are logged in and running Julia, you should also install the libraries your distributed program will need: for example, Distributed.jl and PyCall.jl or any package specific to your application. If you have a large cluster of VMs, this is obviously time consuming.

A better solution is to package your worker as a Docker container and then use Kubernetes to manage the scale-out step. We have set up a Docker container, dbgannon/juliacloud2, in the Docker hub that was used for the following experiments. This container is based on jupyter/datascience-notebook, so it has a version of Jupyter and Julia 1.0.0 already installed. However, to make it work as a Julia distributed worker it must be running the OpenSSH daemon sshd, and a password-less key pair has been preloaded into the appropriate ssh directory. We have also installed the Julia libraries Distributed.jl, PyCall.jl and IJulia.jl. PyCall is needed because we need to call some Python libraries, and IJulia.jl is critical for running Julia in Jupyter. We have included all the files needed to build and test this container in GitHub.

## Launching the container directly on your local network

The docker command to launch the container on your local network is

docker run -it -d -p 3456:22 -p 8888:8888 dbgannon/juliacloud2

This exposes the ssh daemon on port 3456 and, if you run Jupyter, the notebook on port 8888.
To connect to the server you will need the ssh key, which is found on the GitHub site. (If you wish to use your own password-less key pair, you will need to rebuild the Docker container using the files in GitHub: just replace pubkey and pubkey.pub and rebuild the container.) To connect to the container and start Jupyter, use the key pubkey.

ssh -i pubkey jovyan@localhost -p 3456
…
jovyan@bdfb4a7068e2:$ jupyter notebook


Notice that when you did this you were prompted to agree to add the ECDSA key fingerprint to your known hosts file. Jupyter will come up on your local machine at http://localhost:8888, and the password is “julia”. If you are running this remotely, replace localhost with the appropriate IP address and make sure port 8888 is open. To launch worker instances, run docker as above (but you don’t need “-p 8888:8888”). When they are up, you will need to ssh to each from the instance running Jupyter. This step is necessary to put the ECDSA key into the known hosts file of your master Jupyter instance.

## Launching the containers from Kubernetes

Now that Kubernetes has become a cloud standard it is relatively easy to create a cluster from the web portal.   The experiments here were completed on a small 5 node cluster of dual core servers.  Once the cluster was up it was easy to launch the individual components from the Kubernetes command line.  Creating the Jupyter controller required two commands: the first to create the deployment and the second to expose it through a load balancer as a service.

kubectl run jupyter --image=dbgannon/jupyter --port=8888
kubectl expose deployment jupyter --type=LoadBalancer


Creating the worker deployment required one line.

kubectl run julcloud --image=dbgannon/juliacloud


(Note: this was done with earlier versions of the Docker containers. Juliacloud2, described above, combines the capabilities of both in one container.)

Once the two deployments were running, we used the Kubernetes web portal to scale the julcloud deployment to 5 different pods, as shown in Figure 4 below.

Figure 4.  Kubernetes deployment of 5 worker pods and the Jupyter pod.

Unfortunately, we still needed to find the IP address of each pod (which can be found on the Kubernetes portal) and ssh to each from the Jupyter controller to add them to the known hosts file there. (It should be possible to automate this step.) Using this configuration we ran two experiments. In the first we created a simple, standard Monte Carlo simulation to compute Pi and ran it by allocating 2 workers per Kubernetes worker pod. The code is shown below.
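The Julia code appeared as a screen shot in the original post. As a stand-in, here is a plain-Python sketch of the same Monte Carlo estimate, written serially; the experiment itself divided the sample count across the distributed workers.

```python
import random

def estimate_pi(n, seed=0):
    """Throw n random points at the unit square; the fraction landing
    inside the quarter circle approximates pi/4."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4.0 * hits / n

print(estimate_pi(1_000_000))
```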

We scaled the number of pods to 10, put two workers per pod, and computed the speedup for N = 8×10^9 and N = 8×10^10. The highly unscientific results are shown in the chart below. Notice that 10 pods and 20 workers is two workers per core, so we cannot realistically expect speedups much beyond 10; the maximum measured speedup of 13.9 with 20 workers likely reflects inefficiency in the single-worker baseline run.

Figure 5.  Speedup relative to one worker when using up to 20 workers and 10 pods on five servers.

The second experiment used 10 workers to pull data from a distributed channel and push it to the Azure table service. The source code for this experiment is in the GitHub repository.
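The channel pattern this experiment relies on (described in the Channels section above) maps naturally onto Python's standard queue module. Below is a thread-based stand-in, for illustration only; the names and the uppercase "work" step are invented, and the real experiment pushed results to an Azure table instead.

```python
import queue
import threading

channel = queue.Queue()   # a thread-local stand-in for the remote channel
results = []

def listener():
    # like the worker's take!() loop: pull messages until the "-1" sentinel
    while True:
        msg = channel.get()
        if msg == "-1":
            break
        results.append(msg.upper())   # stand-in for "process and store the message"

worker = threading.Thread(target=listener)
worker.start()
for msg in ["hello", "from", "the", "master"]:
    channel.put(msg)                  # like put!() in the master process
channel.put("-1")
worker.join()
print(results)
```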

# Conclusion

There is a great deal that has not been covered here. One very important item missing from the above discussion is the Julia distributed array library. This is a very important topic, but judging from the documentation it may still be a bit of a work in progress. However, I look forward to experimenting with it.

One of the applications that inspired me to take a closer look at Julia is the Celeste astronomy project, which used 650,000 cores to reach petaflop performance at NERSC. Julia Computing is now a company devoted to supporting Julia. Their website has many other great case studies.

There are many interesting Julia resources. JuliaCon is an annual conference that brings together the user community. A brief look at the contents of the meeting videos (100 of them!) shows the diversity of Julia applications and technical directions.

# Science Applications of Generative Neural Networks

Machine learning is a common tool used in all areas of science. Applications range from simple regression models used to explain the behavior of experimental data to novel applications of deep learning. One area that has emerged in the last few years is the use of generative neural networks to produce synthetic samples of data that fit the statistical profile of real data collections. Generative models are among the most interesting deep neural networks, and they abound with applications in science. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. But when one looks at the AI literature on generative models, one can come away with the impression that they are, at best, amazing mimics that can conjure up pictures that look like the real world but are, in fact, pure fantasy. So why do we think they can be of value in science?

There are several reasons one would want to use them. One is that the alternative method to understand nature may be based on a simulation that is extremely expensive to run. Simulations are based on the mathematical expression of a theory about the world. And theories are often loaded with parameters, some of which may have known values while others we can only guess at. Given these guesses, the simulation is the experiment: does the result look like our real-world observations? Generative models, on the other hand, have no explicit knowledge of the theory, but they do an excellent job of capturing the statistical distribution of the observed data. Mustafa Mustafa from LBNL states,

“We think that when it comes to practical applications of generative models, such as in the case of emulating scientific data, the criterion to evaluate generative models is to study their ability to reproduce the characteristic statistics which we can measure from the original dataset.” (from Mustafa et al., arXiv:1706.02390v2 [astro-ph.IM], 17 Aug 2018)

Generative models can be used to create “candidates” that we can use to test and fine-tune instruments designed to capture rare events. As we shall see, they have also been used to create “feasible” structures that can inform us about possibilities that were not predicted by simulations. Generative models can also be trained to generate data associated with a class label, and they can be effective in eliminating noise. As we shall see, this can be a powerful tool in predicting outcomes when the input data is somewhat sparse, such as when medical records have missing values.

## Flavors of Generative Models

There are two main types of GMs and, within each type, there are dozens of interesting variations. Generative Adversarial Networks (GANs) consist of two networks, a discriminator and a generator (the bottom part of Figure 1 below). Given a training set of data, the discriminator is trained to distinguish between the training set data and fake data produced by the generator, while the generator is trained to fool the discriminator. Training eventually yields a generator that can create data which matches the data distribution of the samples. The second family are autoencoders. Again, this involves two networks (top in the figure below). One is designed to encode the sample data into a low-dimensional space; the other is a decoder that takes the encoded representation and attempts to recreate the original sample. A variational autoencoder (VAE) is one that forces the encoded representations to fit a distribution that looks like the unit Gaussian. In this way, samples from this compact distribution can be fed to the decoder to generate new samples.

Figure 1.

Most of the commonly cited examples of generative networks involve the analysis of 2-D images based on two opposing convolutional (or similar) networks. But this need not be the case (see “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space” by Anh Nguyen et al., arXiv:1612.00005v2 [cs.CV] 12 Apr 2017).

One fascinating science example we will discuss in greater detail later is by Shahar Harel and Kira Radinsky.  Shown below (Figure 2), it is a hybrid of a variational autoencoder with a convolutional encoder and recurrent neural network decoder for generating candidate chemical compounds.

Figure 2.  From Shahar Harel and Kira Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” (http://kiraradinsky.com/files/acs-accelerating-prototype.pdf).

## Physics and Astronomy

Let’s start with some examples from physics and astronomy.

In statistical mechanics, Ising models provide a theoretical tool to study phase transitions in materials. The usual approach to study the behavior of this model at various temperatures is via Monte Carlo simulation. Zhaocheng Liu, Sean P. Rodrigues and Wenshan Cai from Georgia Tech take a different approach in their paper “Simulating the Ising Model with a Deep Convolutional Generative Adversarial Network” (arXiv:1710.04987v1 [cond-mat.dis-nn] 13 Oct 2017). The Ising states they generate from their network faithfully replicate the statistical properties of those generated by simulation but are also entirely new configurations not derived from previous data.
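For readers unfamiliar with the Monte Carlo baseline the GAN is trained to emulate, it can be sketched in a few lines. This is a generic Metropolis sampler, not the authors' code; the lattice size, temperature and sweep count are arbitrary illustrative choices.

```python
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One Metropolis sweep over a 2-D Ising lattice with periodic boundaries."""
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(0, n, size=2)
        # Sum of the four nearest-neighbor spins (periodic boundaries).
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j] +
              spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        dE = 2.0 * spins[i, j] * nb          # energy change if this spin flips
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1
    return spins

rng = np.random.default_rng(0)
spins = rng.choice([-1, 1], size=(16, 16))       # random high-temperature start
for _ in range(200):
    metropolis_sweep(spins, beta=1.0, rng=rng)   # low temperature: spins order
```

States collected from long runs of such a sampler, at many temperatures, form the kind of training set the GAN learns to reproduce.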

Astronomy is a topic that lends itself well to applications of generative models. Jeffrey Regier et al., in “Celeste: Variational inference for a generative model of astronomical images”, describe a detailed multi-level probabilistic model that considers the different properties of stars and galaxies at the level of the photons recorded at each pixel of the image. The purpose of the model is to infer the properties of the imaged celestial bodies. The approach is based on a variational computation similar to that of the VAEs described later in this article, but far more complex in terms of the number of different modeled processes. In “Approximate Inference for Constructing Astronomical Catalogs from Images” (arXiv:1803.00113v1 [stat.AP] 28 Feb 2018), Regier and collaborators take on the problem of building catalogs of objects in thousands of images. For each imaged object there are nine different classes of random variables that must be inferred. The goal is to compute the posterior distribution of these unobserved random variables conditional on a collection of astronomical images. They formulated a variational inference (VI) model and compared it to a Markov chain Monte Carlo (MCMC) method. MCMC proved to be slightly more accurate on several metrics, but VI was very close; on the other hand, the variational method was 1000 times faster. It is also interesting to note that the computations were done on Cori, the DOE supercomputer, and the code was written in Julia.

Cosmological simulation is used to test our models of the universe. In “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018), Mustafa Mustafa et al. demonstrate how a slightly modified standard GAN can be used to generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations. The results, shown in Figure 3 below, illustrate how the generated images match the validation tests. But, what is more important, the resulting images also pass a variety of statistical tests, ranging from tests of the distribution of intensities to power spectrum analysis. They have made the code and data available at http://github.com/MustafaMustafa/cosmoGAN. The discussion section at the end of the paper speculates about the possibility of producing generative models that also incorporate choices for the cosmological variables that are used in the simulations.

Figure 3.  From Mustafa Mustafa et al., “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018).

## Health Care

Medicine and health care are being transformed by digital technology. Imaging is the most obvious place where we see advanced technology, and our understanding of the function of proteins and RNA has exploded with high-throughput sequence analysis. Generative methods are being used here as well. Riesselman, Ingraham and Marks in “Deep generative models of genetic variation capture mutation effects” consider the problem of how mutations to a protein disrupt its function. They developed a version of a variational autoencoder they call DeepSequence that is capable of predicting the likely effect of mutations as they evolve.

Another area of health care that is undergoing rapid change is health records. While clearly less glamorous than RNA and protein analysis, it is a part of medicine that has an impact on every patient. Our medical records are being digitized at a rapid rate, and once in digital form they can be analyzed by many machine learning tools. Hwang, Choi and Yoon in “Adversarial Training for Disease Prediction from Electronic Health Records with Missing Data” (arXiv:1711.04126v4 [cs.LG] 22 May 2018) address two important problems. First, medical records are often incomplete. They have missing values because certain test results were not correctly recorded, and the process of translating old paper forms to digital artifacts can introduce additional errors. Traditional methods of dealing with this are to fill the gaps with “zero” values or averages prior to analysis, but this is not satisfactory. Autoencoders have been shown to be very good at removing noise from data (see https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543), and Hwang and his colleagues applied this idea to medical records. The second thing they have done is to use a GAN to predict the disease from the “corrected” record. The type of GAN they use is an “AC-GAN” (see https://arxiv.org/pdf/1610.09585.pdf), which incorporates a class label with each training item. This allows a class label, along with the random latent variable, to be given as input so that the generator creates an output similar to training elements of that class. A byproduct is a discriminator that can tell if an input has the correct class label. In their case they are interested in whether a given medical record predicts the occurrence of a tumor. Of course, this is far from usable as a sole diagnostic in a clinical setting, but it is a very interesting technology.
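To make the “fill gaps with an autoencoder” idea concrete, here is a toy numpy stand-in (not the model from the paper): a one-hidden-layer autoencoder trained only on the observed entries of a synthetic record matrix, whose decoder output is then used to fill the missing values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a table of records: 200 rows of 8 correlated numeric features.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8))       # features driven by 2 hidden factors
mask = rng.random(X.shape) > 0.2           # True where a value was recorded
X_obs = np.where(mask, X, 0.0)             # missing entries zero-filled as input

d, h = X.shape[1], 4                       # feature and hidden dimensions
W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, d)); b2 = np.zeros(d)

lr = 0.02
for step in range(4000):
    H = np.tanh(X_obs @ W1 + b1)           # encoder
    X_hat = H @ W2 + b2                    # decoder
    err = (X_hat - X) * mask               # squared loss only on observed entries
    dH = (err @ W2.T) * (1 - H ** 2)       # backprop through tanh
    for p, g in ((W2, H.T @ err), (b2, err.sum(0)),
                 (W1, X_obs.T @ dH), (b1, dH.sum(0))):
        p -= lr * g / len(X)               # full-batch gradient descent

X_hat = np.tanh(X_obs @ W1 + b1) @ W2 + b2
X_filled = np.where(mask, X, X_hat)        # impute missing entries from decoder
```

Because the reconstruction loss is computed only where data exists, the network learns the correlations between features and can propose plausible values for the gaps.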

## Drug Design

One exciting application of these techniques is in the design of drugs. The traditional approach is high-throughput screening, in which large collections of chemicals are tested against potential targets to see if any have potential therapeutic effects. Machine learning techniques have been applied to the problem for many years, but recently various deep learning methods have shown surprisingly promising results. One of the inspirations for the recent work has been the recognition that molecular structures have properties similar to natural language (see Cadeddu, A., et al., “Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses,” Angewandte Chemie 2014, 126). More specifically, there are phrases and grammar rules in chemical compounds that have statistical properties not unlike natural language. There is a standard string representation called SMILES that can be used to illustrate these properties. SMILES representations describe atoms and their bonds and valences based on a depth-first tree traversal of a chemical graph. In modern machine learning, language tasks such as machine translation are aided by recurrent neural networks. As we illustrated in our book, an RNN trained with lots of business news text is capable of generating realistic-sounding business news headlines from a single starting word, although close inspection reveals that the content is nonsense. However, there is no reason we cannot apply RNNs to SMILES strings to see if they can generate new molecules. Fortunately, there are sanity tests that can be applied to generated SMILES strings to filter out the meaningless and incorrectly structured compounds. This was done by a team at Novartis (Ertl et al., “Generation of novel chemical matter using the LSTM neural network,” arXiv:1712.07449), who demonstrated that these techniques could generate billions of new drug-like molecules. Anvita Gupta, Alex T. Müller, Berend J. H. Huisman, Jens A. Fuchs, Petra Schneider and Gisbert Schneider applied very similar ideas in “Generative Recurrent Networks for De Novo Drug Design”. They demonstrated that if they started with fragments of a drug of interest, they could use the RNN and transfer learning to generate new variations that may be very important. Another similar result is from Artur Kadurin et al. in “druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico”.
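The “sanity tests” mentioned above begin with purely syntactic checks on the generated strings. Below is a toy illustration of that first layer; a real pipeline would parse the string with a chemistry toolkit (such as RDKit) and verify valences, which this sketch does not attempt.

```python
def smiles_syntax_ok(s: str) -> bool:
    """Cheap syntactic screen for generated SMILES strings: balanced
    parentheses and brackets, and ring-closure digits that open and close
    in pairs. This catches malformed output, not chemical impossibility."""
    depth = 0           # current '(' nesting depth
    in_bracket = False  # inside a [...] atom specification
    ring_counts = {}    # occurrences of each ring-closure digit
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
        elif ch == '[':
            if in_bracket:
                return False
            in_bracket = True
        elif ch == ']':
            if not in_bracket:
                return False
            in_bracket = False
        elif ch.isdigit() and not in_bracket:
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
    return depth == 0 and not in_bracket and \
        all(n % 2 == 0 for n in ring_counts.values())

print(smiles_syntax_ok("c1ccccc1"))               # benzene: True
print(smiles_syntax_ok("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: True
print(smiles_syntax_ok("C1CC("))                  # unclosed ring and branch: False
```

A generator that emits billions of candidate strings needs exactly this kind of cheap filter before any expensive chemical validation.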

Shahar Harel and Kira Radinsky have a different approach in “Prototype-Based Compound Discovery using Deep Generative Models”. Their model is motivated by a standard drug discovery process that involves starting with a molecule, called a prototype, with certain known useful properties and making modifications to it based on scientific experience and intuition. Harel and Radinsky designed a very interesting variational autoencoder, shown in Figure 2 above. As with several of the other efforts, they start with a SMILES representation of the prototype. The first step is to generate an embedding space for the SMILES “language”. The characters in the prototype sequence are embedded and fed to a layer of convolutions that allow local structures to emerge as shorter vectors, which are concatenated, and a final all-to-all layer is used to generate a sequence of mean and variance vectors for the prototype. This is fed to a “diversity layer” which adds randomness.

The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

## Summary

These are only a small sample of the research on the uses of generative neural networks in science. We must now return to the question posed in the introduction: when are these applications of neural networks advancing science? We should first ask what the role of ‘computational science’ is. It was argued in the 1990s that computing and massive computational simulation had become the third paradigm of science because it was the only way to test theories for which it was impossible to design physical experiments. Simulation of the evolution of the universe is a great example. These simulations allowed us to test theories because they were based on theoretical models: if the simulated universe did not look much like our own, perhaps the theory is wrong. By 2007 Data Science was promoted as the fourth paradigm. By mining the vast amounts of data we generate and collect, we can certainly validate or disprove scientific claims. But when can a network generating synthetic images qualify as science? It is not driven by theoretical models. Generative models can create statistical simulations that are remarkable duplicates of the statistical properties of natural systems. In doing so they provide a space to explore that can stimulate discovery. There are three classes of reasons why this can be important.

• The value of ‘life-like’ samples. In “3D convolutional GAN for fast Simulation” F. Carminati, G.  Khattak, S.  Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation.  But a well-tuned GAN is able to generate the test cases that fit the right statistical model at the rate needed for deployment.
• Medical records-based diagnosis. The work on medical records by Hwang described above shows that using a VAE to “remove noise” is statistically superior to leaving values blank or filling in averages. Furthermore, the ability of their system to predict disease is extremely promising as science.
• Inspiring drug discovery. The work of Harel and Radinsky shows us that a VAE can expand the scope of potential drugs for further study. This is an advance in engineering if not science.

Can it replace simulation for validating models derived from theory?  Generative neural networks are not yet able to replace simulation.   But perhaps theory can evolve so that it can be tested in new ways.

# Part 2. Generative Models Tutorial

Generative Models are among the most interesting deep neural networks, and they abound with applications in science. There are two main types of GMs and, within each type, several interesting variations. The important property of all generative networks is that if you train them with a sufficiently large and coherent collection of data samples, the network can be used to generate similar samples. The key here is the definition of ‘coherent’. One can say the collection is coherent if, when you are presented with a new example, it is a simple task to decide if it belongs to the collection or not. For example, if the data collection consists entirely of pictures of cats, then a picture of a dog should be, with reasonably high probability, easily recognized as an outlier and not a cat. Of course, there are always rather extreme cats that would fool most casual observers, which is why we must describe our collection of objects in terms of probability distributions. Let us assume our collection c is naturally embedded in $R^m$ for some m; for example, images with m pixels or other high-dimensional instrument data. A simple way to think about a generative model is as a mathematical device that transforms samples from a multivariate normal distribution $\mathcal{N}^k(0,1)$ so that they look like they come from the distribution $p_c(X)$ for our collection c. Think of it as a function

$Gen: v \sim \mathcal{N}^k (0,1) \to R^m$

Another useful way to say this is to build another machine we can call a discriminator

$Disc: R^m \to [0,1]$

such that for $X \in R^m$, $Disc(X)$ is the probability that X is in the collection c. To make this more “discriminating”, let us also insist that $Disc(Gen(v))=0$. In other words, the discriminator is designed to discriminate between the real c objects and the generated ones. Of course, if the generator is really doing a good job of imitating $p_c$, then a discriminator with this condition would be very hard to build. In that case we would expect $Disc(Gen(v))=1$.

## Generative Adversarial Networks

Generative Adversarial Networks (GANs) were introduced by Goodfellow et al. (arXiv:1406.2661) as a way to build neural networks that can generate very good examples that match the properties of a collection of objects. The method works by designing two networks: one for the generator and one for the discriminator. Define $p_z$ to be the distribution of latent variables that the generator will map to the collection space. The idea behind the paper is to simultaneously design the discriminator and the generator as a two-player min-max game.

The discriminator is being trained to recognize objects from c (thereby pushing $Disc(X)$ toward 1 for $X \sim p_c$) and to push $Disc(Gen(v))$ to zero for $v \sim p_z$. The resulting function

$$E_{X \sim p_c}[\log Disc(X)] + E_{v \sim p_z}[\log(1 - Disc(Gen(v)))]$$

represents the min-max objective for the discriminator, which is trained to maximize it.

On the other hand, the generator wants to push $Disc(Gen(v))$ to 1, thereby maximizing $E_{v \sim p_z}[\log Disc(Gen(v))]$. To do that it minimizes

$$E_{v \sim p_z}[\log(1 - Disc(Gen(v)))].$$

There are literally dozens of implementations of GANs in TensorFlow or Keras on-line. Below is an example from one that works with 40×40 color images. This fragment shows the step of setting up the training optimization.

```python
#These two placeholders are used for input into the generator and discriminator, respectively.
z_in = tf.placeholder(shape=[None,128],dtype=tf.float32) #Random vector
real_in = tf.placeholder(shape=[None,40,40,3],dtype=tf.float32) #Real images
Gz = generator(z_in) #Generates images from random z vectors
Dx = discriminator(real_in) #Produces probabilities for real images
Dg = discriminator(Gz,reuse=True) #Produces probabilities for generator images
#These functions together define the optimization objective of the GAN.
d_loss = -tf.reduce_mean(tf.log(Dx) + tf.log(1.-Dg)) #This optimizes the discriminator.
g_loss = -tf.reduce_mean(tf.log(Dg)) #This optimizes the generator.
tvars = tf.trainable_variables()
#The below code is responsible for applying gradient descent to update the GAN.
trainerD = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)
trainerG = tf.train.AdamOptimizer(learning_rate=0.0002,beta1=0.5)

#Only update the weights for the discriminator network.
d_grads = trainerD.compute_gradients(d_loss,tvars[9:])
#Only update the weights for the generator network.
g_grads = trainerG.compute_gradients(g_loss,tvars[0:9])
update_D = trainerD.apply_gradients(d_grads)
update_G = trainerG.apply_gradients(g_grads)
```



We tested this with a very small collection of images of galaxies found on the web.  There are three types: elliptical, spiral and barred spiral.  Figure 4 below shows some high-resolution samples from the collection.

(Note: the examples in this section use pictures of galaxies but, in terms of the discussion in the previous part of this article, they are illustrations only. There are no scientific results here, just algorithm demonstrations.)

Figure 4.  Sample high-resolution galaxy images

We reduced the images to 40 by 40 and trained the GAN on this very small collection.  Drawing samples at random from the latent z-space we can now generate synthetic images.  The images we used here are only 40 by 40 pixels, so the results are not very spectacular.  As shown below, the generator is clearly able to generate elliptical and spiral forms.  In the next section we work with images that are 1024 by 1024 and get much more impressive results.

Figure 5.   Synthetic Galaxies produced by the GAN from 40×40 images.

## Variational Autoencoders

The second general category of generative models is based on variational autoencoders. An autoencoder transforms our collection of object representations into a space of much smaller dimension in such a way that the representation can be used to recreate the original object with reasonably high fidelity. The system has an encoder network that creates the embedding in the smaller space and a decoder which uses that representation to regenerate an image, as shown below in Figure 6.

Figure 6. Generic Autoencoder

In other words, we want $Dec(Enc(X_i))$ to approximate $X_i$ for each i in an enumeration of our collection of objects. To train our networks we simply want to minimize the distance between $X_i$ and $Dec(Enc(X_i))$ for each i. If we further set up the network inputs and outputs so that they are in the range [0, 1], we can model each component as a Bernoulli distribution, so cross entropy is a better function to minimize. In this case the cross entropy can be calculated as

$$-\sum_j \big[ x_j \log \hat{x}_j + (1 - x_j) \log (1 - \hat{x}_j) \big]$$

where the $x_j$ are the components of $X_i$ and the $\hat{x}_j$ are the components of $Dec(Enc(X_i))$.

(see http://www.godeep.ml/cross-entropy-likelihood-relation/  for a derivation)

A variational autoencoder differs from a general one in that we want the encoder to create an embedding that is very close to a normal distribution in the embedding space. The way we do this is to make the encoder force the encoding into a representation consisting of a mean and standard deviation. To keep it in a reasonably compact space, we force this distribution to be as close to the unit Gaussian $\mathcal{N}(0,1)$ as possible. To do that we need a way to measure how far a distribution p is from a Gaussian q. That is given by the Kullback-Leibler divergence, which measures how many extra bits (or ‘nats’) are needed to convert an optimal code for distribution q into an optimal code for distribution p.

If both p and q are Gaussian this is easy to calculate (though not as easy to derive):

$$KL(p \,\|\, q) = \log\frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$$

In terms of probability distributions we can think of our encoder as $q(z|x)$, where x is a training image. We are going to assume the prior $p(z)$ is the unit Gaussian and let $q(z|x)$ be a Gaussian $\mathcal{N}(\mu(x), \sigma(x))$ parameterized by the vectors $\mu(x)$ and $\sigma(x)$ produced by the encoder. Computing $KL(q(z|x) \,\|\, p(z))$ is now easy. We call this the Latent Loss, and it is

$$KL(q(z|x) \,\|\, \mathcal{N}(0,1)) = -\frac{1}{2} \sum_k \left(1 + \log \sigma_k^2(x) - \mu_k^2(x) - \sigma_k^2(x)\right)$$

We now construct our encoder to produce $\mu(x)$ and $\sigma(x)$. To sample from this latent space, we simply draw $\epsilon \sim \mathcal{N}(0,1)$ and transform it into the right space with $z = \mu(x) + \sigma(x) \odot \epsilon$. Our encoder and decoder networks can now be linked as follows.

The loss function is now the sum of the two terms: the reconstruction cross entropy and the latent loss,

$$Loss = CrossEntropy(X, Dec(z)) + KL(q(z|X) \,\|\, \mathcal{N}(0,1))$$

Note: there is a Bayesian approach to deriving this; see https://jaan.io/what-is-variational-autoencoder-vae-tutorial for an excellent discussion.
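Numerically, the two-term loss just described can be written out in a few lines of numpy. This is a sketch with made-up inputs; it assumes the network inputs and outputs are scaled to [0, 1] as discussed above.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_sigma2):
    """Bernoulli cross-entropy between input x and reconstruction x_hat,
    plus the closed-form KL divergence between N(mu, sigma) and N(0, 1)."""
    eps = 1e-7  # avoid log(0)
    recon = -np.sum(x * np.log(x_hat + eps) +
                    (1 - x) * np.log(1 - x_hat + eps))
    latent = -0.5 * np.sum(1 + log_sigma2 - mu ** 2 - np.exp(log_sigma2))
    return recon + latent

x = np.array([0.0, 1.0, 1.0, 0.0])          # a 4-"pixel" toy input
x_hat = np.array([0.1, 0.9, 0.8, 0.2])      # an imperfect reconstruction
mu, log_sigma2 = np.zeros(2), np.zeros(2)   # encoding already matches N(0,1)
loss = vae_loss(x, x_hat, mu, log_sigma2)   # latent term is exactly zero here
```

When the encoder output drifts away from the unit Gaussian, the latent term grows and pulls it back, which is what keeps the latent space compact enough to sample from.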

One of the interesting properties of VAEs is that they do not require massive data sets to converge.   Using our simple galaxy photo collection we trained a simple VAE.  The results showing the test inputs and the reconstructed images are illustrated below.

Figure 7.   Test inputs and reconstructions from the galaxy image collection. These images are 1024×1024.

Using encodings of five of the images we created a path through the latent space to make the gif movie that is shown below.  While not every intermediate “galaxy” looks as good as some of the originals, it does present many reasonable “synthetic” galaxies that are on the path between two real ones.

Figure 8.  image from the “movie”

The notebook for this autoencoder is available as html (see https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.html) and as a jupyter notebook (see https://s3.us-east-2.amazonaws.com/a-book/autoencode-galaxy.ipynb )  The compressed tarball of the galaxy images is here: https://s3.us-east-2.amazonaws.com/a-book/galaxies.tar.gz.

## acGANs

The generative networks described above are just the basic variety. One very useful addition is the Auxiliary Classifier GAN (acGAN), which allows you to incorporate knowledge about the class of the objects in your collection into the process. For example, suppose you have labeled images, such that all pictures of dogs are labeled “dog” and all pictures of cats have the label “cat”. The original paper on this subject, “Conditional Image Synthesis with Auxiliary Classifier GANs” by Odena, Olah and Shlens, shows how the generator can be modified so that it takes a class label in addition to the random latent variable and generates a new element similar to the training examples of that class. The training is augmented with an additional loss term that models the class of the training examples.
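In the notation of the Odena, Olah and Shlens paper (paraphrased here), the discriminator outputs both a probability over the source $S \in \{real, fake\}$ of an image and a probability distribution over the class labels $C$. Training combines two log-likelihood terms:

$$L_S = E[\log P(S = real \mid X_{real})] + E[\log P(S = fake \mid X_{fake})]$$

$$L_C = E[\log P(C = c \mid X_{real})] + E[\log P(C = c \mid X_{fake})]$$

The discriminator is trained to maximize $L_S + L_C$, while the generator is trained to maximize $L_C - L_S$: both players want the class to be predicted correctly, but they fight over the real/fake decision.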

There are many more fascinating examples.   We will describe them in more detail in a later post.

# Building a “ChatBot” for Scientific Research

We can use Alexa, Cortana and Siri to check the weather, stream our favorite music and look up facts like “who directed Casablanca?” But if I want to find all the research on quantum entanglement in the design of topological quantum computers, these services fall short. If, in addition, I want these articles cataloged in my personal archive and the citations automatically pulled, I need a much more robust digital assistant. Raj Reddy talks about a broader and more pervasive future for digital assistants he calls Cognition Amplifiers and Guardian Angels. In this post we look at chatbots and their limitations and show how to build a simple, voice-driven scientific metasearch tool we call the research assistant. Finally, we discuss the next phase of the research assistant.

# Smart Speakers and Chatbots.

The revolution in “smart speaker” digital assistants like Siri, Amazon Echo and Google Home is showing us the power of voice as input to smart cloud services. These assistants can take notes, tell us the weather conditions, place on-line orders for us and much more. I even allow Microsoft’s Cortana to read my email: if I send the message “I’ll get back to you tomorrow” to a friend, Cortana will remind me the next day that a response is needed. Amazon allows people to add “skills” (additional capabilities) to their Alexa system. These smart speakers are designed around an open-ended question-response scenario. They leverage very powerful speech-to-text technology and semantic analysis systems to extract a query or command. The query answers are derived from web-data analysis, and the commands are directed to a limited number of native services (or external skills).

A chatbot is a system that engages the user in a dialog. Chatbots go beyond the question-answering smart speakers and are usually designed to help people interact with the services of a specific company or agency. Google, IBM, Amazon and Microsoft have all introduced cloud services to help anybody build a chatbot, and these services guide you through the process of building and training one. A good example is Google’s Dialogflow. Using this tool to create a bot, you specify three things:

• Intents – mappings between what the user says and how you wish the system to respond.
• Entities – the categories of subjects that your bot understands. For example, if your bot is a front end to your clothing store, one category of entity may be product type: shoes, dresses, pants, hats; another entity might be size: small, medium, large, xl; and another might be color.
• Context – knowledge obtained by the bot during the course of the conversation, for example the name of the user or the user’s favorite shoe color.

The goal of the bot is to extract enough information from the user to take a meaningful action such as fulfilling an order. The hard part is designing the intents so that a dialog can lead to the desired conclusion. The resulting system is called an agent, and the flow of action is illustrated in Figure 1. It might start with ‘Good morning, how can I help you?’ and end with a summary of the discussion. As the programmer, you need to supply as many variations on possible user questions and responses as possible, and you must annotate these with markers where your entities can be extracted. Your examples are used in a training phase that maps your intents and entities together and builds a model that will learn variations on the input and not repeat the same sequence of responses each time, so it seems less robotic to the user.
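As a toy illustration of the intent/entity/context split (this is invented code, not Dialogflow's API; a real agent is trained from many annotated examples rather than matched by substring):

```python
# Hypothetical phrase tables for a clothing-store bot; all names are invented.
INTENTS = {
    "order_item":  ["i want to buy", "i would like", "do you have"],
    "check_order": ["where is my order", "order status"],
}
ENTITIES = {
    "product": ["shoes", "dress", "pants", "hat"],
    "size":    ["small", "medium", "large", "xl"],
}

def classify(utterance, context):
    """Map an utterance to (intent, entities) and accumulate context."""
    text = utterance.lower()
    intent = next((name for name, phrases in INTENTS.items()
                   if any(p in text for p in phrases)), None)
    entities = {kind: value for kind, values in ENTITIES.items()
                for value in values if value in text}
    context.update(entities)   # context persists across dialog turns
    return intent, entities

ctx = {}
print(classify("I would like some shoes", ctx))  # ('order_item', {'product': 'shoes'})
print(classify("Large please", ctx))             # (None, {'size': 'large'})
print(ctx)   # the agent has now gathered both the product and the size
```

The accumulated context is what lets the second turn make sense: the agent can fulfill an order once the context holds all the entities the intent requires.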

Amazon’s Lex provides a similar service that also integrates with their lambda services.  Microsoft has the Azure Bot Service and IBM has Watson assistant chatbot tools.

Figure 1. Google Dialogflow Agent architecture.

These tools are all designed to help you build a very focused conversation with the user about a narrow universe, such as the on-line part of a business. But this raises the question: can one build a chatbot that can carry out an open-ended conversation, perhaps one that could pass the Turing test? The research literature on the subject is growing, and deep learning tools like recurrent and convolutional neural networks have been applied to the problem (see https://arxiv.org/pdf/1606.07056.pdf, https://arxiv.org/pdf/1605.05110.pdf and more). Unfortunately, chatbots designed to engage in open-ended conversation have had only limited success. Xiaoice is one that interacts with users on the Chinese microblogging service Weibo. The problem is that while it sounds like conversation, it is mindless. Microsoft’s Tay was an English-language version that operated on Twitter until it was taken down after only 16 hours because of the unfortunate language it had learned. A successor, Zo, seems to be working better, but it does not produce conversations with intellectual content.

There is an excellent pair of articles by Denny Britz about the role of deep learning for conversational assistants. He makes the point that for open-ended conversations (he calls them open-domain) the challenges are large compared to fixed-domain chatbots because so much more world knowledge is required.

# Cognition Amplifiers and the Research Assistant.

In the spring of 2018 Raj Reddy gave the keynote presentation at the IEEE Services Congress. His topic was one he has addressed before, and it clearly resonated with the audience. He described Cognition Amplifiers and Guardian Angels. He defined a Cognition Amplifier (COG) as a Personal Enduring Autonomic Intelligent Agent that anticipates what you want to do and helps you do it with less effort. A Guardian Angel (GAT) is a Personal Enduring Autonomic Intelligent Agent that discovers and warns you about unanticipated events that could impact your safety, security and happiness.

Consider now the application of the Cognition Amplifier to scientific research. If you are working on a research paper, you may wish your autonomic research assistant to provide a fast and accurate search of the scientific literature for a specific list of scientific concepts. In fact, as you write the paper, the assistant should be able to pick up the key concepts or issues and provide a real-time bibliography of related papers, and these should be stored and indexed in a private space on the cloud. Extracting key phrases from technical documents is already a heavily researched field, so applying this technology to the problem is not a great leap. However, key phrase extraction is not the whole challenge. Take the sentence “It seems that these days investors put a higher value on growth than they do on profitability”. The categorical topic is value of growth vs. profitability – investor opinions, which is not simply a key phrase but a concept, and we need the research assistant to look for concepts. Your research assistant should always understand and track the context of your projects.

Finally, a good research assistant for science should be able to help with the analytical part of science. For example, it should help locate public data in the cloud related to experiments involving your topics of interest. The assistant should be able to help formulate, launch and monitor data analysis workflows, and then coordinate and catalog the results.

And, of course, if your autonomous research assistant is also a Guardian Angel, it will also keep you informed of grant reporting deadlines and perhaps pull together a draft quarterly report for your funding agency.

I fully expect that it is possible to build such an agent in the years ahead.   However, the remainder of this article is a simple demo that is a far cry from the full research assistant agent described above.

# The Research Assistant Metasearch Tool.

In the following paragraphs we describe a very simple voice-driven agent that can be used to look for research articles about scientific topics. We also show how such a system can be assembled from various simple devices and cloud services. The system we describe is not very sophisticated; in fact, it is not much better than Cortana at finding things given an English input sentence. However, we feel it does illustrate the architecture of a voice-driven agent that can be built by gluing together easy-to-use cloud services.

Our scenario is a researcher sitting near the device and asking about very specific research topics such as “physical models of immune response” or “programming a topological quantum computer”.    We assume the user wants a spoken response if that response is simple, but we also realize that this is impractical if the system is creating a list of research journal papers.  To address this issue, the system also has a display in a web browser.  (We note that Cortana and the Google Assistant do the same if the response is a list.)

Figure 2 illustrates the basic architecture of the system.

Figure 2.  The architecture of the research assistant.

The components of the system are:

1. The voice-to-text translator. Here we use a simple voice kit from Google.   This consists of some special hardware and a Raspberry Pi 2 computer, all packaged in an elegant cardboard box.  You wake the system up by pressing the button on top and speaking.   The audio is captured and sent to the Google voice service for transcription, and it is returned as a text string.
2. The next step is to parse the text string into the components that will allow us to extract the topics of the query.  This is another cloud service call, this time to Algorithmia.com, where the service is called Key Phrases.  (We wrote about this in a previous article.)  The service takes English sentences, invokes Google’s Parsey McParseface (another Algorithmia.com AI service) and returns a list composed of three types of phrases: subject (s), actions (a) and objects (o).  It also flags prepositional phrases with a “/” character.   So, for example, “I am interested in physical models of immune response” returns

[‘s: I ‘, ‘a: am ‘,  ‘o: interested /in physical models /of immune response.’]

3. The analysis engine is a 500-line Python Dash-based web server that extracts the topics and a few other items of interest and decides how to search and display the results on the browser. There are three web services used for the search: Wikipedia,  Bing and Cornell’s ArXiv service[1].  To see how this works, consider the example sentence “research papers by Michael Eichmair about the gannon-lee singularity are of interest“. The analysis engine detects the topic as the Gannon-Lee singularity and Michael Eichmair as the author.  The fact that research papers are of interest indicates that we should look in the Cornell ArXiv repository of papers.   (The results for this query are at the end of this section.)    (Truth in advertising: our parser and analysis are far from perfect.   For example, “tell me about Deep Learning”   vs. “tell me about deep learning” yield two different parses.  The first yields

[‘a: tell ‘, ‘o: me /about Deep Learning ‘]

which is fine. But the second gives us

[‘a: tell ‘, ‘o: me about deep ‘, ‘a: learning ‘]

which causes the analysis to fail. )

4. Finally, we use the Amazon Lex services to generate the audio reading of the Wikipedia results. If you have an AWS account, the Python API is easy to use.
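The topic-extraction step in the analysis engine can be sketched from the parser output format shown above.  Everything in this snippet (the function name, the preposition list, the filler words dropped) is a hypothetical reconstruction for illustration, not the actual 500-line Dash server code:

```python
# A minimal, hypothetical sketch of extracting a topic from the parser's
# [s:, a:, o:] phrase list.  The phrase format is taken from the example
# output above; the word lists here are illustrative assumptions.

def extract_topic(phrases):
    """Split the object phrase at '/'-flagged prepositions and join the
    remaining noun chunks into a single topic string."""
    for p in phrases:
        p = p.strip()
        if p.startswith('o:'):
            body = p[2:].strip()
            chunks = []
            for part in body.split('/'):
                words = part.strip().rstrip('.').split()
                if not words:
                    continue
                # Drop the leading preposition on '/'-flagged chunks.
                if chunks and words[0] in {'in', 'of', 'about', 'on', 'by'}:
                    words = words[1:]
                chunks.append(' '.join(words))
            # Skip filler words like "interested" and "me".
            content = [c for c in chunks if c not in {'interested', 'me'}]
            return ' '.join(content)
    return None

phrases = ['s: I ', 'a: am ',
           'o: interested /in physical models /of immune response.']
print(extract_topic(phrases))  # physical models immune response
```

A real engine would, of course, also have to detect author names and cues like “research papers”, which is where the fragile cases above come from.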

## Examples

Figure 3 illustrates the interface.  We have started with the statement “I am interested in physical models of immune response.”

Figure 3.   The interface provides a switch to allow the Wikipedia response to be read aloud.   In this case we have typed the English statement of intent into the query box and hit the “Translate and Search” button.

We respond with the phrase “look for it in Wikipedia” and get the result in figure 4.    Following that response, we say “how about research papers” and we get the response in figure 5.

Figure 4.   The response to “look for it in wikipedia”.  A short summary from Wikipedia is extracted along with related images found on the subject.  The spoken response is controlled by the audio control  at the top of the response.

Figure 5.   The mention of research papers suggests that we should consult the Cornell library arXiv.    Shown above is only the first of 10 results listed on the page.

Returning to the example mentioned above “research papers by Michael Eichmair about the gannon-lee singularity are of interest” we get the following results.   You will notice that the Wikipedia result is a default hit for “big bang singularity” and not directly related to the precise query.  The Bing results and the ArXiv hits are accurate.

Figure 6.  Results for the query “research papers by Michael Eichmair about the gannon-lee singularity are of interest”.  (This page was slightly edited to shorten the list of Bing results.)

The system has a limited capability to pull together concepts that are distributed over multiple sentences.  For example the input string “what are anyons?   How do they relate to topological quantum computation?” will build the topic “anyons topological quantum computation”.

If you are interested in trying the system, point your browser here.  I will try to keep it up and running for a few months. There is no voice input because that requires a dedicated Google voice kit on your desk.   You need to decide whether you want audio playback of the Wikipedia summaries.   If you don’t want it, simply press the “Read Aloud” button.  Then enter a query and press the “Translate and Search” button.   Here are some samples to try:

1. what did they say about laughter in the 19th century?
2. are there research papers about laughter by Sucheta Ghosh?
3. what can you tell me about Quantum Entanglement research at Stanford? (this one fails!)
4. what can you tell me about research on Quantum Entanglement at Stanford?
5. what are anyons? How do they relate to topological quantum computation?
6. Who was Winslow Homer? (this one gives lots of images)
7. I am interested in gravitational collapse. (respond with web, Wikipedia or arxiv)

As you experiment, you will find MANY errors.  This toy is easily confused.   Please email me examples that break it.  Of course, feedback and suggestions are always welcome.  I can make some of the source code available if there is interest; however, it is still too unreliable for public GitHub.

[1] There are other arguably superior sources we would like to have used.  For example, Google Scholar would be perfect, but there are legal restrictions on invoking that service from an application like ours.  DBLP is also of interest, but it is restricted to computer science.

# Introduction

This post is based on a talk I prepared for the Scientific Cloud Computing Workshop at HPDC 2018.

Two years ago, Ian Foster and I started writing  “Cloud Computing for Science and Engineering”.   That book covers fundamental cloud tools and computational models, but there are some topics we alluded to but did not explore fully because they were still on the horizon.  In other cases, we were simply clueless about changes that were coming. Data center design and cloud services have evolved in some amazing ways in just two years and many of these changes represent opportunities for using cloud technology in scientific research.

## Whose cloud?

Any  discussion of cloud computing in science leads to the question of definition.  What defines a cloud for science?   For example, the European Open Science Cloud (EOSC) is a European-wide virtual environment for data sharing and collaboration.  That project will involve multiple data archives, research labs and HPC centers, commercial service providers and EU agencies and funding.  It is truly an important project.  However, my focus here is on the fundamental technologies that are driving hardware and software innovation, and these tend to come from a combination of academic, open source and commercial providers.   The most ubiquitous commercial clouds are:

• Amazon Web Services (AWS) – 40% of all commercial cloud resources on the planet,
• Microsoft Azure – about 50% of AWS but growing,
• Google Cloud – a solid and growing third place,
• IBM Bluemix – growing very fast and by some measures now bigger than Google.

There are many more, smaller or more specialized providers: Salesforce, DigitalOcean, Rackspace, 1&1, UpCloud, CityCloud, CloudSigma, CloudWatt, Aruba, CloudFerro, Orange, OVH, T-Systems.

There are a number of smaller cloud systems that have been deployed for scientific research.  They  include Aristotle, Bionimbus, Chameleon, RedCloud, indigo-datacloud, EU-Brazil Cloud,  and the NSF JetStream.  The advantage of these research clouds is that they can be optimized for use by a specific user community in ways not possible in a commercial cloud.  For example, Chameleon is funded by the US NSF to support basic computer systems research at the foundational level which is not possible when the foundation is proprietary.

## Are these clouds of any value to Science?

When the first commercial clouds were introduced in 2008, the scientific community took interest and asked if there was value there.  In 2011 the official answer to this question seemed to be  “no”.  Two papers (see endnote 1) described research experiments designed to address this question.   The conclusion of both papers was that these systems were no match for traditional supercomputers for running MPI-based simulation and modeling.   And, at the time, they were correct.   Early cloud data centers were racks of off-the-shelf PCs, and the networks had terrible bisection bandwidth and long latencies.   They were no match for a proper HPC cluster or supercomputer.

Over the last few years, others have recognized a different set of roles for the cloud in science that go beyond traditional supercomputer simulation.   The biology community was quick to adopt cloud computing, especially when it is necessary to do large-scale analysis on thousands of independent data samples.  These applications range from metagenomics to protein folding.   Each computation fits on a single server, so network bandwidth is not an issue and, using the scale of the cloud, it is easy to launch thousands of them simultaneously.   Hosting and sharing large public scientific data collections is another important application.   Google, AWS, Microsoft and others have large collections and are also providing new ways to host services to explore this data.
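The pattern described above, thousands of per-sample analyses with no communication between them, can be sketched in a few lines.  Here `analyze()` is a placeholder for a real per-sample pipeline (an alignment or folding code, say), and a local process pool stands in for a fleet of cloud servers:

```python
# Toy illustration of the "thousands of independent samples" pattern:
# each sample is analyzed in isolation, so the work can be fanned out to
# as many workers (or cloud servers) as are available.
from concurrent.futures import ProcessPoolExecutor

def analyze(sample):
    # Placeholder for a real per-sample analysis pipeline.
    return sample['id'], sum(sample['reads'])

if __name__ == '__main__':
    samples = [{'id': i, 'reads': list(range(i + 1))} for i in range(100)]
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = dict(pool.map(analyze, samples))
    print(results[99])  # 4950
```

Because no worker needs to talk to any other, scaling from 8 local processes to thousands of cloud instances changes only the scheduler, not the science code.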

However, there are at least three additional areas where the cloud is a great platform for science.

### Analysis of streaming data

Microsoft’s AI for earth project (Figure 1) looks at the application of streaming data from sources such as satellites to do land cover analysis,  sensors on and above farm land to improve agriculture outcomes and crowd sourced data to help understand biodiversity.

Figure 1.  Applications of streaming include land cover analysis, using sensor data for improving agriculture and biodiversity.   From  https://www.microsoft.com/en-us/aiforearth

The internet of things is exploding with sensor data, as are on-line experiments of all types.  This data will be aggregated in edge computing networks that do initial analysis, with results fed to cloud systems for further analysis.   Urban informatics is a topic that has emerged as critical to the survival of our cities and the people who live in them.  Sensors of all types are being deployed in cities to help understand traffic flow, microclimates, local pollution and energy waste.  Taken together, this sensor data can paint a portrait of the city that planners can use to guide its future.  Streaming data and edge computing is a topic that will involve the growing capabilities and architecture of the cloud.  We will return to this later in this document.

### Interactive big data exploration

Being able to explore and interact with data is a critical component of science.   If the data fits on our laptop, we can use tools like Matlab, Excel or Mathematica to conduct small computational experiments and visualize the results.  If the data is too big, it must be stored on something bigger than that laptop.   Traditional supercomputers are up to the task of doing the analysis, but because they are designed around batch computing they are not well suited to interactive use.   The cloud exists to host services that can be used by thousands of simultaneous users.   In addition, there is a new generation of interactive data analysis tools that are cloud-ready for use on very large data collections.  This collection of tools includes Spark and Python Dask.   In both cases these tools can be driven by the open-source Jupyter studio, which provides a powerful, interactive compute and visualization tool.  The commercial providers have adapted Jupyter and its interactive computational model into their offerings.   Google has Cloud Datalab (Figure 2), Amazon uses Jupyter with its SageMaker machine learning platform and Microsoft provides a special data science virtual machine that runs JupyterHub so that teams of users can collaborate.

Figure 2.  Google’s Cloud Datalab allows SQL-like queries to be combined with Python code and visualization in a Jupyter-based web interface. https://cloud.google.com/datalab/
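The core idea behind tools like Spark and Dask can be illustrated without either library: partition data that is too big for memory into chunks, compute a partial result per chunk, and combine.  This is only the conceptual pattern in plain Python; the real tools add a task scheduler that distributes the chunks across a cluster:

```python
# Chunked (out-of-core style) aggregation: the stream is never held in
# memory all at once; only one chunk and the running partials are kept.
def chunked_mean(stream, chunk_size=1000):
    total, count = 0.0, 0
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == chunk_size:
            total += sum(chunk)   # partial result for this chunk
            count += len(chunk)
            chunk = []
    total += sum(chunk)           # leftover partial chunk
    count += len(chunk)
    return total / count

print(chunked_mean(range(10_001)))  # 5000.0
```

In Spark or Dask the per-chunk step runs on many workers at once, which is what makes interactive exploration of terabyte-scale data feel responsive.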

Being able to interact with data at scale is part of the power of the cloud.   As this capability is combined with advanced cloud hosted machine learning tools and other services, some very promising possibilities arise.

### The quest for AI and an intelligent assistant for research

The commercial clouds were originally built to  host web search engines.   Improving those search engines led to a greater engagement of the tech companies with machine learning.   That work led to deep learning which enabled machine language translation,  remarkably strong spoken language recognition and generation and image analysis with object recognition.  Many of these  capabilities rival humans in accuracy and speed.  AI is now the holy grail for the tech industry.

One outcome of this has been the proliferation of voice-driven digital assistants such as Amazon’s Echo, Microsoft’s Cortana, Apple’s Siri and Google Assistant.   When first introduced these were novelties, but their ability to give us local information, do web searches and keep our calendars has improved considerably.   I believe there is an opportunity for science here.

Ask the question “what would it take to make Alexa or Cortana help with my research?”   The following use cases come to mind.

1. Provide a fast and accurate search of the scientific literature for a given specific scientific concept or topic and not just a keyword or specific author. Then ask who is working on or has worked on this topic?  Is there public data in the cloud related to experiments involving this topic?  Translate and transcribe related audio and video.
2. Understand and track the context of your projects.
3. Help formulate, launch and monitor data analysis workflows. Then coordinate and catalog results.  If state-space search is involved, automatically narrow the search based on promising findings.
4. Coordinate meetings and results from collaborators.

If I ask Alexa to do any of this now, she would politely say “Sorry.  I can’t help you with that.”  But with the current rate of change in cloud AI tools, ten years seems like a reasonable timeframe.

Figure 3.  Siri’s science geek grandson.

# Technical Revolutions in the Cloud

Two of the three scenarios above are here now or very close.   The third is some way off.   There have been three major changes in cloud technology in the past five years, and some aspects of these changes are true game-changers for the industry.    The first, and most obvious, is the change in scale of cloud deployments. The two leaders, AWS and Azure, are planetary in scale.  This is illustrated in Figure 4 below.

Figure 4.  A 2016 map of cloud coverage from Atomia.com.  https://www.atomia.com/2016/11/24/comparing-the-geographical-coverage-of-aws-azure-and-google-cloud/ There is some inaccuracy here because AWS and Azure define regional data centers differently, so counting the dots is not a good comparison.   In addition, data centers are now under construction in South Africa and the Middle East.

This map does not include all the data centers run by the smaller cloud providers.

# Cloud Services

A major departure from the early days of the cloud, where scientists focused on storage and servers, has been an explosion in pay-by-the-hour cloud hosted services.  In addition to basic IaaS the types of services available now are:

• App services: Basic web hosting, mobile app backend
• Streaming data: IoT data streams, web log streams, instruments
• Security services: user authentication, delegation of authorization, privacy, etc.
• Analytics: database, BI, app optimization, stream analytics
• Integrative: networking, management services, automation

In addition, the hottest new services are AI machine learning services for mapping, image classification, voice-to-text and text-to-voice services and text semantic analysis.   Tools to build and train voice activated bots are also now widely available.   We will take a look at two examples.

### A planet scale database

The Azure Cosmos DB is a database platform that is globally distributed.   Of course, distributing data across international boundaries is a sensitive topic, so the Cosmos platform allows the database creator to pick the exact locations where copies reside.   When you create an instance of the database, you use a map of Azure data centers to select the locations, as shown in Figure 5.

Figure 5.  Cosmos DB.  A database created in central US and replicated in Europe, South India and Brazil.

The database can support four modes: document, key-value, graph and NoSQL.  In addition, there are five different consistency models the user can select: eventual, consistent prefix, session, bounded staleness and strong consistency, all with a 99.9% guarantee of less than 15 ms latency.  My own experiments validated many of these claims.

### Cloud AI Services

The commercial clouds are in a race to see who can provide the most interesting and useful AI services on their cloud platform.   This work began in the research laboratories of universities and companies over the past 25 years, but the big breakthroughs came when deep learning models trained on massive data collections began to reach levels of human accuracy.  For some time now, the public cloud companies have provided custom virtual machines that make it easy for technically sophisticated customers to use state-of-the-art ML and neural network tools like TensorFlow, CNTK and others.  But the real competition is now to provide services for building smart applications that can be used by developers lacking advanced training in machine learning and AI. We now have speech recognition, language translation and image recognition capabilities that can be easily integrated into web and mobile applications.

We gave this a try with services that use a technique called Transfer Learning to make it possible to re-train a deep neural network to recognize objects from a narrow category using a very small training set.   We chose images of galaxies and used the services of IBM Watson, Azure and Amazon.   Figure 6 illustrates the results from IBM’s tool.  The results were surprisingly good.

Figure 6.  IBM’s Watson recognizing previously unseen images of three different galaxies.  The details of this study are here: https://esciencegroup.com/2018/02/16/cloud-services-for-transfer-learning-on-deep-neural-networks/

# The Revolution in Cloud Service Design

Making all of these services work, perform reliably and scale to thousands of concurrent users forced a revolution in cloud software design.    In order to support these applications, the tech companies needed a way to design them so that they could be scaled rapidly and updated easily.   They settled on a design pattern based on the idea of breaking applications into small stateless components with well-defined interfaces.   Statelessness meant that a component could be replaced easily if it crashed or needed to be upgraded.   Of course, not everything can be stateless, so state is saved in cloud-hosted databases.   Each component was a “microservice”, and it could be built from containers or functions.  This design pattern is now referred to as “cloud native” design.   Applications built and managed this way include Netflix, Amazon, Facebook, Twitter, Google Docs, Azure CosmosDB, Azure Event Hub, Cortana and Uber.

Figure 7.  Conceptual view of microservices as stateless services communicating with each other and saving needed state in distribute databases or tables.

To manage applications composed of dozens to hundreds of concurrently running microservices, you need a software foundation, or container orchestration system, to monitor the services and schedule them on available resources.  Several candidates emerged and are in use. Siri, for example, is composed of thousands of microservices running on the Apache Mesos system.   Recently, cloud providers have settled on a de facto standard container orchestrator built by Google and released as open source: Kubernetes.   It is now extremely easy for any customer to use Kubernetes on many cloud deployments to launch and manage cloud native applications.

## Serverless Functions

The next step in the cloud software evolution was the introduction of “serverless functions”.   The original idea of cloud computing involved launching and managing a virtual machine.  However, suppose you want a cloud-based application whose sole job is to wait for an event to trigger some action.  For example, monitor a file directory and wait for a change such as the addition of a new file.  When that happens, you want to send email to a set of users alerting them to the change.   If this is a rare event, you don’t want to pay for an idle virtual machine that polls some service looking for a change.  Amazon was the first to introduce the concept of function-as-a-service.   With AWS Lambda, you only need to describe the function in terms of the trigger event and the output actions it takes when the event appears.  As illustrated in Figure 8, there are many possible triggers and corresponding output channels.

Figure 8.   From Amazon AWS.   The triggers and outputs of a lambda function.

In addition to AWS Lambda, Azure Functions, Google Cloud Functions and IBM OpenWhisk are similar systems.  OpenWhisk is now open source.  Another open source solution is Kubeless, which allows you to deploy a Lambda-like system on top of your Kubernetes cluster.   These serverless systems let you scale up and down extremely rapidly and automatically.   You can have hundreds of instances responding to events at once, and the cost is based on a charge-by-use model.  AWS has introduced AWS Fargate, which allows any containerized application to run in serverless mode.
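The event-handler model is easy to see in code.  The sketch below follows the AWS Lambda Python handler signature, `handler(event, context)`; the event shape imitates an S3 object-created notification, and `notify()` is a hypothetical stand-in for a real email service call:

```python
# A serverless function is just a handler invoked once per event.
# Everything below can be exercised locally with a fake event --
# no idle virtual machine required.

def notify(recipients, message):
    # Hypothetical placeholder: a real deployment would call an
    # email service here instead of returning the messages.
    return [(r, message) for r in recipients]

def handler(event, context):
    """Lambda-style entry point: react to 'new file' records."""
    sent = []
    for record in event.get('Records', []):
        key = record['s3']['object']['key']
        sent += notify(['team@example.org'],
                       f'New file uploaded: {key}')
    return {'notified': len(sent)}

fake_event = {'Records': [{'s3': {'object': {'key': 'results/run42.csv'}}}]}
print(handler(fake_event, None))  # {'notified': 1}
```

The cloud platform supplies the trigger wiring and scaling; the developer writes only the handler and pays only when events actually arrive.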

## The Edge and the Fog

The frontier of cloud computing is now at the edge of the network.  This has long been the home of content distribution systems, where content can be cached and accessed quickly, but now that the Internet-of-Things (IoT) is upon us, it is increasingly important to do some computing at the edge.  For example, if you have a thousand tiny sensors in a sensitive environment or farm and you need to control water from sprinklers or detect severe weather conditions, it is necessary to gather the data, do some analysis and signal an action.   If the sensors are all sending WiFi messages they may be routable to the cloud, but a more common solution is to provide local computing that can do some event preprocessing and response while forwarding summary data to the cloud.  That local computing is called the Edge or, if it is a distributed system of edge servers, the Fog.

If serverless functions are designed to respond to signals, it should be possible to extend them to run in edge servers rather than the cloud.  AWS was the first to do this with a tool called Greengrass, which provides a special runtime system that allows us to push or migrate Lambda functions or microservices from the data center to the edge.   More recently, Microsoft has introduced Azure IoT Edge, which is built on open container technologies.  Using an instance of the open source Virtual Kubelet deployed on the edge devices, we can schedule Kubernetes containers to run on the edge.  You can think of a Kubelet as the part of Kubernetes that runs on a single node. This enables Kubernetes clusters to span the cloud and the edge, as illustrated in Figure 9.

Figure 9.  A sketch of migrating containerized functions to the edge.   That way our IoT devices can communicate with locally deployed microservices, which in turn can communicate with cloud-based services.  The edge function containers can also be updated and replaced remotely, like any other microservice.

# The Evolution of the Data Center

As mentioned at the beginning of this paper, early (circa 2005) cloud data centers were based on very simple servers, and their networks were designed for outgoing internet traffic rather than the bisection bandwidth needed for parallel computing.   However, by 2008 interest in performance began to grow.  Special InfiniBand sub-networks were installed at some  data centers.  The conventional dual-core servers were replaced by systems with up to 48 cores and multiple GPU accelerators.  By 2011 most of the commercial clouds and a few research clouds had replaced traditional networking with software-defined networking.  To address the demand of some of its customers, in 2017 Microsoft added  Cray® XC™  and Cray CS™ supercomputers to a few data centers and then acquired the company Cycle Computing.

Since 2016 we have seen progress focused on performance and parallelism.   The driver of this activity has been AI and, more specifically, the deep neural networks (DNNs) driving all the new services.  There are many types of DNNs, but two of the most common are convolutional networks, which look like a linear sequence of special filters, and recurrent networks which, as the name implies, are networks with a feedback component.   There are two phases to neural network design and use.  The first is the training phase, which often requires massive parallelism and time, but it is usually an off-line activity.  The second phase is called inference, and it refers to the activity of evaluating the trained network on classification candidates.  In both convolutional and recurrent networks, inference boils down to doing a very large number of matrix-vector and matrix-matrix multiplies, where the coefficients of the matrix are the trained model and the vectors represent the classification candidates.      To deliver the performance at scale that was needed by the AI services, it was essential to do these operations fast.  While GPUs were good, more speed was needed.
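The claim that inference reduces to matrix arithmetic is easy to make concrete.  The toy network below (layer sizes and weight values invented purely for illustration) evaluates two fully connected layers as nothing more than matrix-vector products plus a nonlinearity:

```python
# Inference in a small fully connected network: repeated matrix-vector
# products (trained weights W times activations x) followed by a
# nonlinearity -- exactly the operation TPUs and FPGAs accelerate.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def infer(layers, x):
    for W in layers:            # one matvec per layer
        x = relu(matvec(W, x))
    return x

# Two toy layers of "trained" weights (values are made up).
layers = [[[1.0, -1.0], [0.5, 0.5]],   # 2x2 layer
          [[2.0, 0.0]]]                # 1x2 layer
print(infer(layers, [3.0, 1.0]))  # [4.0]
```

In a production service the matrices are enormous and requests arrive by the thousands, which is why a dedicated matrix engine pays off where a general-purpose CPU does not.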

## Google’s Tensor Processing Unit

In 2015 Google introduced the Tensor Processing Unit (TPU) and the TPU2 in 2017.

Figure 10.  Above, Google data center.  https://www.nextplatform.com/2017/05/22/hood-googles-tpu2-machine-learning-clusters/  Below architecture of Google Tensor Processing Unit TPU. From “In-Datacenter Performance Analysis of a Tensor Processing Unit”, Norman P. Jouppi et al.​  ISCA 2017  https://ai.google/research/pubs/pub46078

Figure 10 illustrates several racks of TPU-equipped servers and the functional diagram of the TPU.  One of the key components is the 8-bit matrix multiply unit capable of delivering 92 TeraOps/second (TOPS).   (It should be noted that DNNs can be trained well at less than IEEE-standard floating-point precision, and floating-point formats with a small mantissa are popular.)  The multiply unit uses a systolic algorithm like those proposed for VLSI chips in the 1980s.
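The reduced-precision point can be illustrated with a symmetric 8-bit quantization sketch.  (The scheme below is a generic textbook one, not the TPU's actual quantizer.)  Floats are scaled into the int8 range, the multiply-accumulate is done in integers, and the result is rescaled:

```python
# Symmetric 8-bit quantization: DNN arithmetic tolerates low precision,
# so the expensive multiply-accumulate can run on small integer units.

def quantize(values):
    """Scale floats into [-127, 127] integers; return (ints, scale)."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def int8_dot(w, x):
    qw, sw = quantize(w)
    qx, sx = quantize(x)
    acc = sum(a * b for a, b in zip(qw, qx))  # integer accumulate
    return acc * sw * sx                      # dequantize the result

w, x = [0.5, -1.0, 0.25], [2.0, 1.0, -4.0]
exact = sum(a * b for a, b in zip(w, x))      # exact float dot = -1.0
print(round(int8_dot(w, x), 2))  # -1.0
```

The quantized result agrees with the exact dot product to well under one percent here, which is the kind of error budget that lets an 8-bit systolic array replace full floating point for inference.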

## Microsoft’s Brainwave

In 2011 a small team in Microsoft Research led by Doug Burger began looking at the use of FPGAs to accelerate the search result ranking produced by Bing.   Over several iterations they arrived at a remarkable design that placed the FPGA between the network and the NIC, so that the FPGAs could be configured into a separate plane of computation that can be managed and used independently of the CPUs (see Figure 11).   Used in this way, groups of FPGAs can be configured into a subnetwork to handle tasks such as database queries and the inference stage of deep learning, in addition to Bing query optimization.

Figure 11.  The Brainwave architecture.

Figure 12. The Brainwave software stack for mapping a DNN to one or more FPGAs. From BrainWave_HOTCHIPS2017.pptx, Eric Chung, et al., https://vneetop.wordpress.com/2017/10/28/accelerating-persistent-neural-networks-at-datacenter-scale/

The team also built a software stack that could make this possibility a reality.   What makes much of this possible is that DNN models are all based on flow graphs that describe sequences of tensor operations.  As shown in Figure 12 above, the flow graphs can be compiled to an internal graph representation that can be split and partitioned across one or more FPGAs.  They refer to the result as a hardware microservice.   Just recently Mary Wall [see endnote 2] wrote a nice blog post about the team’s work on using this idea to do deep learning inference on land cover maps.  Each compiled inference hardware microservice is mapped to a single FPGA, but they used 800 of the inference instances in parallel with 80 VMs to process 20 terabytes of aerial imagery into land cover data for the entire United States.   It took only about 10 minutes for a total cost of \$42 [see endnote 3]. Mary Wall’s code is in the blog and available on GitHub.

# Conclusion

Cloud data centers are getting high performance networks (with latencies of only a few microseconds in the case of Azure Brainwave) and immense computing capacity such as the tensor processing capability of Google’s TPU.  At the same time, designers of the first exascale supercomputers are having to deal with greater failure resilience and complexity.  For the next generation exascale systems, the nodes will be variations on a theme of multicore and GPU-style accelerators.

Observed from a distance, one might conclude that the architectures of cloud data centers and the next generation of supercomputers are converging.  However, it is important to keep in mind that the two are designed for different purposes. The cloud is optimized for fast response for services supporting many concurrent, globally distributed clients. Supers are optimized for exceptionally fast execution of programs on behalf of a small number of concurrent users.   However, an exascale system may be so large that parts of it can run many smaller parallel jobs at once.  Projects like Singularity provide a solution for running containerized applications on supercomputers in a manner similar to the way microservices are run on a cloud.

## Possible Futures

### The continuum: edge+cloud+supercomputer

There are interesting studies showing how supercomputers are very good at training very large, deep neural networks.  Specifically, NERSC scientists have shown the importance of this capability in many science applications[4]. However, if you need to perform inference on models that are streamed from the edge, you need the type of edge+cloud strategy described here.   It is not hard to imagine scenarios where vast numbers of instrument streams are handled at the edge and fed to inference models in the cloud, while those models are continuously improved on a back-end supercomputer.

### A data garden

In the near future, the most important contribution clouds can make to science is to provide access to important public data collections.  There is already a reasonable start.   AWS has an open data registry that has 57 data sets covering topics ranging from astronomy to genomics.   Microsoft Research has a Data Science for Research portal with a curated collection of datasets relating to human-computer interaction, data mining, geospatial, natural language processing and more.  Google Cloud has a large collection of public genomics datasets.  The US NIH has launched three new cloud data and analytics projects.  They include the Cancer Genomics Cloud led by the Institute for Systems Biology with Google’s cloud, FireCloud from the Broad Institute, also using Google’s cloud, and the Cancer Genomics Cloud (CGC) powered by Seven Bridges.   These NIH facilities also provide analytics frameworks designed to help researchers access and effectively use the resources.

I am often asked about research challenges in cloud computing that students may wish to undertake.   There are many.  The fact that the IEEE cloud computing conference held in San Francisco in July received nearly 300 submissions shows that the field is extremely active.   I find the following topics particularly interesting.

1. Find new ways to extract knowledge from the growing cloud data garden.   This is a big challenge because the data is so heterogeneous, and discovering the right tool to explore it requires expert knowledge.  Can we capture that community knowledge so that non-experts can find their way?  What are the right tools to facilitate collaborative data exploration?
2. There are enormous opportunities for systems research in the edge-to-cloud-to-supercomputer path.  How does one create a system to manage and optimize workflows of activities that span this continuum?  Is there a good programming model for describing computations involving the edge and the cloud?  Can a program be automatically decomposed into the parts that are best run on the edge and the parts on cloud?  Can such a decomposition be dynamically adjusted to account for load, bandwidth constraints, etc.?
3. Concerning the intelligent assistant for research, there are a number of reasonable projects short of building the entire thing.  Some may be low-hanging fruit, and some may be very hard.  For example, ArXiv, Wikipedia, Google search and Bing are great for discovery, but in different ways.   Handling a complex query like “what is the role of quantum entanglement in the design of a quantum computer?” should lead to a summary of the answer with links.   There is a lot of research on summarization, and there are many sources of data.  Another type of query is “How can I access data on genetic indicators related to ALS?”  Google will point you in the right direction, but it takes more digging to find the data.
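As one small, concrete instance of the first challenge, imagine a community-annotated catalog over the data garden that lets a non-expert find both datasets and the tools the community uses with them. The sketch below is entirely invented for illustration: the catalog entries, keywords and tool annotations are not a real registry, and a real system would crawl the actual cloud data portals and learn the rankings.

```python
# Hypothetical keyword-based discovery over a "data garden" catalog.
# The entries and tool annotations are illustrative placeholders only.
CATALOG = [
    {"name": "Landsat 8 imagery",
     "keywords": {"satellite", "imagery", "geospatial", "land"},
     "tools": ["GDAL", "rasterio"]},
    {"name": "1000 Genomes",
     "keywords": {"genomics", "variants", "human"},
     "tools": ["samtools", "GATK"]},
]

def discover(query, catalog=CATALOG):
    """Rank datasets by keyword overlap with the query.

    Returns (dataset name, community-recommended tools) pairs,
    best match first; datasets with no overlap are omitted."""
    terms = set(query.lower().split())
    scored = []
    for entry in catalog:
        score = len(terms & entry["keywords"])
        if score:
            scored.append((score, entry["name"], entry["tools"]))
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest overlap first
    return [(name, tools) for _, name, tools in scored]
```

For example, `discover("human genomics variants")` would surface 1000 Genomes along with samtools and GATK. The interesting research questions start where this toy stops: capturing community expertise automatically rather than hand-curating it.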

These are rather broad topics, but progress on even the smallest part may be fun.

[1] L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan, I. Sakrejda, N. Desai, T. Declerck, and A. Liu. Magellan: Experiences from a science cloud. In 2nd International Workshop on Scientific Cloud Computing, pages 49–58. ACM, 2011.

[2] P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and R. Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications. In 3rd Workshop on Scientific Cloud Computing, pages 41–50. ACM, 2012.

[3] https://blogs.microsoft.com/green/2018/05/23/achievement-unlocked-nearly-200-million-images-into-a-national-land-cover-map-in-about-10-minutes/  from Lucas Joppa, Chief Environmental Scientist, Microsoft