# eScience 2050: A Look Back

Abstract— This was originally written as an invited “vision” presentation for the eScience 2019 conference, but I decided to present something more focused for that event. Never wanting to let words go to waste I decided to put a revised version here.   It was written as a look back at the period of eScience from 2019 to 2050.   Of course, as it is being published in 2019, it is clearly a work of science fiction, but it is based on technology trends that seem relatively clear. Specifically, I consider the impact of four themes on eScience: the explosion of AI as an eScience enabler, quantum computing as a service in the cloud, DNA data storage in the cloud, and neuromorphic computing.

# I.  Introduction

Predictions of the future are often so colored by the present that they miss the boat entirely.   The future, in these visions, tends to look like today but a bit nicer. Looking at images of the future from the 1950s we see pictures of personal helicopters (Figure 1) that clearly missed the mark.   But in some cases, the ideas are not so far off, as in the case of the woman ordering shirts “online” as in Figure 2.

Fig 1. Future personal transportation. Mechanics Illustrated, Jan 1951

Fig 2. Shopping “online” vision.

To look at the evolution of eScience in the period from 2019 to 2050, it can help to look at the state of the art in computing in 1960 and ask how much of todays technology can we see emerging there.   My first computer program was written when I was in high school. It was a Fortran version of “hello world” that consited of one line of Fortran per punched card as shown in Figure 3 and the entire program consisted of a deck of 12 cards .

Fig 3.   The Fortran statement CALL RCLASS(AAA,21,NNC, PX3,PX4)

If asked to predict the future then, I might have said that software and data in the years ahead would be stored on a punched paper tape over ten miles long.

Fifty years ago, the high points of computing looked like:

• The programming language FORTRAN IV was created in 1961.
• IBM introduced its System/360. A monster machine.
• Gordon Moore makes an observation about scale in 1965 (that became Moore’s law).
• IBM created the first floppy disk in 1967.
• The SABRE database system that was used by IBM to help American Airlines manage its reservations data.
• In 1970 Edgar Codd came up with Relation Database Idea. It was not implemented until 1974.
• The Internet: a UCLA student, tries to send “login,” the first message over ARPANET, at 10:30 P.M. on October 29, 1969.  (The system transmitted “l” and then “o” …. and then crashed.)

What we can extrapolate from this is that mainframe IBM computers will continue to dominate, databases are going to emerge as a powerful tool and there will be a future for computer-to-computer networking, but PCs and web search were not remotely visible on the horizon. The profound implication of Moore’s law would not be evident until the first microprocessors appear in 1979 and the personal computer revolution of the 1980s followed. Those same microprocessors led to the idea that dozens of them could be packaged into a “parallel computer”. This resulted in a period of intense research in universities and startups to build scalable parallel systems. While the first massively parallel computer was the Illiac IV which was deployed at NASA Ames in 1974, it was the work on clusters of microprocessors in the 1980s [1] that led to the supercomputers of 2019. Looking at computing in 1970, few would have guessed this profound transition in computing from mainframes like the System/360 to the massively parallel systems of 2019.

# II. eScience from 2019 Forward

### A.   The Emergence and Evolution of the Cloud

The World Wide Web appeared in 1991 and the first search engines appeared in the mid 90s. As these systems evolved, so did the need to store large segments of the web on vast server farms. Ecommerce also drove the need for large data centers. These data centers quickly evolved to offering data storage and virtual machines as a service to external users.   These clouds, as they became known, were designed to serve thousands to millions of concurrent clients in interactive sessions.

For eScience, clouds provided an important alternative to traditional batch-oriented supercomputers by supporting interactive, data exploration and collaboration at scale. The cloud data centers evolved in several important way. Tools like Google’s Kubernetes was widely deployed to support computation in the form of swarms of microservices built from software containers. Deploying Kubernetes still required allocating resources and configuring systems and microservices are continuously running processes. Serverless computing is a newer cloud capability that avoids the need to configure VMs to support long running processes when they are not needed.

Serverless computing is defined in terms of stateless functions that respond to events such as signals from remote instruments or changes in the state of a data archive. For examples, an eScience workflow can be automatically triggered when instrument data arrives in the data storage archive. The analysis in the workflow can trigger other serverless function to complete additional tasks.

The Cloud if 2019 is defined in terms of services and not raw hardware and data storage. These services include

• Streaming data analytics tools that allow the users to monitor hundreds of live streams from remote instruments.   Edge computing services evolved as a mechanism to off-load some of the analytics to small devices that interposed between the instruments and the cloud.
• Planet scale databases allow a user to store and search data collections that are automatically replicated across multiple continents with guaranteed consistency.
• Services to support applications of machine learning became widely available across all the commercial public clouds. These services include speech recognition and generation, automatic language translation, computer vision tasks such as image recognition and automatic captioning.

The architecture of the cloud data centers continued to evolve toward ideas common in modern supercomputers. The first major change was the introduction of software defined networking throughout the data centers. Amazon began introducing GPU accelerators in 2012. This was followed by Google’s introduction of TPU chips to support the computational needs of deep learning algorithms coded in the Tensorflow system. The TPU (figure 4) is a device designed to accelerate matrix multiplication. It is based on systolic computation ideas of the 1980s and has been applied primarily to Google machine learning projects.

Fig 4.   Google TPU v3 architecture. See N. Jouppi, C. Young, N. Patil,  D. Patterson [2, 3].

The most radical evolutionary step has been taken by Microsoft in the Azure cloud. Project Brainwave represents an introduction of a mesh connected layer of FPGA devices that span the entire data center (figure 5).   These devices can be programed to cooperate on parallel execution of lage computational task including serving deep learning models.Fig 5. Microsoft FPGA based Project Brainwave [4, 5]

### A.   Major Themes Going Forward from 2019

The following pages will describe the four major themes that are going to dominate the ten years forward from 2019.   They are

• The explosion of AI as an eScience enabler.
• Quantum computing as a service in the cloud.
• DNA data storage in the cloud.
• Neuromorphic computing

## III. AI and eScience

Machine Learning went through a revolution in the years from 2008 to 2019 due to the intensive work at universities and in companies.   The cloud providers including Google, Microsoft and others were driven by the need improve their search engines. Along the way they had amassed vast stores of text and images. Using this data and the new deep learning technologies they were able to build very powerful language translations systems and image recognition and captioning services that were mentioned above. By 2017, applications to eScience were beginning to emerge.

### A.   Generative Neural Networks and Scientific Discovery

One of the more fascinating developments that arose from the deep learning research was the rise of a special class of neural networks called generative models. The important property of these networks is that if you train them with a sufficiently, large and coherent collection of data samples, the network can be used to generate similar samples. These are most often seen in public as pernicious tools to create fake images and videos of important people saying things they never said.

What these networks are doing is creating a statistical distribution of data that, when sampled, produces data that has properties that match nicely with the input data. When a scientist creates a simulation based on assumed parameters of nature, the simulations is evaluated against the statistical properties of the experimental observations.   Generative models can be used to validate the simulation output in case the data is sparse. One of the most commonly used generative model is the Generative Adversarial Network (GAN) in which to networks are pitted against each other. As shown in Figure 6, a discriminator is trained to recognize the data that is from the real data set and a generator is designed to fool the discriminator. When converged the generator mimics the data distribution of the real data. There are several interesting examples of how generative models have been used in science. We list several in [6], but two that stand out are an application in astronomy and one in drug design.

Fig 6. Generative Adversarial Network (GAN) from this site.

M. Mustafa, and colleagues demonstrate how a slightly-modified standard Generative Adversarial Network (GAN) can be used generate synthetic images of weak lensing convergence maps derived from N-body cosmological simulations [7]. The results illustrate how the generated images match their validation tests, but what is more important, the resulting images also pass a variety of statistical tests ranging from tests of the distribution of intensities to power spectrum analysis.

However, generative methods do not, by themselves, help with the problem of inferring the posterior distribution of the inputs to scientific simulations conditioned on the observed results, but there is value of in creating ‘life-like’ samples. F. Carminati, G. Khattak, S. Vallecorsa make the argument that designing and testing the next generation of sensors requires test data that is too expensive to compute with simulation . A well-tuned GAN can generate the test cases that fit the right statistical model at the rate needed for deployment [8].

In the area of drug design, Shahar Harel and Kira Radinsky have used a generative model to suggest chemical compounds that may be used as candidates for study [9]. They start with a prototype compound known to have some desirable properties. This is expressed as sequence and fed to a layer of a convolutional network that allow local structures to emerge as  shorter vectors that are concatenated. A final all-to-all layer is used to generate sequence of mean and variance vectors for the prototype. his is fed to a “diversity layer” which add randomness as shown in Figure 7.

Fig 7. Shahar Harel and Kira Radinsky multi-layer generative netowork for drug design.

The decoder is an LSTM-based recurrent network which generates the new molecule. The results they report are impressive. In a one series of experiments they took as prototypes compounds from drugs that were discovered years ago, and they were able to generate more modern variations that are known to be more powerful and effective. No known drugs were used in the training.

### B.   Probabilistic Programming and Bayesian Inference

For very large experiments such as those conducted in the Large Hadron Collider, the scientists are interested in testing their models of the universe based the record of particle collisions that were generated.   It is often the case that the scientists have a detailed simulation of the experimental system based on the current theoretical models of physics that are involved. These models are driven by setting parameters that correspond to assumptions about nature. When they run the experiment, they see new behaviors in the data, such as new particles emerging from collisions, and they would like to see which parameters when fed to the simulation can produce the output that corresponds to the experimental results. In other words, the simulation is a function taking parameters x to outputs y and the interesting question is given the outputs what the corresponding values for x. This is an inverse problem that is notoriously difficult to solve for large scientific problems. In terms of Bayesian statistics, they are interested in the posterior distribution P(x | y) where y is the experimental statistics.

In 2018 and later researchers began to study programming languages designed to express and solve problems like these. Pyro [10], Tensorflow Probability [11], and the Julia-based Gen [12] are a few of the Probabilistic Programing Languages (PPLs) introduced in 2019. One of the ideas in these languages is to build statistical models of behavior where program variables represent probability distributions and then, by looking at traces of the random choices associated with these variables during execution, one can make inferences about the random behaviors that influence outcomes.

These languages have been put in use in industries such as finance, their role in eScience is now becoming apparent.    A. Güneş Baydin, et al. showed that PPLs can be used in very large Bayesian analysis of data from an LHC use case. Their approach allows a PPL too couple directly to existing scientific simulators through a cross-platform probabilistic execution protocol [13].

Probabilistic programming with PPLs will go on to become a standard tool used by eScientists.

### C.   AI-based Research Assistant

2018 was the year of the smart on-line bot and smart speaker. These were cloud based services that used natural language interfaces for both input and output. The smart speakers, equipped with microphones listen for trigger phrases like “Hello Siri” or “hello Google” or “Alexa” and recorded a query in English, extracted the intent and replied within a second. They could deliver weather reports, do web searches, keep your shopping list and keep track of your online shopping.

The impact of this bot technology will hit eScience when the AI software improves to the point that every scientist, graduate student and corporate executive had a person cloud-based research assistant. Raj Reddy calls these Cognition Amplifiers and Guardian Angels. Resembling a smart speaker or desktop/phone app, the research assistant is responsible for the following tasks:

• Cataloging research data, papers and articles associated with its owner’s projects. The assistant will monitor the research literature looking for papers exploring the same concepts seen in the owner’s work.
• Coordinate meetings among collaborators and establish coordination among other research assistants.
• Automatically sifting through other data from collaborators and open source archives that may be of potential use in current projects.
• Understanding the mathematical analysis in the notes generated by the scientist and using that understanding to check proofs and automatically propose simulators and experiment designs capable of testing hypotheses implied by the research.

From the perspective of 2019 eScience, the claim that the Research Assistant of the future will be able to derive simulations and plan experiments may seem a bit “over the top”. However, progress on automatic code completion and generation was making great strides in 2019. Transformers [16] provide a second generation of sequence-to-sequence AI tools that outpaced the recurrent neural networks previously used.   Advances in programming tools based on similar technology show that AI can be used to generate code completions for C# method calls based on deep semantic mining of online archives of use cases.

We built a toy demo of the Research Assistant in [14] that could take natural language (spoken or typed) input and find related research articles from Bing, Wikipedia and ArXiv (see Figure 6.)

Fig 6. Demo cloud-based “research assistant” [14]

This demo prototype research assistant was built by composing a few cloud tools as shown in Figure 7. A much more sophisticated version can be built using a modern chatbot tool like RASA, but that will not address everything that is needed to reach the final goal.

Fig 7. The toy research assistant demo is hosted an “elegant” cardboard container from Google’s voice kit running a python script on raspberry Pi. It invokes Googles speech to text, a text analysis service running on Algorithmia and uses Amazon’s Lex to generate voice. Depending on context, calls are made to Bing, Wikipedia or ArXiv.

While this demo was a toy, there has been serious work over the last few years to make progress on this topic. In 2018, Google introduced its Dataset Search service to provide easy data set discovery. The work of the Allen Institute for AI stands out [15]. Their Semantic Sanity project is a sophisticated research search engine that allows you to tell it basic topics that interest you and it will monitor ArXiv looking for important related contributions.   Aristo is “an intelligent system that reads, learns, and reasons about science”. It can reason about diagrams, math and understand and answer elementary school science questions.

One important problem that must be solved is extracting non-trivial mathematical concepts from research papers and notes so that similar papers can be found. In addition to Semantic Sanity, there is some other early related work on this problem. For example, researchers are using unsupervised neural net to sift through PubChem and uses NLP techniques to identify potential components for materials synthesis [22].

## IV. The Rise of Quantum in the Cloud

Long viewed as a subject of purely theoretical interest, quantum computing emerged in 2018 as a service in the cloud. By 2019, Rigetti, IBM, D-Wave and Alibaba had live quantum computing services available.   Google and Microsoft followed soon after that.   The approaches taken to building a quantum computer differed in many respects.

The basic unit of quantum computing is the qubit which obeys some important and non-intuitive laws of physics. Multiple qubits can be put into an “entangled” state in which an observation about one can effect the state of the others even when they are physically separated.   One can apply various operators to qubits and pairs of qubits to form “logical” circuits and these circuits are the stuff of quantum algorithms. A fact of life about quantum states is that once you measure them, they are reduced to ordinary bits. However, the probability that a bit is reduced to a zero or a one is determined by the unmeasured quantum state of the system. A good quantum algorithm is one in which the quantum circuit produces outputs with a probability distribution that makes solving a specific problem very fast.

The first important quantum computations involved solving problems in quantum chemistry. In fact, this was the role that Richard Feynman had suggested for a quantum computer when he first came up with the idea in the 1980s. The other area of eScience that quantum computers excelled at was certain classes of optimization problems.

Early quantum computers are program by literally composing the quantum gates as illustrated in Figure 8.

Fig 8. A simple quantum circuit to create an entangled pair of two qubits and measure the result

Programming tools to build such circuits include IBM’s IBM-Q qiskit and Microsoft’s Q# language and compiler. The IBM hardware was based what they call “noisy intermediate-scale quantum computers (NISQ)”. As shown in Figure 9, the induvial qubits are non-linear oscillators. Tuned superconducting resonator channels are used for readout. Also, the qubits are tied together by additional superconducting channels

Fig 9 An early 5-qubit IBM-Q computational unit. Source: IBM-Q website.

One problem with early quantum systems is that the qubits are very susceptible to noise and degrade quickly.   In other words, the error rate of quantum algorithms can be very high. The deeper the quantum circuit (the number of layers of “gates”), the more noise. A way around this is to add error-correcting redundancy into the circuit, but this means more qubits are required.   Quantum volume refers to a measure of the number of qubits needed for the error-corrected circuit times the depth of the corrected circuit. Unfortunately, early quantum computers had very low bounds on achievable quantum volume.

Microsoft took a different approach to building qubits that are based on the topological property of woven threads. These are much more noise resistant and hence capable of building deeper circuits with less error correction.

## IV.  DNA-Based Storage

There are two problems with 2019 era storage technologies.   First, it was not very stable. Data degrades and storage devices fail. Second, there was not enough storage capacity to capture the data that was being generated.   2.5 quintillion bytes of data was generated each day in 2018. Even if you cast out 99.9% of this as being of no long-term value, the remainder still leaves 2.5e+15 bytes per day. The amount of data stored in the cloud in 2018 was 1.0e+18 (an exabyte). In other words, the cloud only held 400 days’ worth of the important data. Obviously, another order or magnitude or more of the data needed to be discarded to allow the data centers time to expand to contain this growth. A new storage technology was needed.

DNA storage was first proposed in the 1960, but not much happened with the idea until 1988 when researchers from Harvard stored an image in the DNA of e.coli. By 2015 researchers at Columbia University published a method that allowed the storage of 215 petabytes (2.15e+17) of data per gram of DNA. And DNA is very stable over long periods of time. While this was promising, there was still a big problem.   The encoding and decoding of the data were still an expensive manual process and it was not practical to have lab scientists running around in the back rooms of data centers managing wet processes.

In 2019, researchers at the University of Washington and Microsoft demonstrated an automated lab that could encode and decode data to DNA without human hands. The system works by converting the ones and zeros of digital data into the As, Ts, Cs and Gs that make up the building blocks of DNA. These encoded sequences are then fed into synthesizing systems. Their next breakthrough was to reduce the entire “lab” to a chip that uses microfluidics to route droplets of water around a grid to enact the needed chemistry steps. They also produced a sophisticated software stack called puddle that allow the scientist to program this with conventional high-level languages [20].

Other research has demonstrated ways in which DNA encoded data could be searched and structured in ways like relational databases.   As the costs came down, this became the standard cloud storage technology.

## VI. Neuromorphic Computing

It was long a dream of computer designers to build systems that mimicked the connectome of the biological brain. A major research initiative in Europe, the Human Brain Project, looked at the possibility of simulation a brain on traditional supercomputers. As that proved unrealistic, they turned to the study of special hardware devices that can simulate the behavior of neuron. Technology to build artificial neurons progressed in university and industry labs. The Stanford neurogrid can simulate six billion synapses.   In 2017 Intel introduced the Loihi neuromorphic research test chip. The device is a many-core mesh of 128 neuromorphic cores and each of which contains 1,024 primitive spiking neural units. The neural units are current-based synapse leaky integrate-and-fire neurons [21].

In addition to the Loihi chip, Intel has released Pohoiki Beach, a system comprised of 64 Loihi chips and a substantial software stack to allow application development. Because overall power consumption is 100 times lower than GPU based neural networks, Intel’s application target is autonomous vehicles and robots. While full realization of true biological brain-like functionality may not be realized until the 2nd half of the 21st century, it is none the less an exciting step forward.

## VII.   Conclusions

EScientists in 2019 found themselves at an important inflection point in terms of the technology they could deploy in doing science.   The cloud has evolved into a massive on-line, on-demand, heterogeneous supercomputer. It not only supports traditional digital simulation; it will also support hybrid quantum-digital computation.   It will soon allow applications to interact with robotic sensor nets controlled by neuromorphic fabrics. Storage of research data will be limitless and backed up by DNA based archives.

One of the most remarkable features of computing in the 21st century has been the evolution of software. Programming tools have evolved into very deep stacks that have used and exploited AI methods to enable scientific programmer to accomplish more with a few lines of Julia in a Juypter notebook than was remotely possible programming computers of 1980. New probabilistic programming languages that combine large scale simulation with deep neural networks show promise of making Bayesian inference an easy to use used tool in eScience.

The role of AI was not limited to the programming and execution of eScience experiments. Users of the AI research assistants in 2050 can look back at the “smart speakers” of 2019 and view them as laughably primitive as the users of the 2019 World Wide Web looked back at the 1970s era where computers were driven by programs written on decks of punched cards.

References

1. G. Fox, R. Williams, P. Massina, “Parallel Computing Works!”, Morgan Kauffman, 1994.
2. N. Jouppi, C. Young, N. Patil, D. Patterson, “A Domain-Specific Architecture for Deep Neural Networks”, Communications of the ACM, September 2018, Vol. 61 No. 9, Pages 50-59
4. Chung, et al. “Serving DNNs in Real Time at Datacenter Scale with Project Brainwave.” IEEE Micro 38 (2018): 8-20.
5. Fowers et al., “A Configurable Cloud-Scale DNN Processor for Real-Time AI,” 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018, pp. 1-14
6. D. Gannon, “Science Applications of Generative Neural Networks”, https://esciencegroup.com/2018/10/11/science-applications-of-generative-neural-networks/, 2018
7. Mustafa, et. al. “Creating Virtual Universes Using Generative Adversarial Networks” (arXiv:1706.02390v2 [astro-ph.IM] 17 Aug 2018)
8. Carminati, G. Khattak, S. Vallecorsa, “3D convolutional GAN for fast Simulation”, https://www.ixpug.org/images/docs/IXPUG_Annual_ Spring_Conference_2018/11-VALLECORSA-Machine-learning.pdf
9. Harel and K. Radinsky, “Prototype-Based Compound Discovery using Deep Generative Models” http://kiraradinsky.com/files/acs-accelerating-prototype.pdf
10. Bingham, et al, “Pyro: Deep universal probabilistic programming.” Journal of Machine Learning Research (2018).
11. Tensorflow Probability, https://medium.com/tensorflow/an-introduction-to-probabilistic-programming-now-available-in-tensorflow-probability-6dcc003ca29e
12. Gen, https://www.infoq.com/news/2019/07/mit-gen-probabilistic-programs
13. Güneş Baydin, et al, “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale”, arXiv:1907.03382v1
14. Gannon, “Building a ‘ChatBot’ for Scientific Research”, https://esciencegroup.com/2018/08/09/building-a-chatbot-for-scientific-research
15. Allen Institute for AI, Semantic Sanity and Aristo, https://allenai.org/demos
16. Vaswani et al., “Attention Is All You Need”, arXiv:1706.03762v5
17. Chong, Hybrid Quantum-Classical Computing https://www.sigarch.org/hybrid-quantum-classical-computing/
18. Shaydulin, et al., “A Hybrid Approach for Solving Optimization Problems on Small Quantum Computers,” in Computer, vol. 52, no. 6, pp. 18-26, June 2019.
19. Takahashi, B. Nguyen, K. Strauss, L. Ceze, “Demonstration of End-to-End Automation of DNA Data Storage”, Nature Scientific Reports, vol. 9, no. 1, pp. 2045-2322, 2019
20. Wellsey, et al., “Puddle: A Dynamic, Error-Correcting, Full-Stack Microfluidics Platform”, ASPLOS’19 , April 13–17, 2019.
21. Intel Loihi Neurmorphic Chip. https://en.wikichip.org/wiki/intel/loihi
22. E. Kim, et al., “Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning”, Chem. Mater. 2017 29219436-944

# Julia Distributed Computing in the Cloud

### Abstract.

This brief note is intended to illustrate why the programming language Julia is so interesting to a growing number of computational and data scientists.  Julia is designed to deliver high performance on modern hardware while retaining the interactive capabilities that make it well suited for Jupyter-style scientific exploration. This paper illustrates, with some very unscientific examples, how Julia can be deployed and with Docker and Kubernetes in the cloud.

# Introduction

In all our previous posts we have used Python to build applications that interact with cloud services.  We used Python because everybody knows it.  However, as many scientists have now discovered, for scientific applications the programming language Julia is a better alternative.  This note is not a Julia tutorial, but rather, it  is intended to illustrate why Julia is so interesting. Julia was launched in 2012 by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman and is now gaining a growing collection of users in the science community.  There is a great deal of Julia online material and a few books (and some of it is up to date).   In a previous blog post we looked at Python Dask for distributed computing in the cloud.   In this article we focus on how Julia can be used for those same parallel and distributed computing tasks in the Cloud. Before we get to the cloud some motivational background is in order.

# 1.    Julia is Fast!

There is no reason a language for scientific computing must be as dull as Fortran or C.   Languages with lots of features like Java, Python and Scala are a pleasure to use but they are slow.   Julia is dynamic (meaning it can run in a read-eval-print-loop in Jupyter and other interactive environments), it has a type system that support parametric polymorphism and multiple dispatch.   It is also garbage collected and extremely extensible.   It has a powerful package system, macro facilities and a growing collection of libraries.   It can call C and Python directly if it is needed.

And it generates fast code.   Julia uses a just-in-time compiler that optimized your program depending on how you use it.    To illustrate this point, consider a simple function of the form

function addem(x)
# do some simple math with x
x += x*x
return x
end


Because we have not specified the exact type of x, this defines a generic function: it will work with arguments of any type that has meaning for the math operations used.   But when we invoke it with a specific type, such as int or float, a version of the function is compiled on the fly that is specialized to that type.  This specialization takes a bit of time, but when we invoke the function a second time  with that argument type, we used the specialized version.   As illustrated in Figure 1, we called the function twice with integer arguments.  The second call is substantially faster than the first.  Calling it with a floating point argument is slow until the specialized version for floating point variables is created, then the function runs fast.

Figure 1.   Using a timing macro to measure the speed of function calls.   In step 23 the generic version is called with an integer.   The second call uses the version optimized for integers.   In  steps 25 an 26 the specialized version for floating point numbers is generated and run.

The Julia team has put together a benchmark set of small examples written in several languages.  We extracted the benchmarks for C, Python and Julia and ran them.   The resulting execution time are shown below.   As you can see, the Julia code generation is arguably as good as  C (compiled with optimized gcc).  Relative to Python it is as much as 50 times faster.  Figure 2 makes this more explicit.  The only benchmark where Python is within a factor of two is  matrix multiply and that is because Python is using the optimized numpy.linalg libraries.

 Benchmark python gcc – O julia recursion_fibonacci 2.977 0.000 0.037 parse_integers 2.006 0.117 0.197 userfunc_mandelbrot 6.294 0.068 0.065 recursion_quicksort 12.284 0.363 0.364 iteration_pi_sum 558.068 22.604 22.802 matrix_statistics 82.952 11.200 10.431 matrix_multiply 70.496 41.561 41.322 print_to_file 72.481 17.466 8.100

Figure 2.   Speed-up of Julia and C relative to Python

# 2.    Julia Math and Libraries

Doing basic linear algebra and matrix computation in Julia is trivial.  The following operations each take less than one second in Jupyter using the package LinearAlgebra.jl.

M = rand(1000, 1000); # generates a 1000 by 1000 matrix of random floats
N = M^-1                        # compute the inverse of M
C = M*M’                       # M’ is the transpose so the product is symmetric.
Q = svd(C)                     # computes the singular value decomposition of C.
Q.S                                 # are the singular values.

Sparse matrices are also supported along with a large number of other matrix creation and manipulation operations.

### Differential Equations

The differential equation package built by a team led by Christopher Rackauckas is one of the most impressive.  To illustrate this we consider an example from their excellent tutorial.   The Lorenz attractor is a fascinating example of a chaotic solution to a system of differential equations that model convection.   The system involves the evolution in 3D of a system that is characterized by three parameters.  As shown below the equations are expressed exactly as you would describe them mathematically (dx is dx/dt) and the three parameters ar sigma, rho and beta.   The initial point is (1,0,0) and the region is integrated over [0,100].   The package automatically picks an appropriate solver and the output is plotted as shown in Figure 3.   Running this on a Mac mini took about 5 seconds.  We used another package “Plots” to render the image.

g = @ode_def LorenzExample begin
dx = σ*(y-x)
dy = x*(ρ-z) – y
dz = x*y – β*z
end σ ρ β

u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(g,u0,tspan,p)
sol = solve(prob)
plot(sol)

Figure 3.  Plot of the Lorenz attractor solution. (yes, Greek Unicode character names are valid.)

Note: Thomas Breloff provides a more explicit integration with a Gif animation in this Plots tutorial.  As with the above example, Breloff’s solution works well in Jupyter.

# Julia Distributed Computing in the Cloud

Julia has several packages that support parallel computing.   These include a library for CUDA programming, OpenMP, Spark and  MPI.jl for MPI programming.  MPI is clearly the most reliably scalable parallel computing model for tasks involving 10000 or more cores.  It is based on low-latency, two-sided communication in which coordinated, synchronized send-receive pairs and collective operations are executed in parallel across large clusters equipped with specialized networking.  While Julia can be used with MPI, the natural model of communication in Julia is based on one-sided communication based on threads, tasks, futures and channels.  We will use the package Distributed.jl to demonstrate this.

The Julia distributed computing model is based on distributing tasks and arrays to a pool of workers.  Worker are either Julia processes running on the current host or on other hosts in your cluster.   (In the paragraphs that follow we show how to launch local workers, then workers on other VMs in the cloud and finally workers as docker containers running in a Kubernetes cluster.

Let’s begin with a trivial example.   Create a function which will flip a coin “n” times and count the number of heads.

Here we generated random Boolean values and converted them to 0 or 1 and added them up.   Now let’s add some workers.

First we include the “Distributed” package and count the current number of workers.   Next using the “addproc( )” function we added two new worker processes on the same server running this Jupyter notebook.  Workers that our notebook process knows about are identified with an integer and we can list them.   (The notebook process, called here the master, is not a worker)

The worker processes are running Julia but they don’t know about the things we have defined in the master process.  Julia has a very impressive macro facility that is used extensively in the distributed computing package to distributed code objects to the workers and  to launch remote computations.

When we create a function in the master that we want executed in the workers we must make sure it also gets defined in that worker.   We use the “@everywhere” macro to  make sure things we define locally are also defined in each worker.   We even must tell the workers we are using the Distributed package.  In the code below we created a new version of our count_heads function and distributed it.

Julia uses the concept of futures to launch computation on the workers.   The “@spawnat” macro takes two parameters: the ID of a worker and a computation to be launched.   What is returned is a future: a placeholder for the final result.   By “fetch( a)” we can grab the result of the future computation and return it to our calling environment.  (Some library functions like “printf()” when executed on the workers are automatically mapped back to the master.)

In the following example we create two local workers and define a function for them to execute.  Workers each have a unique integer ID and we print them.   Then we use @spawnat to launch the function on each worker in turn.

We can easily measure the cost of this remote function evaluation with the @time macro as follows. We enclose the call and fetch in a block and time the block.  (Notice we have increased the number of coins to 109 from 104.)

If we want both workers working together, we can compose a list of spawned invocations with a construct of the form

[ @spawn at i expr(i) for i in range]


This returns a list (technically, in Julia it is an array) of futures.   We can then grab the result of each future in turn.   The result is shown below.  This is run on a dual core desktop machine, so parallelism is limited but, in this case, the parallel version is 1.85 times faster than the individual call.   The reason that it is not exactly two time faster is partly due to the overhead in sequentially launching and resolving the futures.  But it is also due to communication delays and other OS scheduling delays.

A more compact and elegant way to write this parallel program is to use the Julia distributed parallel map function  pmap(f, args).  pmap takes a function and applies it to each element of a set of arguments and uses the available workers to do the work.   The results are returned in an array.

In this case count_headse did not need an argument so we constructed an anonymous function with one parameter to provide to the pmap function.  In this execution we were only concerned with dividing the work into two parts and then letting the system schedule and execute them with the available worker resources.   We chose 2 parts because we know there is two workers.   However, we could have divided into 10 parts and applied 10 argument values and the task would have been accomplished using the available workers.

## Julia distributed across multiple host machines.

To create a worker instance on another host Julia uses secure shell (ssh) tunnels to talk to it.  Hence you need five things:  the IP address of the host, the port that secure shell uses, the identity of the “user” on that host and the private ssh key for that user.  The ssh key pair must be password-less.  The location of the Julia command on that host is also needed.

In this and the next example we simplify the task of deploying Julia on the remote host by deploying our Julia package as a docker container launched on that host.  To make the ssh connection work we have mapped the ssh port 22 on the docker container to port 3456 on the host.   (We describe the container construction and how it is launched in the next section.)

In the  previous section we provided “addprocs()” with a single integer representing the number of worker we wanted launched locally.   As shown below, the remote version requires a bit more.  We supply an array of tuples to addprocs() where each tuple provides the contact point and the number of workers we want there.  In this example we spawn 2 workers on one remote node and one worker on the other.  We also provide the local location of the private key (here called pubkey) in the sshflags argument.

We also want each worker to have the Distributed.jl package and another package called “PyCall” which enables calling python library functions.  We demonstrate the python call with a call to socket.hostname() in each worker.   Notice that the remote function invocation returns the “print” output to the master process.  The strange hostnames that are printed are the synthetic host names from the docker containers.

This example did not address performance.  We treat that subject briefly at the end of the next section.

### Channels

In addition to spawning remote tasks Julia supports a remote channel mechanism.   This allows you to declare a channel that can carry messages of a given type and hold them for remote listeners to pickup.  In the example below we declare a remote channel that carries string messages and a function defined for the listeners to use.   The worker can use the “take!()” function to remove an item from the channel and the master uses “put!()” to fill the channel.  The message “-1”  tells the listener to stop listening.

Using the channel mechanism one can use Julia to have a group of workers respond to a queue of incoming messages.   In the Github site for this post we have put an example where the workers take the channel messages and put the results into an Azure table.

## Julia Distributed Computing on Kubernetes

Depending upon your favorite flavor of cloud there are usually three or four ways to spin up a cluster of nodes to run a distributed computation.   The most basic way to do this is to launch a group of virtual machines where each has the needed resources for the problem you want to solve.   For Julia, the best thing to do is launch instances of a ‘data science VM” that is available on AWS or Azure.  There are two things your VM needs: an installation of Julia version 1.0.0 or later and the ability to ssh to it without using a password.

Here is how to do this on Azure.   First, run “ssh-keygen” and it will step you through the process of generating a key-pair and give you the option of having no password.  Then from the Azure portal select “create a resource” and search for Linux data science VM.   As you go through the installation process when it asks for a key, paste in the public key you just generated.   When the VM is up you can ssh to it to verify that it has the needed version of Julia installed by typing “Julia” at the command line.  If it can’t find Julia you will need to download and install it.   While you are logged and running Julia, you should also install some of the libraries your distributed program will need.  For example, Distributed.jl and PyCall.jl or any package specific to your application.   If you have a large cluster of VMs, this is obviously time consuming.

A better solution is to package your worker as a Docker container and then use Kubernetes to manage the scale-out step.   We have set up a docker container dbgannon/juliacloud2 in the docker hub that was be used for the following experiments.  This container is based on Jupyter/datascience-notebook so it has a version of Jupyter and Julia 1.0.0 already installed.    However to make it work has a Julia distributed worker it must be running the OpenSsh daemon sshd. A passwordless key pair has been preloaded into the appropriate ssh directory.  We have also installed the Jullia libraries Distributed.jl, PyCall.jl and IJulia.jl.   PyCall is needed because we need to call some python libraries and Ijulia.jl is critical for running Julia in Jupyter.  We have included all the files needed to build and test this container in Github.

## Launching the container directly on your local network

The docker command to launch the container on your local network is

docker run -it -d -p 3456:22 -p 8888:8888  dbgannon/juliacloud2


This exposes the ssh daemon on port 3456 and, if you run the jupyter notebook that is on port 8888.  To connect to the server you will need the ssh key which is found on the Github  site.   (If you wish to use your own passwordless keypair you will need to rebuild the docker container using the files in Github. Just replace pubkey and pubkey.pub. and rebuild the container.)   To connect to the container and start Jupyter use the key pubkey.

ssh -i pubkey jovyan@localhost -p 3456
…

# Conclusion

Cloud data centers are getting high performance networks (with latencies of only a few microseconds in the case of Azure Brainwave) and immense computing capacity such as the tensor processing capability of Google’s TPU.  At the same time designers of supercomputers are having to deal with more failure resilience and complexity in the design of the first exascale supercomputers.  For the next generation exascale systems the nodes will be variations on a theme of multicore and GPU-style accelerators.

Observed from a distance, one might conclude the architectures of cloud data centers and the next generation of supercomputers are converging.  However, it is important to keep in mind that the two are designed for different purposes. The cloud is optimized for fast response for services supporting many concurrent globally distributed clients. Supers are optimized for exceptionally fast execution of programs on behalf of a small number of concurrent users.   However, it may be the case that an exascale system may be so large that parts of it can run many smaller parallel jobs at once.  Projects like Singularity provide a solution for running containerized application on supercomputers in a manner similar to the way microservices are run on a cloud.

## Possible Futures

### The continuum: edge+cloud+supercomputer

There are interesting studies showing how supercomputers are very good at training very large, deep neural networks.  Specifically, NERSC scientists have show the importance of this capability in many science applications[4]. However, if you need to perform inference on models that are streamed from the edge you need the type of edge+cloud strategy described here.   It not hard to imagine scenarios where vast numbers of instrument streams are handled by the edge and fed to inference models on the cloud and those models are being continuously improved on a back-end supercomputer.

### A data garden

In the near future, the most important contribution clouds can make to science is to provide access to important public data collections.  There is already reasonable start.   AWS has an opendata registry that has 57 data sets covering topics ranging from astronomy to genomics.   Microsoft Research has a Data Science for Research portal with a curated collection of datasets relating to human computer interaction, data mining, geospatial, natural language processing and more.  Google cloud has a large collection of public genomics datasets.  The US NIH has launch three new cloud data and analytics projects.  They include the Cancer Genomics Cloud led by the Institute for Systems Biology with Google’s cloud, FireCloud from the Broad Institute also using Google’s cloud and Cancer Genomics Cloud (CGC), powered by Seven Bridges.   These NIH facilities also provide analytics frameworks designed to help research access and effective use the resources.

I am often asked about research challenges in cloud computing that student may wish to undertake.   There are many.  The fact that the IEEE cloud computing conference being held in San Francisco in July received nearly 300 submissions shows that the field is extremely active.   I find the following topics very interesting.

1. Find new ways to extract knowledge from the growing cloud data garden.   This is a big challenge because the data is so heterogeneous and discovery of the right tool to use to explore it requires expert knowledge.  Can we capture that community knowledge so that non-experts can find their way?  What are the right tools to facility collaborative data exploration?
2. There are enormous opportunities for systems research in the edge-to-cloud-to-supercomputer path.  How does one create a system to manage and optimize workflows of activities that span this continuum?  Is there a good programming model for describing computations involving the edge and the cloud?  Can a program be automatically decomposed into the parts that are best run on the edge and the parts on cloud?  Can such a decomposition be dynamically adjusted to account for load, bandwidth constraints, etc.?
3. Concerning the intelligent assistant for research, there are a number of reasonable projects short of build the entire thing.  Some may be low hanging fruit, and some may be very hard.  For example, ArXiv, Wikipedia and Google search and Bing are great for discovery but in different ways.   Handling complex queries like “what is the role of quantum entanglement in the design of a quantum computer?” should lead to a summary of the answer with links.   There is a lot of research on summarization and there are a lot of sources of data.  Another type of query is “How can I access data on genetic indicators related to ALS?”  Google will go in the right direction, but it takes more digging to find data.

These are rather broad topics, but progress on even the smallest part may be fun.

[1] L. Ramakrishnan, P. T. Zbiegel, S. Campbell, R. Bradshaw, R. S. Canon, S. Coghlan, I. Sakrejda, N. Desai, T. Declerck, and A. Liu. Magellan: Experiences from a science cloud. In 2nd International Workshop on Scientiﬁc Cloud Computing, pages49–58., ACM, 2011.

P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoﬀ, S. Saini, and R. Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications. In 3rd Workshop on Scientiﬁc Cloud Computing, pages 41–50. ACM, 2012

[3] https://blogs.microsoft.com/green/2018/05/23/achievement-unlocked-nearly-200-million-images-into-a-national-land-cover-map-in-about-10-minutes/  from Lucas Joppa – Chief Environmental Scientist, Microsoft

# Parallel Programming in the Cloud with Python Dask

I am always looking for better ways to write parallel programs.  In chapter 7 of our book “Cloud Computing for Science and Engineering” we looked at various scalable parallel programming models that are used in the cloud.   We broke these down into five models: (1) HPC-style “Single Program Multiple Data” (SPMD) in which a single program communicates data with copies of itself running in parallel across a cluster of machines, (2) many task parallelism that uses many nearly identical workers processing independent data sets, (3) map-reduce and bulk synchronous parallelism in which computation is applied in parallel to parts of a data set and intermediate results of a final solution are shared at well defined, synchronization points,  (4) graph dataflow transforms a task workflow graph into sets of parallel operators communicating according to the workflow data dependencies and (5) agents and microservices  in which a set of small stateless services process incoming data messages and generate messages for other microservices to consume.  While some applications that run in the cloud are very similar to the batch style of HPC workloads, parallel computing in the cloud is often driven by different classes application requirements.  More specifically, many cloud applications require massive parallelism to respond external events in real time.  This includes thousands of users that are using apps that are back-ended by cloud compute and data.   It also includes applications that are analyzing streams of data from remote sensors and other instruments.   Rather than running in batch-mode with a start and end, these applications tend to run continuously.

A second class of workload is interactive data analysis.   In these cases, a user is exploring a large collection of cloud resident data.   The parallelism is required because the size of the data: it is too big to download and if you could the analysis would be too slow for interactive use.

We have powerful programming tools that can be used for each of the parallel computing models described above but we don’t have a single programming tool that support them all.   In our book we have used Python to illustrate many of the features and services available in the commercial clouds.  We have taken this approach because Python and Jupyter are so widely used in the science and data analytics community.  In 2014 the folks at Continuum (now just called Anaconda, Inc) and a several others released a Python tool called Dask which supports a form of parallelism similar to at least three of the five models described above.  The design objective for Dask is really to support parallel data analytics and exploration on data that was too big to keep in memory.   Dask was not on our radar when we wrote the drafts for our book,  but it certainly worth discussing now.

This is not intended as a full Dask tutorial.   The best tutorial material is the on-line YouTube videos of talks by Mathew Rocklin from Anaconda.   The official  tutorials from Anaconda are also available.  In the examples we will discuss here we used three different Dask deployments.  The most trivial (and the most reliable) deployment was a laptop installation.  This worked on a Windows 10 PC and a Mac without problem.  As Dask is installed with the most recent release of Anaconda, simply update your Anaconda deployment and bring up a Jupyter notebook and “import dask”.    We also used the same deployment on a massive Ubuntu linux VM on a 48 core server on AWS.  Finally, we deployed Dask on Kubernetes clusters on Azure and AWS.

Our goal here is to illustrate how we can use Dask to illustrate several of the cloud programming models described above.    We begin with many task parallelism, then explore bulk synchronous and a version of graph parallelism and finally computing on streams.  We say a few words about SPMD computing at the end, but the role Dask plays there is very limited.

## Many Task Parallelism and Distributed Parallel Data Structures

Data parallel computing is an old important concept in parallel computing.  It describes a programming style where a single operation is applied to collections of data as a single parallel step. A number of important computer architectures supported data parallelism by providing machine instructions that can be applied to entire vectors or arrays of data in parallel.  Called Single instruction, multiple data (SIMD) computers, these machines were the first supercomputers and included the Illiac IV and the early Cray vector machines.  And the idea lives on as the core functionality of modern GPUs.   In the case of clusters computers without a single instruction stream we traditionally get data parallelism by distributed data structures over the memories of each node in the cluster and then coordinating the application of the operation in a thread on each node in parallel.   This is an old idea and it is central to Hadoop, Spark and many other parallel data analysis tools.   Python already has a good numerical array library called numpy, but it only supports sequential operations for array in the memory of a single node.

Dask computations are carried out in two phases.   In the first phase the computation is rendered into a graph where the nodes are actual computations and the arcs represent data movements.   In the second phase the graph is scheduled to run on a set of resources.  This is illustrated below.  We will return to the details in this picture later.

Figure 1.  Basic Dask operations: compile graph and then schedule on cluster

There are three different sets of “resources” that can be used.   One is a set of threads on the host machine.   Another is a set of process and the third is a cluster of machines.   In the case of threads and local processes the scheduling is done by the “Single machine scheduler”.   In the case of a cluster it called the distributed cluster.  Each scheduler consumes a task graph and executes it on the corresponding host or cluster.   In our experiments we used a 48 core VM on AWS for the single machine scheduler. In the cluster case the preferred host is a set of containers managed by Kubernetes.   We deployed two Kubernetes clusters:  a three node cluster on Azure and a 6 node cluster on AWS.

Python programmers are used to numpy arrays, so Dask takes the approach to distributing arrays by maintaining as much of the semantics of numpy as possible.  To illustrate this idea consider the following numpy computation that creates a random 4 by 4 array, then zeros out all elements lest than 0.5 and computes the sum of the array with it’s transpose.

x = np.random.random((4,4))
x[x<0.5] = 0
y = x+x.T


We can use Dask to make a distributed version of the same matrix and perform the same computations in parallel.

Import dask.array as da
x = da.random.random(size = (4,4), chunks =(4,1))
x[x<0.5] = 0
y = x+x.T


The important new detail here is that we give explicit instructions on how we want the array to be distributed by specifying the shape of the chunks on each node.   In this case we have said we want each “chunk” to be a 4×1 slice of the 4×4 array.   We could have partitioned it into square blocks of size 2×2.   Dask takes care of managing each chunk and the needed communication between the processes that handle each chunk.   The individual chunks are managed on each thread/process/worker as numpy arrays.

As stated above, there are two parts to a dask computation.   The first phase is the construction of a graph representing the computation involving each chunk. We can actually take a look at the graph.   For example, in the computation above we can use the “visualize()” method as follows.

y = x+x.T
y.visualize()


Figure 2.   Sample Dask Graph for x+x.T

The nodes represent data or operations and the lines are data movements from one node to another.  As can be seen this is a rather communication intensive graph.   This is becase the transpose operation requires element on the rows (which are distributed) must be moved to columns on the appropriate node to do the addition.  The way we chunck the array can have a huge impact on the complexity of the distributed computation.  For example, 2×2 chuncking makes this one very easy.   There are 4 chunks and doing the transpose involves only a simple swap of the “off diagonal” chunks.   In this case the graph is much simpler (and easier to read!)

Figure 3.  Task graph for x+x.T with 2×2 chunking of data

The second step for Dask is to send the graph to the scheduler to schedule the subtasks and execute them on the available resources. That step is accomplished with a call to the compute method.

y.compute()


Dask arrays support almost all the standard numpy array operations except those that involve complex communications such as sorting.

In addition to numpy-style arrays, Dask also has a feature called Dask dataframes that are distributed versions of Pandas dataframes.   In this case each Dask dataframe is partitioned by blocks of rows where each block is an actual Pandas dataframe.  In other words, Dask dataframes operators are wrappers around the corresponding Pandas wrappers in the same way that Dask array operators are wrappers around the corresponding numpy array operators.    The parallel work is done primarily by the local Pandas and Numpy operators working simultaneously on the local blocks and this is followed by the necessary data movement and computation required to knit the partial results together.  For example, suppose we have a dataframe, df, where each row is a record consisting of a name and a value and we would like to compute the sum of the values associated with each name.   We assume that names are repeated so we need to group all records with the same name and then apply a sum operator.  We set this up on a system with three workers.  To see this computational graph we write the following.

df.groupby(['names']).sum().visualize()


Figure 4.  Dataframe groupby reduction

Figure 5.  Processing Yellow Cab data for New York City

The persist method moves the dataframe into memory as a persistent object that can be reused without being recomputed.  (Note:  the read_cvs method did not work on our kubernetes clusters because of a missing module s3fs in the dask container, but it did work on our massive shared memory VM which has 200 GB of memory.)

Having loaded the data we can now follow the dask demo example and compute the best hour to be a taxi driver based on the fraction of tip received for the ride.

Figure 6.  New York City cab data analysis.
As you can see, it is best to be a taxi driver about 4 in the morning.

A more general distributed data structure is the Dask Bag that can hold items of less structured type than array and dataframes.   A nice example http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html illustrates using Dask bags to explore the Enron public email archive.

One of the more interesting Dask operators is one that implements a version of the old programming language concept of a future   A related concept is that of lazy evaluation and this is implemented with the dask.delayed function.   If you invoke a function with the delayed operator it simply builds the graph but does not execute it.  Futures are different.    A future is a promise to deliver the result of a computation later.  The future computation begins executing but the calling thread is handed a future object which can be passed around as a proxy for the result before the computation is finished.

The following example is a slightly modified version of one of the demo programs.   Suppose you have four functions

def foo(x):
return result
def bar(x):
return result
def linear(x, y):
return result
def three(x, y, z):
return result


We will use the distributed scheduler to illustrate this example. We first must create a client for the scheduler. Running this on our Azure Kubernetes cluster we get the following.


c = Client()
c


To illustrate the delayed interface, let us build a graph that composes our example functions

from dask import visualize, delayed
i = 3
x = delayed(foo)( I )
y = delayed(bar)( x )
z = delayed(linear)(x, y)
q = delayed(three)( x, y, z)
q.visualize(rankdir='LR')


In this example q is now a placeholder for the graph of a delated computation.   As with the dask array examples, we can visualize the graph (plotting it from Left to Right).

Figure 7.  Graph of a delayed computation.

A call to compute will evaluate our graph.   Note that we have implemented the  four functions each with about 1 second of useless computational math (computing the sum of a geometric series) so that we can measure some execution times.   Invoking compute on our delayed computation gives us

which shows us that there is no parallelism exploited here because the graph has serial dependences.

To create a future, we “submit” the function and its argument to the scheduler client.  This immediately returns a reference to future value and starts the computation.  When you need the result of the computation the future has a method “result()” that can be invoked and cause the calling thread to wait until the computation is done.

Now let us consider the case where the we need to evaluate this graph on 200 different values and then sum the results.   We can use futures to kick off a computation for each instance and wait for them to finish and sum the results.   Again, following the example in the Dask demos, we ran the following on our Azure Kubernetes cluster:

Ignore the result of the computation (it is correct). The important result is the time. Calculating the time to run this sequentially (200*4.19 = 838 seconds) and dividing by the parallel execution time we get a parallel speed-up of about 2, which is not very impressive. Running the same computation on the AWS Kubernetes cluster we get a speed-up of 4. The Azure cluster has 6 cores and the AWS cluster has 12, so it is not surprising that it is twice as fast. The disappointment is that the speed-ups are not closer to 6 and 12 respectively.

Results with AWS Kubernetes Cluster

However, the results are much more impressive on our 48 core AWS virtual machine.

Results with AWS 48-core VM

In this case we see a speed-up of 24.   The difference is the fact that the scheduling is using shared memory and threads.

Dask futures are a very powerful tool when used correctly.   In the example above, we spawned off 200 computations in less than a second.   If the work in the individual tasks is large, that execution time can mask much of the overhead of scheduler communication and the speed-ups can be much greater.

Dask has a module called streamz that implements a basic streaming interface that allows you to compose graphs for stream processing.   We will just give the basic concepts here.   For a full tour look at https://streamz.readthedocs.io.   Streamz graphs have sources,  operators and sinks.   We can start by defining some simple functions as we did for the futures case:

def inc(x):
return x+13
def double(x):
return 2*x
def fxy(x): #expects a tuple
return x[0]+ x[1]
return x+y
from streamz import Stream
source = Stream()


The next step will be to create a stream object and compose our graph.   We will describe the input to the stream later.   We use four special stream operators here.    Map is how we can attach a function to the stream.   We can also merge two streams with a zip operator.   Zip waits until there is an available object on each stream and then creates a tuple that combines both into one object.   Our function fxy(x) above takes a tuple and adds them.   We can direct the output of a stream to a file, database, console output or another stream with the sink operator.  Shown below our graph has two sink operators.

Figure 8.  Streamz stream processing pipeline.

Visualizing the graph makes this clear.   Notice there is also an accumulate operator.   This allows state flowing through the stream to be captured and retained.   In this case we use it to create a running total.  To push  something into the stream we can use the emit() operator as shown below.

The emit() operator is not the only way to send data into a stream. You can create the stream so that it takes events from kafka, or reads lines from a file or it can monitor a file system directory looking for new items. To illustrate that we created another stream to look at the home director of our kubernetes cluster on Azure. Then we started this file monitor. The names of the that are there are printed. Next, we added another file “xx” and it picked it up. Next, we invoked the stream from above and then added another file “xxx”.

# Handling Streams of Big Tasks

Of the five types of parallel programming Dask covers 2 and a half:  many task parallelism, map-reduce and bulk synchronous parallelism and part of graph dataflow.   Persistent microservices  are not part of the picture.   However, Dask and Streamz can be used together to handle one of the use cases for microservices.  For example, suppose you have a stream of tasks and you need to do some processing on each task but the arrival rate of tasks exceed the rate at which you can process them.   We treated this case with Microservices while processing image recognition with MxNet and the resnet-152 deep learning model (see this article.)  One can  use the Streams sink operation to invoke a future to spawn the task on the Kubernetes  cluster.   As the tasks finish the results can be pushed to other processes for further work or to a table or other storage as illustrated below.

Figure 9 Extracting parallelism from a stream.

In the picture we have a stream called Source which gathers the events from external sources.  We then map it to a function f() for initial processing. The result of that step is sent to a function called spawn_work which creates a future around a function that does some deep processing and sends a final result to an AWS DynamoDB table.   (The function putintable(n) below shows an example.  It works by invoking a slow computation then create the appropriate DynamoDB metadata and put the item in the table “dasktale”.)

def putintable(n):
import boto3
e = doexp(n*1000000)
dyndb = boto3.resource('dynamodb', … , region_name='us-west-2' )
table.put_item(Item= item )
return e

def spawn_work(n):
x = cl.submit(putintable, n)


This example worked very well. Using futures allowed the input stream to work at full speed by exploiting the parallelism. (The only problem is that boto3 needs to be installed on all the kubernetes cluster processes. Using the 48 core shared memory machine worked perfectly.)
Dask also has a queue mechanism so that results from futures can be pushed to a queue and another thread can pull these results out. We tried as well, but the results were somewhat unreliable.

# Conclusion

There are many more stream, futures, dataframe and bag operators that are described in the documents.   While it is not clear if this stream processing tool will be robust enough to replace any of the other systems current available, it is certainly a great, easy-to-use teaching tool.   In fact, this statement can be made about the entire collection of Dask related tools.   I would not hesitate to use it in an undergraduate course on parallel programming.   And I believe that Dask Dataframes technology is very well suited to the challenge of big data analytics as is Spark.

The example above that uses futures to extract parallelism from a stream challenge is interesting because it is completely adaptive. However, it is essential to be able to launch arbitrary application containers from futures to make the system more widely applicable.   Some interesting initial work has been done on this at the San Diego Supercomputer center using singularity to launch jobs on their resources using Dask.   In addition the UK Met Office is doing interesting things with autoscaling dask clusters.   Dask and StreamZ are still young.   I expect them to continue to evolve and mature in the year ahead.

# Algorithmia™: A Cloud Marketplace for Algorithms and Deep Learning

One area of great frustration encountered by application developers involves the challenge of integrating new algorithms into a code base.  There are many reasons for this.   For example, the algorithm may be described in a journal article where many details of the implementation are omitted or it is available only in a programming language different from the one being used.  The code may have software dependencies that are hard to resolve.  The new algorithm may also have hardware dependencies, such as reliance on a GPU to get performance and you may not have access to this hardware.  On the other hand, if you are the author of a great new algorithm you may be disappointed that your new invention is not being used for these very same reasons.

About 18 months ago a company called Algorithmia™  was founded in Seattle that provides an elegant solution to these problems.  They provide a very simple multi-language API that can be used to invoke any of their catalog of 3,500 different cloud-based algorithms. While we may be getting tired of reading about X-as-a-Service for different versions of X, there is one binding for X that has been around for a while in various forms and, as much as it pains me to do so, it begs to be called Algorithms as a Service.   And AaaS is just one of the things Algorithmia provides.

AaaS is indeed not a new idea.  Jack Dongarra and his ICL team at the University of Tennessee created NetSolve/GridSove in 2003  to provide scientists and engineers with access to state-of-the-art numerical algorithms running on a distributed network of high performance computers.   As cool as NetSolve is, Algorithmia goes several steps beyond this concept.

One of Algorithmia’s cofounders and CEO,  Diego Oppenheimer has a deep background in building business intelligence tools.   While working on that he developed an appreciation of the power of being able to call out to powerful algorithms from inside a user facing application.  This capability allows the application to have access to deeper knowledge and more powerful computational resources than available on the user’s device.  A key insight from this experience is that algorithms must be discoverable an invokable from any user application runtime.   These ideas are all central to Algorithmia.  In the following paragraphs we will look at Algoritmia’s marketplace,  explore building a new  algorithm and discuss a bit of the system microservice architecture.

# Algorithmia is a marketplace.

There are over 50,000 developers that use Algorithmia services and the platform encourages these developers to contribute new algorithms to the collection.   Invoking an algorithm is dead simple and it can be done from any programming language that can formulate a JSON doc and send a REST message.   We will provide some detailed illustrations at the end of this document.

To use it, you need to set up an account.   Doing so will get you a starter award of 5000 or so “credits”.   When you invoke an algorithm, credits are deducted from your account.   Typically, there is a “royalty” cost of about 10 credits and then the cost is usually around one credit per second of execution.   A fun example from their library of deep learning collection is an image colorizer.   Input is a PNG file of a black and white image and the returned value is a link to the output colorized image.  We took a color image from a visit to Red Square a few years ago.   We converted it to a grayscale image and gave that to the colorizer.  The result is shown illustrated below.  The original is on the left, grayscale in the middle and the colorized image on the right.   While it is not as good as the best hand-colored photos, it is not too bad.     It lost the amazing color of St. Bazil’s Cathedral which is not too surprising,  but it was great with sky and skin tones of those people in foreground.   (It seemed to think the bricks of the square would look better with some grass color.)

The Python code to upload the grayscale image and invoke the service was incredibly simple.

import Algorithmia
client = Algorithmia.client(‘youruserkeyfromaccountrecation’)
result = client.algo("deeplearning/ColorfulImageColorization/1.1.6")
.pipe(input).result
path_to_local_copy_of_result_image= client.file(result[‘output’]).getFile()


The cost in credits was 154.   The exchange rate for credits is 1$= 10,000 credits (approximately) so this invocation would have cost about 1.5 cents. This algorithm is from their extensive machine learning and AI collection. A related algorithm is one that computes the salience of objects in an image. Salience is the degree to which an object in the image attracts the attention of the viewer’s eye. The algorithm is called SalNet and it is based on ideas from the paper, Shallow and Deep Convolutional Networks for Saliency Prediction by Pan et. al. (see arXiv:1603.00845v1). As with the colorizer, salnet it is easy to invoke. input = { "image": "data://.algo/deeplearning/SalNet/perm/an-uploaded-image.png" } result2 = client.algo("deeplearning/SalNet/0.2.0").pipe(input).result  Note that in this case we have loaded the image from one that we uploaded to Algorithmia’s data cloud. In fact, it is the same grayscale image of red square. As you can see below, the algorithm picks out the woman in the foreground and also notices the church in the background. Salience computation can be very helpful in identifying and labeling objects in an image. Image tagging is also something that Algorithmia supports. Running the same image through their tagger returned the observations that the image was “safe” and that there were multiple boys and multiple girls and sky and clouds and it seem to be near a palace. There are many other AI related image algorithms such as nudity detection, character recognition, face detection and a very impressive car make and model recognition algorithm. A quick look at https://algorithmia.com/use-cases will show many other fascinating use cases. Another very cool capability of Algorithmia is its ability to host your trained machine learning model. Suppose you have a model you have built with MsXNet, TensorFlow, Scikit-Learns, CNTK or any of the other popular ML frameworks, you can upload your model to Algorithmia so that it can be available as a service. This is explained in here. We will explore this capability in a later post. While the main emphasis and attraction of the Algorithmia collection is machine learning and AI, there are many more algorithm categories represented there. For example, there is an excellent collection of utilities for managing data and making certain programming tasks extremely easy: such as extracting text from web pages, Wikipedia search tools, computing the timezone and elevation from lat, lon coordinates. There is also a large collection of time series analysis algorithms. These include forecasting, outlier detection, Fourier filters, auto-correlation computation and many more. # Algorithmia is cloud of microservices In an excellent talk at the 2017 Geekwire cloud summit, Oppenheimer described some key elements of Algorithmia’s architecture. In this talk he makes the critically important observation that two phases of machine learning, training and prediction, if used in production require very different execution environments. Training is often done on a dedicated system consuming many hours of compute and as much memory as is available. The result of training is a model codified as data. Prediction (also called Inference) uses the model to make predictions or inferences about a sample case. Prediction can be done on the same hardware platform that was used for the training, but if the model is to be used to make predictions concerning thousands of cases for thousands of concurrent users, one need a completely different design. Their approach to the scale problem for predictions (and for any high demand algorithm in their collection) is based on serverless microservices. They use a Kubernetes microservice foundation with algorithms deployed in Docker containers. Requests from remote client applications are load balanced across API servers who dispatch requests to container instances for the requested function. The challenge is making the latency from request to reply very low. If a container for an algorithm is already in system memory, it requires very little time to spawn a new instance on Kubernetes. Another technique they use it to dynamically load algorithms into running containers. (We don’t know the exact mechanism Algorithmia uses here, but we expect it is exploiting these facts.) They have made some very interesting optimizations. For example, if the data used in the computation is stored in one of their cloud regions, the docker instance will be instantiated nearby. Just as important, if an algorithm invokes another algorithm they will attempt to co-locate the two containers and reduce the inter-process latency. Composability of algorithms is one of their guiding concepts. Turning your own algorithm into a microservice The process of turning your own algorithm into a microservice is remarkably simple. From the Algorithmia portal there is a “+” symbol in the upper right-hand corner. This give you a dialog box to fill out. You provide a name of your algorithm, the programming language you are using (from a long list .. but sorry, no Fortran or Julia but there are lots of alternatives), and several other choices including: your source license policy, does your function invoke other Algorithmia functions, does your function invoke things on the open internet? Answering these questions causes Algorithmia to create a nice GitHub repo for your function. Your next step is to install the Algorithmia command line interface and then you can clone your functions GitHub repo. Once you have done that you can edit the function so that it does what you want. The basic skeleton is already there for you in the “src” directory. Here is the basic skeleton in Python rendered as a hello world function. import Algorithmia # API calls will begin at the apply() method, # with the request body passed as 'input' # For more details, see algorithmia.com/developers/algorithm- # development/languages def apply(input): return "hello {}".format(input)  You can edit the function directly from an editor built into the Algorithmia portal or, now that you have a clone of the repo you can use your own tools to transform this skeleton into your algorithm. If you have done this work on your clone you need to use the Github commands to push your code back to the master. We tried this with a small experiment. We built a function called KeyPhrases that takes English language text as input and breaks it down into subjects (s), actions (a) which are like verb clauses and objects (o). The algorithm is not very useful or sophisticated. In fact, it uses another Algorithmia microservice called Parsey McParseface which was originally released by Goolge (see https://arxiv.org/ pdf/1603.06042v1.pdf) . This is truly a deep parser that build a very sophisticated tree. For example the figure below illustrates the tree for a pars of the sentence Einstein’s general theory of relativity explains gravity in terms of the curvature of spacetime. Parsey McParseface tree output. Our function KeyPhrases walks the tree and groups the terms, subjects(s), objects(o) and actions (a) and returns a JSON document with the original string and the list of phrases. It also breaks out separate subphrases with “/” marks. In this case it returns {"phrases":[ "s: Einstein's general theory /of relativity ", "a: explains ", "s: gravity /in terms /of the curvature /of spacetime. " ], "text":"Einstein's general theory of relativity explains gravity in terms of the curvature of spacetime." }  A more complex example is Facebook Incs chief security officer warned that the fake news problem is more complicated to solve than the public thinks. The phrase output is ['s: Facebook Incs chief security officer ', 'a: warned ', 'o: that the fake news problem ', 'a: is more ', 'o: complicated and dangerous /to solve /than the public thinks ']  This is clearly not as rich in detail as the Parsey output, but it does extract some useful key phrases. To complete the creation of the microservice for this algorithm one need only issue the git commands $ git add src/KeyPhrases.py
$git commit -m "added src mods"$ git push origin master

The last push causes a compile step to happen and the microservice is now created.   Algorithmia also provides an easy template to add documentation and instructions about how to invoke your function.  From the Algorithmia editor there is a function that allows you to “publish” your algorithm.   After pushing that button, the KeyPhrase example was put in their library.   You can see it here: https://algorithmia.com/algorithms/dbgannon/KeyPhrases (If you use it, remember it has not been tested very well, so it may break.)

# Algorithmia as an enterprise platform

The Algorithmia serverless microservice platform is robust enough that they offer it as an enterprise product.   This allows enterprises to host their own version on one of the public clouds or on their own clusters or across multiple cloud in a hybrid system.    This allows their own internally used algorithm to be hosted and invoked by their in-house analytics tools and pipelines in a totally scalable way.   This enterprise version comes with a management dashboard and monitoring tools.

# Conclusions

Algorithmia is a fascinating company with very interesting products.   It is extremely easy to sign up for a free account and it is fun to use.   The team was extremely helpful when we had questions.  A Jupyter Notebook with some of the examples mentioned above will be posted very soon.   We found experimenting with the various algorithms from an interactive notebook was a pleasure.   Creating the hosted version of the KeyPhrases algorithm took less than an hour after the original python code was debugged.   In our next experiment we will explore hosting deep learning models with Algorithmia.

# Cloud-Native Applications – call for papers

The IEEE Cloud Computing Journal is going to publish a special issue on the topic of Cloud-Native applications. This is an extremely interesting topic and it cuts to the heart of what makes the cloud a platform that is, in many ways, fundamentally different from what we have seen before.

What is “cloud-native”? That is what we want you to tell us. Roger Barga from Amazon, Neel Sundaresan from Microsoft and this blogger have been invited to be guest editors. But we want you to tell us soon. The deadline is March 1, 2017. The papers do not need to be long (3,000 to 5,000 words) and some of the topics possible include:

• Frameworks to make it easier for industry to build cloud-native applications;
• Educational approaches and community based organizations that can promote cloud-native design concepts;
• The role of open source for building cloud-native applications;
• VM and container orchestration systems for managing cloud-native designs;
• Efficient mechanisms to make legacy applications cloud-native;
• Comparing applications – one cloud-native and the other not – in terms of performance, security, reliability, maintainability, scalability, etc.;

# A Brief Look at Google’s Cloud Datalab

Google recently released a beta version of a new tool for data analysis using the cloud called Datalab.  In the following paragraphs we take a brief look at it through some very simple examples.  While there are many nice features of Datalab, the easiest way to describe it would be to say that it is a nice integration of the IPython Jupyter notebook system with Google’s BigQuery data warehouse.  It also integrates standard IPython libraries such as graphics and scikit-learn and Google’s own machine learning toolkit TensorFlow.

The Google public data sets that are hosted in the BigQuery warehouse are fun to explore.  They include

• The names on all US social security cards for births after 1879.  (The table rows contain only the year of birth, state, first name, gender and number as long as it is greater than 5.  No social security numbers.),
• The New York City Taxi trips from 2009 to 2015,
• All stories and comments from “Hacker News”,
• The US Dept of Health weekly records of diseases reported from each city and state from 1888 to 2013,
• The public data from the HathiTrust and the Internet Book Archive,
• The global summary of the day’s (GSOD) weather from the national oceanographic and atmospheric administration from 9000 weather stations between 1929 and 2016.

And more, including the 1000 genome database.

To run Datalab on your laptop you need to have Docker installed.   Once Docker is running then and you have created a Google cloud account and created a project, you can launch Datalab with simple docker command as illustrated in their quick-start guide.  When the container is up and running you can view it at http://localhost:8081.  What you see at first is shown in Figure 1.  Keep in mind that this is beta release software so you can expect it will change or go away completely.

Figure 1.  Datalab Top level view.

Notice the icon in the upper right corner consisting of a box with an arrow.   Clicking this allows you to login to the Google cloud and effectively giving your authorization to allow you container to run on your gcloud account.

The view you see is the initial notebook hierarchy.   Inside docs is a directory called notebooks that contain many great tutorials and samples.

## A Few Simple Examples of Using Datalab

As mentioned above, one of the public data collections is the list of first names from social security registrations.   Using Datalab we can look at a sample of this data by using one of the built-in Bigquery functions as shown in Figure 2.

Figure 2.   Sampling the names data.

In modern America there is a movement to “post-gender” names.   Typical examples cited on the web are “Dakota”, “Skyler” and  “Tatum”.   A very simple SQL query can be formulated to see how the gender breakdown for these names show up in the data.  In Datalab, we can formulate the query as shown in Figure 3.

Figure 3.   Breakdown by gender of three “post-gender” names.

As we can see, this is very nearly gender balanced.  A closer inspection using each of the three names separately show that “Skyler” tends to be ‘F’ and “Tatum” tends to ‘M’. On the other hand, “Dakota” does seem to be truly post-gender with 1052 ‘F’ and 1200 ‘M’ occurrences.

We can also consider the name “Billy” which, in the US, is almost gender neutral.   (Billy Mitchel was a famous World Work I general and also a contemporary Jazz musician.  Both male. And Billy Tipton and Billy Halliday were female musicians though Billy Halliday was actually named Billie and Billy Tipton lived her life as a man, so perhaps they don’t count.   We can ask how often Billy was used as a name associated with gender ‘F’ in the database?  It turns out it is most common in the southern US. We can then group these by state and create a count and show the top five.   The SQL command is easily inserted into the Datalab note book as shown in Figure 4.

Figure 4.   Search for Billy with gender ‘F’ and count and rank by state of birth.

## Rubella in Washington and Indiana

A more interesting data collection is Center for Disease Control and Prevention dataset concerning diseases reported by state and city over a long period.   An interesting case is Rubella, which is virus also known as the “German measles”.   Through our vaccination programs it has been eliminated in the U.S. except for those people who catch it in other countries where it still exists.  But in the 1960s it was a major problem with an estimated 12 million cases in the US and a significant number of newborn deaths and birth defects.   The vaccine was introduced in 1969 and by 1975 the disease was almost gone.   The SQL script shown below is a slight modified version of one from the Google Bigquery example.   It has been modified to look for occurrences of Rubella in two states, Washington and Indiana, over the years 1970 and 1971.

%%sql --module rubella
SELECT
*
FROM (
SELECT
*, MIN(z___rank) OVER (PARTITION BY cdc_reports_epi_week) AS z___min_rank
FROM (
SELECT
*, RANK() OVER (PARTITION BY cdc_reports_state ORDER BY cdc_reports_epi_week ) AS z___rank
FROM (
SELECT
cdc_reports.epi_week AS cdc_reports_epi_week,
cdc_reports.state AS cdc_reports_state,
COALESCE(CAST(SUM((FLOAT(cdc_reports.cases))) AS FLOAT),0)
AS cdc_reports_total_cases
FROM
[lookerdata:cdc.project_tycho_reports] AS cdc_reports
WHERE
(cdc_reports.disease = 'RUBELLA')
AND (FLOOR(cdc_reports.epi_week/100) = 1970
OR FLOOR(cdc_reports.epi_week/100) = 1971)
AND (cdc_reports.state = 'IN'
OR cdc_reports.state = 'WA')
GROUP EACH BY
1, 2) ww ) aa ) xx
WHERE
z___min_rank <= 500
LIMIT
30000


We can now invoke this query as part of a python statement so we can capture its result as a pandas data frame and pull apart the time stamp fields and data values.

rubel = bq.Query(rubella).to_dataframe()
rubelIN = rubel[rubel['cdc_reports_state']=='IN']
.sort_values(by=['cdc_reports_epi_week'])
rubelWA = rubel[rubel['cdc_reports_state']=='WA']
.sort_values(by=['cdc_reports_epi_week'])
epiweekIN = rubelIN['cdc_reports_epi_week']
epiweekWA = rubelWA['cdc_reports_epi_week']
rubelINval = rubelIN['cdc_reports_total_cases']
rubelWAval = rubelWA['cdc_reports_total_cases']


At this point a small adjustment must be made to the time stamps.  The CDC reports times in epidemic weeks and there are 52 weeks in a year.    So the time stamps for the first week of 1970 is 197000 and the time stamp for the last week is 197051.  The next week is 197100.  To make these into timestamps that appear contiguous we need to make a small “time compression” as follows.

realweekI = np.empty([len(epiweekIN)])
realweekI[:] = epiweekIN[:]-197000
realweekI[51:] = realweekI[51:]-48


Doing the same thing with epiweekWA we now have the basis of something we can graph.  Figure 5 shows the progress of rubella in Washington and Indiana over two years.  Washington is the red line and Indiana is blue.   Note that the outbreaks occur about the same time in both states and that by late 1971 the disease is nearly gone.

Figure 5.   Progress of Rubella in Washington (red) and Indiana (blue) from 1970 through 1971.

Continuing the plot over 1972 and 1973 show there are flare-ups of the disease each year but their maximum size is diminishes rapidly.

(Datalab has some very nice plotting functions, but we could not figure out how to do a double plot, so we used the mathplot library with the “fivethirtheight” format.)

## A Look at the Weather

From the national oceanographic and atmospheric administration we have the global summary of the day’s (GSOD) weather from the from 9000 weather stations between 1929 and 2016.   While not all of these stations were operating during that entire period, there is still a wealth of weather data here.   To illustrate it, we can use another variation on one of Google’s examples.  Let’s find the hottest spots in the state of Washington for 2015.   This was a particularly warm year that brought unusual droughts and fires to the state. The following query will list the hottest spots in the state for the year.

%%sql
SELECT
max, (max-32)*5/9 celsius, mo, da, state, stn, name
FROM (
SELECT
max, mo, da, state, stn, name
FROM
[bigquery-public-data:noaa_gsod.gsod2015] a
JOIN
[bigquery-public-data:noaa_gsod.stations] b
ON
a.stn=b.usaf
AND a.wban=b.wban
WHERE
state="WA"
AND max

The data set ‘gsod2015’ is the table of data for the year 2015.  To get a list that also shows the name of the station we need to do a join with the ‘station’ table over the corresponding station identifiers.  We order the results descending from the warmest recordings.    The resulting table is shown in Figure 6 for the top 10.

Figure 6.   The top 10 hottest spots in Washington State for 2015

The results are what we would expect.   Walla Walla, Moses Lake and Tri Cities are in the eastern part of the state and summer was very hot there in 2015.   But  Skagit RGNL is in the Skagit Valley near Puget Sound.   Why is it 111 degrees F there in September?   If it is hot there what was the weather like in the nearby locations?   To find out which stations were nearby we can look at the stations on a map.   The query is simple but it took some trial and error.

%%sql --module stationsx
DEFINE QUERY locations
SELECT FLOAT(lat/1000.0) AS lat, FLOAT(lon/1000.0) as lon, name
FROM [bigquery-public-data:noaa_gsod.stations]
WHERE state="WA" AND name != "SPOKANE NEXRAD"


It seems that the latitude and longitude for the Spokane NEXRAD station are incorrect and resolve to some point in Mongolia.  By removing it we get a good picture of the nearby stations as shown in Figure 7.

Figure 7.   Location of weather stations in western Washington using the Bigquery chart map function.

This is an interactive map, so we can get the names of the nearby stations.   There is one only a few miles away  called PADILLA BAY RESERVE and the next closest is BELLINGHAM INTL.   We can now compare the weather for 2015 at these three locations.

To get the weather for each of these we need the station ID.   We can do that with a simple query.

%%sql
SELECT
usaf, name
FROM [bigquery-public-data:noaa_gsod.stations]
WHERE
name="BELLINGHAM INTL" OR name="PADILLA BAY RESERVE" OR name = "SKAGIT RGNL"


Once we have our three station IDs we can use the follow to build a parameterized Bigquery expression.

qry = "SELECT max AS temperature, \
TIMESTAMP(STRING(year) + '-' + STRING(mo) + \
'-' + STRING(da)) AS timestamp \
FROM [bigquery-public-data:noaa_gsod.gsod2015] \
WHERE stn = '%s' and max /< 500 \
ORDER BY year DESC, mo DESC, da DESC"

stationlist = ['720272','727930', '727976']

dflist = [bq.Query(qry % station).to_dataframe() for station in stationlist]



We can now render an image of the weather for our three stations as shown in Figure 8.

Figure  8.  Max daily temperatures for Skagit (blue), Padilla Bay (red) and Bellingham (yellow)

We can clearly see the anomaly for Skagit in September and it is also easy to spot another problem in March where the instruments seemed to be not recording.   Other than that there is close alignment of the readings.

## Conclusions

There are many features of Datalab that we have not demonstrated here.   The documentation gives an example of using Datalab with Tensorflow and the charting capabilities are more extensive than demonstrated here.  (The Google maps example here was not reproducible in any other notebook beyond the demo in the samples which we modified to run the code here.)  It is also easy to upload your own data to the warehouse and analyze it with Datalab.

Using Datalab is almost addictive.  For every one of the data collections we demonstrated here there were many more questions we wanted to explore.  For example, where and when did the name “Dakota” start being used and how did its use spread?   Did the occurrence of Rubella outbreaks correspond to specific weather events?  Can we automate the process of detecting non-functioning weather instruments over the years where records exist?  These are all relatively standard data mining tasks, but the combination of Bigquery and IPython in the notebook format makes it fun.

It should be noted that Datalab is certainly not the first use of the IPython notebook as a front-end to cloud hosted analysis tools.  The IPython notebook has been used frequently with Spark as we have previously described.  Those interested in an excellent overview of data science using Python should look at “Python Data Science Handbook”  by Jake VanderPlas which makes extensive use of IPython notebooks.     There are a variety of articles about using Jupyter on AWS  and Azure for data analytics.  A good one is by Cathy Ye about deep learning using Jupyter in the cloud where she gives detailed instruction for how to install Jupyter on AWS and deploy Caffe there.

.

# Kubernetes and the Google Cloud Container Service: Fun with Pods of Celery.

In a previous post I talked about using Mesosphere on Azure for scaling up many-tasks parallel jobs and I promised to return to Kubernetes when I figured out how to bring it up.   Google just made it all very simple with their new Google Cloud container services.   And, thanks to their good tutorials, I learned about a very elegant way to do remote procedure calls using another open source tool called Celery.

So let me set the stage with a variation on an example I have used in the past.   Suppose we have 10000 scientific documents that are stored in the cloud.   I would like to use a simple machine learning method to classify each of these by topic.    I would like to do this quickly as possible and, because the analysis of each document is independent of the others, I can try to process as many as possible in parallel.  This is the basic “many task” parallel model and one of the most common uses of the cloud for scientific computing purposes.      To do this we will use the Celery distributed task queue mechanism to take a list of our documents and send each one to a work queue where the tasks will be parceled out to workers who will do the analysis and respond.

## The Google Cloud Container Service and a few words about Kubernetes.

Figure 1.   Creating a Google cloud cluster and connecting the cloud shell to it.

Interacting with Kubernetes, which is now running on our small cluster, is through command lines which can be entered into the cloud shell.   Kubernetes has a different, and somewhat more interesting architectures than other container management tools.   The basic unit of scheduling in Kubernetes is  launching pods. A pod consists of a set of one or more Docker-style containers together with a set of resources that are shared by the containers in that pod.  When launched a pod resides on a single server or VM.   This has several advantages for the containers in that pod.   For example, because the containers in a pod are all running  on the same VM, the all share the same IP and port space so the containers can find each other through conventional means like “localhost”.   They can also share storage volumes that are local to the pod.

To start let’s consider a simple single container pod to run the Jupiter notebook.  There is a standard Docker container that contains Jupyter and the scipy software stack.  Using the Kubernetes control command kubectl we can launch Jupyter and expose its port 8888 with the following statement.

$kubectl run jupyter --image=jupyter/scipy-notebook --port=8888  To see that it is up and running we can issue the command “kubectl get pods” which will return the status of all of our running pods. Though we have launched jupyter it is still not truly visible. To do that we will associate a load balancer with the pod. This will expose the port 888 to the open Internet. $ kubectl expose deployment jupyter --type=LoadBalancer


Once that has been run you can get the IP address for jupyter from the “LoadBalancer Ingress:” field of the service description when you run the following.   If it doesn’t appear, try again.

$kubectl describe services jupyter  One you have verified that it is working at that address on port 8888, you should shut it down immediately because, as you can see, there is no security with this deployment. Deleting a deployment is easy. $ kubectl delete deployment jupyter


There is another point that one must be aware of when building containers that need to directly interact with the google cloud APIs.  To make this work you will need to get application default credentials to run in your container.   For example if you container is going to interact with the storage services you will need this.   To get the default application credentials follow the instructions here.  We will say a few more words about this below.

## The Analysis Example in detail.

Now to describe  Celery and how to use Celery and Kubernetes in the many-task scenario described above.

To use Celery we start with our analysis program.   We have previously described the analysis algorithms in detail in another post, so we won’t duplicate that here.  Let’s start by assuming we have a function predict(doc) that takes a document as a string as an argument and returns a string containing the result from our trained machine learnging classifiers.  Our categories are “Physics”, “Math”, “Bio”, “Computer Science” and “Finance” and the result from each classifier is simply the category that that classifier determines to be the most likely correct answer.

Celery is a distributed remote procedure call system for Python programs.   The Celery view of the world is you have a set of worker processes running on remote machines and a client process that is invoking functions that are executed on the remote machines.   The workers and the clients all coordinate through a message broker running somewhere else on the network.

Here we use a RabbitMQ service that is running on a Linux VM on the NSF JetStream cloud as illustrated in Figure 2.

Figure 2.  Experimental Configuration with Celery workers running on Kubernetes in the Google Cloud Container Service,  the RabbitMQ broker running in a VM on the NSF Jetstream Cloud and a client program running as a notebook on a laptop.

The code block below illustrates the basic Celery worker template.    Celery is initialized with a constructor that takes the name of the project and a link to the broker service which can be something like a Redis cache or MongoDB.   The main Celery magic is invoked with a special Python “decorator” associated with the Celery object as shown in the predictor.py file below.

from celery import Celery
app = Celery('predictor', backend='amqp')

#Now initialize and load all the data structure that will be constant
#and recused for each analysis.  In our case this will include
#all the machine learning models that were trained on the data
#previously. And create a main worker function to invoke the models.
def invokeMLModels(statement):
....
return analysis

#define the functions we will call remotely here
def predict(statement):
prediction = invokeMLModels(statement)
return [prediction]


What this decorator accomplishes is to wrap the function in a manner that it can be invoked by a remote client.   To make this work we need create a Celery worker from our predictor.py file with the command below which registers a worker instance as a listener on the RabbitMQ queue.

>celery worker -A predictor -b 'amqp://guest@brokerIPaddr'


Creating a client program for our worker is very simple.   It is similar to the worker template except that our version of the predict( ) function does nothing because we are going to invoke it with the special Celery apply_async( ) method that will push the argument to the broker queue and return control immediately to the client.   The object that is returned from this call is similar to what is sometimes called a “future” or a “promise” in the programming language literature.  What it is a placeholder for the returned value.   Once we attempt to evaluate the get() method on this object our client will wait until a reply is returned from the remote worker that picked up the task.

from celery import Celery

def predict(statement):
return ["stub call"]

res = predict.apply_async(["this is a science document ..."])

print res.get()


Now if we have 10000 documents to analyze we can send them in sequence to the queue as follows.

#load all the science abstracts into a list

res = []
for doc in documents:
res.append(predict.apply_async([doc])

#now wait for them all to be done
predictions = [result.get() for result in res]
#now do an analysis of the predictions


Here we push each analysis task into the queue and save the async returned objects in a list.  Then we create a new list by waiting for each prediction value to be returned.   Our client can run anywhere there is Internet access.  For example this one was debugged on a Jupyter instances running on a laptop.  All you need to do is “pip install celery” and run Juypter.

There is much more to say about Celery and the interested reader should look at the Celery Project site for the definitive guide. Let us now turn to using this with Kubernetes.  We must first create a container to hold the analysis code and all the model data.   For that we will need a Docker file and a shell script to correctly launch celery one the container is deployed.   For those actually interested in trying this, all the files and data are in OneDrive here.   The Docker file shown below has more than we need for this experiment.

# Version 0.1.0
FROM ipython/scipystack
MAINTAINER yourdockername "youremail"
RUN easy_install celery
RUN pip install -U Sphinx
RUN pip install Gcloud
RUN easy_install pattern
RUN easy_install nltk
RUN easy_install gensim
COPY bookproject-key.json /
COPY models /
COPY config /
COPY sciml_data_arxiv.p /
COPY predictor.py /
COPY script.sh /
ENTRYPOINT ["bash", "/script.sh"]


To build the image we first put all the machine learning configuration files in a directory called config and all the learned model files in a directory called models.  At same level we have the predictor.py source.   For reasons we will explain later we will also include the full test data set: sciml_data_arxiv.p. The Docker build starts with the ipython/scipystack container.   We then use easy_install to install Celery as well as four packages used by the ML analyzers: pattern, nltk and gensim.   Though we are not going to use the Gcloud APIs here, we include them with a pip install.   But to make that work we need an updated copy of Sphinx.  To make the APIs work we would need our default client authentication keys.   They are stored in a json file called bookproject-key.json that was obtained from the Gcloud portal as described previously.   Finally we copy all of the files and directory to the root path ‘/’.  Note that the copy from a directory is a copy of the all contained files to the path ‘/’ and not to a new directory.   The ENTRYPOINT runs our script which is shown below.

cp /predictor.py .
export C_FORCE_ROOT='true'
echo $C_FORCE_ROOT celery worker -A predictor -b$1


Bash will run our script in a temp directory, so we need to copy our predictor.py file to that directory.  Because our bash is running as root, we need to convince Celery that it is o.k. to do that.  Hence we export C_FORCE_ROOT as true.  Next, if we were using the Gcloud APIs we need to export the application credentials.   Finally we invoke celery but this time we use the -b flag to indicate that we are going to provide the IP address of the RabbitMQ amqp broker as a parameter and we remove it from the explicit reference in the predictor.py file.  When run the predictor file will look for all the model and configuration data in ‘/’.   We can now build the docker image with the command

>docker build -t “yourdockername/predictor” .


And we can test the container on our laptop with

>docker run -i -t “yourdockername/predictor ‘amqp://guest@rabbitserverIP’


Using “-i -t” allows you to see any error output from the container.   Once it seems to be working we can now push the image to the docker hub.   (to use our version directly, just pull dbgannon/predictor)

We can now return to our Google cloud shell and pull a version of the container there.   If we want to launch the predictor container on the cluster, we can do it one at a time with the “kubectl run” command.  However Kubernetes has a better way to do this using a pod configuration file where we can specify the number of pod instances we want to create.  In the file below, which we will call predict-job.json we specify a job name, the container image in the docker hub,   and the  parameter to pass to the container to pass to the shell script.   We also specify the number of pods to create.   In this case that is 6 as identified in the “parallelism” parameter.

apiVersion: batch/v1
kind: Job
name:predict-job
spec:
parallelism: 6
template:
name: job-wq
spec:
containers:
- name: c
image: dbgannon/predictor
restartPolicy: OnFailure


One command in the cloud shell will now launch six pods each running our predictor container.

\$kubectl create -f predict-job.json


## Some Basic Performance Observations.

When using many tasks system based on a distributed worker model there are always three primary questions about the performance of the system.

1. What is the impact of wide-area distribution on the performance?
2. How does performance scale with the number of worker containers that are deployed? More specifically, if we N workers, how does the system speed up as N increases?    Is there a point of diminishing return?
3. Is there a significant per/task overhead that the system imposes?  In other words, If the total workload is T and if it is possible to divide that workload into k tasks each  of size T/k , then what is the best value of k that will maximize performance?

Measuring the behavior of a Celery application as a function of the number workers is complicated by a number of factors.   The first concern we had was the impact of widely distributing the computing resources on the overall performance.   Our message broker (RabbitMQ) was running in a virtual machine in Indiana on the JetStream cloud.   Our client was running a Jupyter notebook on a laptop and the workers were primarily on the Midwest Google datacenter and on a few on other machines in the lab.   We compared this to a deployment where all the workers, the message broker and the client notebook were all running together on the Google datacenter.   Much to our surprise there was little difference in performance between the two deployments.   There are two ways to view this result.   One way is to say that the overhead of wide area distribution was not significant.   The other way to say this is that the overhead of wide area distribution was negligible compared to other performance problems.

A second factor that has an impact on performance as a function of the number of workers is the fact that a single Celery worker may have multiple threads that are responding to asynchronous function calls.   While we monitored the execution we noticed that the number of active threads in one worker could change over time. This made performance somewhat erratic.  Celery’s policy is that it will never have more threads than the number of available cores, so to limit the thread variability we ran workers in container pods on VMs with only one core.

Concerning the question of the granularity of the work partitioning we configured the program so that a number of documents could be processed in one invocation and this number could be set remotely.    By taking a set of 1000 documents and a fixed set of workers, we divided the document set into blocks of size K where K ranged from 1 to 100.   In general, larger blocks were better because the number of Celery invocations was smaller, but the difference was not great.   Another factor involved Celery’s scheduling for deciding which worker get the next invocation.   For large blocks this was not the most efficient because this left holes in the execution schedule when workers were occasionally idle while another was over scheduled.   For very small blocks these holes tended to be small.   We found that a value of K=2 gave reasonably consistent performance.

Finally to test scalability we used three different programs.

1. The document topic predictor described above where each invocation classified two documents.
2. A simple worker program that does no computation but just sleeps for 10 seconds before returning a “hello world” string.
3. A worker that computes part of the Euler sequence sum(1/i**2, i=1..n) where n = 109 .   Each worker computes a block of 107 terms of the sequence and the 100 partial results are added together to get the final result  (which approximates pi2/6  to about 7 decimal places).

The document predictor is very computational intensive and uses some rather large data matrices for the trained machine learning models.   The size of these arrays are about 150 megabytes total.  While this does fit in memory, the computation is going to involve a great deal of processor cache flushing and there may be memory paging effects.   The example that computes the Euler sum requires no data other than the starting point index and the size of the block to sum.   It is pure computation and it will have no cache flushing or memory paging effects, but it will keep the CPU very busy.   The “sleep” example leaves the memory and the CPU completely idle.

We ran all three with one to seven workers.   (6 workers using 6 cores from the small Google demo account and one on another other remote machine).    To compare the results, we computed the time for each program on one worker and plotted the speed-up ratio for 2, 3, 4, 5, 6 and 7 workers.   The results are shown in the graphs below.

Figure 3.  Performance as speed-up for each of the three applications with up to 7 workers.

As can be seen, the sleeper scales linearly in the number of workers.   In fact, when executed on multi-core machines it is almost super-linear because of the extra threads that can be used.  (It is very easy for a large number of threads to sleep in parallel.)   On the other hand, the predictor and the Euler examples reached a maximum speed up with around five workers.  Adding more worker pods to the servers did not show improvement because these applications are already very compute intensive.    This was a surprise as we expect all three experiments to scale well beyond seven workers.  Adding more worker pods to the servers did not show improvement because these applications are already very compute intensive.   When looking for the cause of this limited performance, we considered the possibility that the RabbitMQ broker was a bottleneck, but our previous experience with it has allowed us to scale applications to dozens of concurrent reader and writers.   We are also convinced that the Google Container Engine performed extremely well and it was not the source of any of these performance limitations. We suspect (but could not prove) that the Celery work distribution and result gathering mechanisms have overheads that limit scalability as the number of available workers grows.

## Conclusion

Google has made it very easy to deploy containerized applications using Kubernetes on their cloud container service.  Kubernetes has some excellent architectural features that allow multiple containers to be co-located on a single server within a pod.   We did not have time here to demonstrate this, but their documentation gives some excellent examples.

Celery is an extremely elegant way to do remote procedure calls in Python.  One only needs to define the function and annotate it with a Celery object.   It can then be remotely invoked with an asynchronous call that returns control to the caller.  A future like object is returned.  By calling a special method on the returned object the caller will pause until the remote call completes and the value is provided to the caller.

Our experiments demonstrated that Celery has limited scalability if it is used without modification and with the RabbitMQ message broker.   However, celery has many parameters and it may be possible that the right combinations will improve our results.  We will report any improved results we discover in a later version of this document.