We live in a world of data. It is generated by the sensors, cameras and other instruments that are embedded in our environment. It streams out of our large-scale physics and astronomy experiments. We are sequencing genomes in laboratories and hospitals at an accelerating rate and, as our scientific simulations have scaled up, so has the magnitude of the data they generate. However, the best-known source of data is the vast collection that comes from our use of the Internet.
Analyzing this Internet data has been the overwhelming preoccupation of companies like Google, Amazon, Yahoo!, Microsoft and dozens more. These companies built massive data centers that were initially used to construct search indexes, but to improve search results, they had to do deep analysis of user intent. This activity led to breakthroughs in machine learning and big data analytics. Machine translation of natural language, automatic object recognition in images and text classification have all been fundamentally transformed by this work. On the software side, the Hadoop MapReduce system was released by Yahoo! and has grown into a rich stack of powerful tools designed to support data analysis in cloud data centers.
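To make the MapReduce programming model concrete, here is a minimal sketch of the map-and-reduce pattern that Hadoop popularized, written in plain Python rather than Hadoop's Java API. The word-count task and the input documents are illustrative assumptions, not anything from a real Hadoop deployment:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    # In Hadoop this step runs in parallel across many machines.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical toy input standing in for a large document collection.
docs = ["the cloud stores data", "the cloud analyzes data"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts)
```

The appeal of the model is that the programmer writes only these two small functions; the framework handles distributing the map tasks, moving intermediate pairs across the network, and restarting failed workers.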
At the same time that the Internet industry was cranking out data tools appropriate for public clouds, the science community was scaling up its own data challenges. The LHC began generating data, and a massive global network of data reduction and analysis tools had to be deployed to study the results. Other areas of science began building systems to analyze and visualize the results of supercomputer simulations.
It is clear that the Internet industry's investment in scalable data analytics and machine learning tools is of value to the science community. In fact, the open source nature of many of these tools has enabled many in the academic computer science community to make substantial improvements in the underlying algorithms. Unfortunately, most academic researchers do not have access to the scale of parallel computing resources the cloud companies have at hand. Most make do with small clusters running Hadoop. We can do better.
There is a real opportunity to leverage the NSF’s and DOE’s investment in our national supercomputers and the availability of public cloud resources to create a new generation of data analysis services. It may now be possible to use a hybrid approach that combines the cloud’s superior scalable I/O with the high performance of an HPC system to make another major leap in capability. To accomplish this goal we must explore both the algorithmic challenges of parallelizing the critical data analytics and ML algorithms and the systems issues, such as how to rapidly stage data from cloud storage to HPC systems and how to build end-user tools. Among the questions that must be addressed are:
- Where does the standard open source stack need to be extended to make advanced parallel data analysis tools available?
- How can we build software services that allow users to easily access advanced tools for data analytics?
- Where is there a need for new algorithmic research?
- Are there unique data analysis challenges in scientific disciplines that are not encountered elsewhere?
Future posts here will explore these issues.