Storage in the Cloud
Software-defined storage is a major component of any cloud architecture. Unfortunately, it is not as well defined as SDN, but there has been some progress and API convergence. Public cloud providers have always offered several standard forms of storage. The first was the S3 object store from Amazon. Others that followed include Azure Blob storage, Ceph, and the Swift object store. Object stores differ from POSIX-style file storage in that they store binary objects together with some amount of metadata. They have no file system hierarchy, but they usually let objects be grouped into buckets or containers. The APIs are RESTful HTTP interfaces with simple read and write protocols. For example, an object in the Microsoft Azure object store is accessed with an HTTP request to a URL that names the storage account, the container, and the object.
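As a rough sketch of this addressing scheme, the Python below builds the URL for one object from a storage account, a container, and an object name. The URL pattern follows the public endpoint convention of Azure Blob storage. The account, container, and object names are made up, and the helper blob_url is a hypothetical function, not part of any SDK. A real request would also carry an authorization header or a signed URL, which is omitted here.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob: str) -> str:
    """Build the HTTPS URL that addresses one blob in a storage account.

    Hypothetical helper for illustration; not part of the Azure SDK.
    """
    # Blob names may contain '/' to mimic folders, so keep '/' unescaped
    # and percent-encode everything else that is unsafe in a URL path.
    path = quote(blob, safe="/")
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


# Hypothetical account, container, and blob names.
url = blob_url("myaccount", "images", "2014/cat photo.png")
print(url)
```

An HTTP GET on such a URL reads the object and an HTTP PUT writes it, which is what makes the interface easy to use from any language with an HTTP client.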
Typically, object store systems issue a security key for each storage account, and that key must be presented to access the account. Each cloud provider also supplies bindings for standard programming languages, along with web and desktop client programs for managing the user’s storage accounts.
An important feature of public cloud storage systems is reliability. This is accomplished through distributed replication, where the user can usually choose the geolocation of the data. For example, one can specify that the data must not leave a certain geographic region or that worldwide distribution of the data is permitted. Systems typically create at least three replicas, but there has been progress on some very efficient erasure-code recovery schemes, which rebuild lost data from parity information instead of from full copies. Most replicated object store systems use an eventual consistency protocol: replicas may briefly disagree after a write but converge over time.
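The trade-off between replication and erasure coding can be illustrated with a toy example. The sketch below uses a single XOR parity block, which is the simplest possible erasure code and tolerates the loss of only one block; it is a stand-in chosen for brevity, not the scheme any particular provider uses, and production systems rely on stronger codes. The function names and the sample data are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def storage_overhead(data_blocks: int, redundant_blocks: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_blocks + redundant_blocks) / data_blocks


# Three equal-length data blocks plus one parity block (toy data).
data = [b"obj-part-A", b"obj-part-B", b"obj-part-C"]
parity = xor_blocks(data)

# Simulate losing the middle block and rebuilding it from the survivors.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]

# Three full replicas: one original plus two extra copies.
print(storage_overhead(1, 2))
# Three data blocks protected by one parity block.
print(storage_overhead(3, 1))
```

Three-way replication stores three bytes for every byte of user data, while this parity layout stores about 1.33. The price is that recovery must read several surviving blocks, and this toy code survives only a single loss where three replicas survive two.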
In addition to blob object storage, public cloud vendors typically provide other storage systems. For example, Amazon provides Elastic Block Store (EBS), which allows one to attach a block volume (for example, 1 TB) to a VM. Microsoft Azure provides a structured, distributed table storage system. Google Cloud Datastore is a schemaless NoSQL database for storing non-relational data. NoSQL is usually read as “Not Only SQL”: such systems are typically table-like and scalable, and some support a limited set of SQL-like queries. The full relational algebra is typically not supported, but NoSQL is valuable for parallel access to semi-structured data. (See http://en.wikipedia.org/wiki/NoSQL for a discussion of the ACID and CAP theorem issues as they relate to NoSQL databases.)
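To make "schemaless" and "table-like" concrete, here is a small in-memory sketch. It is loosely modeled on the idea, used by Azure table storage, of addressing each entity by a partition key and a row key; the class ToyTable and its methods are hypothetical and do not correspond to any vendor API. Each entity is just a dictionary of properties, so two rows in the same table need not share the same columns.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a schemaless, partitioned table (illustrative only)."""

    def __init__(self):
        # partition key -> {row key -> properties dict}
        self._partitions = defaultdict(dict)

    def insert(self, partition_key, row_key, **properties):
        """Store an entity; no schema is enforced on its properties."""
        self._partitions[partition_key][row_key] = properties

    def get(self, partition_key, row_key):
        """Point lookup by the two keys; raises KeyError if absent."""
        return self._partitions[partition_key][row_key]

    def query(self, partition_key, predicate):
        """Scan one partition and keep the entities that satisfy the predicate."""
        rows = self._partitions.get(partition_key, {})
        return [(rk, props) for rk, props in rows.items() if predicate(props)]


table = ToyTable()
table.insert("sensors-eu", "s1", temp=21.5, unit="C")
table.insert("sensors-eu", "s2", temp=30.1, unit="C", battery=0.4)  # extra column

hot = table.query("sensors-eu", lambda e: e.get("temp", 0) > 25)
print(hot)
```

Lookups by key are cheap and partitions can be scanned independently, which is what allows parallel access. Anything resembling a join across partitions has to be done by the application, which is one sense in which the full relational algebra is missing.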
Unfortunately, there is a fundamental mismatch between the storage models presented above and the needs of the analytics layer that rests on top of them. The first problem is that the analysis models require rapid parallel read and write access to the data. The data must be distributed and, because of the unreliable nature of large-scale cloud systems, writing blocks most often involves replication. A second problem is locality. The analytics frameworks often must iterate over data or, in the case of a complex workflow, one stage of computation may consume data produced by the preceding stage. Having the data cached or available in an in-memory file system is required, but if the file write phase involves replication and resolution of replicas, this can be a performance killer. In the section on Frameworks we will return to this topic and discuss storage systems like HDFS and Tachyon that address some of these issues.