Manifold Learning and Deep Autoencoders in Science

One way to think about machine learning is to view it as building a model of a system based on samples of data that are artifacts of that system.  This view does not resonate very well when the problem at hand is identifying photos containing sail boats, but it is apt when ML is applied to scientific data.

The data from scientific experiments is often takes the form of vectors in a very high dimensional space and we are looking for an underling organization of that data that reflects properties our system.   Manifold learning is based on the assumption that the system you are trying to model generates data that lies on or near a lower dimensional surface in the higher dimension coordinate space of the data.   Picture the surface of a sphere or a curve in 3-D.   If this manifold assumption about the data is true, it may be possible to “unfold’’ the surface so that a projection or other linear analysis makes the data easier to understand.

Autoencoders are deep neural networks that can be used to discover properties of the manifold of data that characterizes the system you are trying to model.  Illustrated below, autoencoders have an encoder phase and a decoder phase with a space of much lower dimension than the input in the middle.  By studying the properties of this lower dimensional space, we have a better view of the data manifold. You train an autoencoder to be the identity function.  

autoencoder

Recently, a lovely blog article A look at deep learning for science by Prabhat gave us an excellent overview of some uses of deep learning technology in science applications and several of these were applications of autoencoders.  Partly inspired by that article and by our own experiments with some very interesting images of neuron cells collected by Maryana Alegro at UCSF, we put together a little tutorial on autoencoders in science.   In the article we discuss two types of autoencoders: denoising and variational.  The variational autoencoder we examine is applied to the cell images in an effort to create a model that can be used to both classify samples as well generate new samples by following a path along the surface of the manifold. 

We have published this little study as a supplement to our chapter on machine learning in the book “Cloud Computing for Science and Engineering”.   The link to the article is here.