Data and Code Available for Science Doc ML and microservice experiment

The data and code used in the azure python streaming data experiments that were described in the blog on microservice performance architecture and Processing Scholarly Event Streams in the Cloud. The code has been rewritten to be free of all azure dependencies with the exception of the use of Azure Tables for the final storage from the table web service. It is certainly possible to rewrite this to use another database.

There are four type of data

  1. The arxiv configuration files. They take the form of config_name.json where name can be all4 (the top level), bio (the arxiv q-bio objects), compsci (computer science), math, phy (Physics), finance (finance).
  2. The machine learning model files (as generated by doc_analysis_final.py described below)
  3. The raw daa from the streams. There are three of these. The sciml_data_arxiv is the original data set from arxiv. sciml_data_arxiv_new_9_28_15 is a recent snapshot of arxiv data not used in the training a portion of this was used for the training set. The sciml_data_scimags is the rss data from the various science mags.
  4. The output of the main (top level) classifer. This was used to push events directly to the message broker for use in the performance analysis. This takes the form dump_all_subtopic_name where name is one of q-bio, Physics, compsci, math or physics. (note these are not the same as the names on the config files.)

The data is stored in two places.

  1. The configuration data, the rss feed input data and model data is stored on a publicly readable oneDrive site. The url for this is http://1drv.ms/1PCOT8l (cut this link and paste it into your browser.)
  2. The oneDrive files are ok for download from the browser, but not as efficient for program level access. So the programs here read the files from an public, read-only account “http://esciencegroup.blob.core.windows.net/scimlpublic” The code for reading the files is included in the source codes.

The code is now on gitHub at https://github.com/dbgannon/sciml

it is not the prettiest code.   wouldn’t pass a code review at MS, but i am moving on to something else now.