Cookie Monster is a tool for triaging large volumes of sequencing (and related) data by their metadata, drawn from various sources, for opportunistic/proactive downstream processing by HGI.
Sequencing pipelines generate a lot of data, and the volume is only going to increase over time. These data require further processing before being delivered to analysts, but at this expanding rate there is too much for one person (or even a team of people) to deal with efficiently. Given the time data can take to process, an untenable backlog builds up, exacerbating the problem further.
Cookie Monster is an automated system that constantly monitors data pushed into iRODS (in our implementation), via its metadata and how that metadata changes over time. A sequence of customisable rules is applied to each potential piece of data, either to further enrich its metadata from other sources (e.g., Sequencescape or, in the case of BAM/CRAM files, the file headers) or, ultimately, to decide what to do with it. That could mean disregarding the file altogether (the fate of the majority of data), pushing it back upstream for reprocessing or correction, or pushing it downstream into our own processing pipelines.
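The rule mechanism described above can be sketched roughly as follows. This is a minimal, illustrative sketch only: the names `Cookie`, `Rule`, and `process`, and the convention that an action returning `None` merely enriches metadata while a string is a final decision, are assumptions for exposition, not Cookie Monster's actual API.

```python
from typing import Callable, List, Optional

class Cookie:
    """A piece of data under consideration: a path plus its metadata.
    (Hypothetical stand-in for whatever the real system tracks.)"""
    def __init__(self, path: str, metadata: dict) -> None:
        self.path = path
        self.metadata = dict(metadata)

class Rule:
    """A customisable rule: a match predicate plus an action. By this
    sketch's convention, an action returns a decision string or None."""
    def __init__(self, matches: Callable[[Cookie], bool],
                 action: Callable[[Cookie], Optional[str]]) -> None:
        self.matches = matches
        self.action = action

def process(cookie: Cookie, rules: List[Rule]) -> str:
    """Apply the rules in sequence. The first matching rule that returns
    a decision wins; rules returning None only enrich the metadata."""
    for rule in rules:
        if rule.matches(cookie):
            decision = rule.action(cookie)
            if decision is not None:
                return decision
    return "disregard"  # the default fate of the majority of data

# Example rules: flag CRAM files (standing in for real header enrichment),
# then route anything flagged into a downstream pipeline.
enrich = Rule(
    matches=lambda c: c.path.endswith(".cram"),
    action=lambda c: c.metadata.setdefault("header_checked", True) and None,
)
route = Rule(
    matches=lambda c: c.metadata.get("header_checked", False),
    action=lambda c: "downstream",
)
```

For instance, `process(Cookie("/seq/run1/sample.cram", {}), [enrich, route])` would yield `"downstream"`, while a non-matching file falls through to `"disregard"`.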
Cookie Monster is written in Python 3.5 and is designed to be used as a general-purpose module, using CouchDB as a backend persistence layer and InfluxDB for performance metric logging. The HGI implementation (also Python 3.5) takes that module and plumbs it into the various services appropriate to our needs -- not least, our downstream pipelines -- and provides a set of rules that are applied against metadata collections matching the needs of the Human Genetics Programme and its research interests.
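The split between the general-purpose module and a site-specific implementation might look something like the sketch below. All names here (`PersistenceLayer`, `Monitor`, `observe`) are illustrative assumptions rather than Cookie Monster's real interfaces; a real deployment would back the store with CouchDB and log timings to InfluxDB, where this sketch uses an in-memory stand-in so it can run anywhere.

```python
from abc import ABC, abstractmethod

class PersistenceLayer(ABC):
    """What the generic module expects from its backend (CouchDB, in
    the real system). Hypothetical interface for illustration."""
    @abstractmethod
    def save(self, key: str, state: dict) -> None: ...

    @abstractmethod
    def load(self, key: str) -> dict: ...

class InMemoryPersistence(PersistenceLayer):
    """Trivial stand-in so this sketch runs without a CouchDB server."""
    def __init__(self) -> None:
        self._store = {}

    def save(self, key: str, state: dict) -> None:
        self._store[key] = dict(state)

    def load(self, key: str) -> dict:
        return dict(self._store.get(key, {}))

class Monitor:
    """The module side: accumulates each data point's metadata over
    successive observations, so rules can see how it changes over time."""
    def __init__(self, persistence: PersistenceLayer) -> None:
        self._persistence = persistence

    def observe(self, key: str, metadata: dict) -> dict:
        state = self._persistence.load(key)
        state.update(metadata)
        self._persistence.save(key, state)
        return state
```

An implementation would then construct a `Monitor` with its chosen backend and feed it metadata updates as they arrive; two successive `observe` calls for the same key yield the merged state.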