Metadata-check is a tool for checking that the metadata associated with a BAM/CRAM file is consistent and complete. It is primarily designed for use at Sanger, where metadata is stored in several places: iRODS metadata, file headers, and the central LIMS data warehouse. We have made it freely available in case developers at other institutes would like to modify it to suit similar needs, but it is unlikely to be particularly useful outside of Sanger without some customisation.
To check metadata, metadata-check fetches the metadata from iRODS using baton, streams the BAM/CRAM header from the file stored in iRODS using samtools, and queries SequencescapeDB for the attributes found in the iRODS metadata. It compares the metadata from these three sources (or from a subset of them, if not run with the default parameters) and outputs the list of inconsistencies found between them. In addition, it:
- checks that the metadata of a file in iRODS is complete, by comparing the iRODS metadata attributes and their frequencies with the expected attribute frequencies in a config file (given as a parameter)
- checks that the md5 stored in the iRODS metadata matches the checksum calculated by iRODS on the server side during upload (the result of ichksum)
- checks that a lanelet's file name is consistent with the run_id and lane fields in the iRODS metadata
- checks that the reference in the metadata AVU matches the one the user gives as the "desired reference" parameter
- checks that the same set of files is retrieved when querying by study id, by study name, and by study accession number
- can output a file's metadata as extracted from different sources (independent of what tests are being run)
As input, it takes either:
- a list of file paths in iRODS, or
- a study name/accession_number/internal_id (so that all the files associated with it are checked)
As output, it produces:
- a report containing the problems found with each file (regarding metadata inconsistencies)
- (optional) a list of metadata attributes gathered from different sources
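
The cross-source comparison behind the report can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the tool's actual implementation: the function name `find_inconsistencies`, the source labels, and the attribute names in the example are all hypothetical, and each source's metadata is assumed to have already been fetched into a dictionary mapping an attribute name to a set of values.

```python
from itertools import combinations


def find_inconsistencies(metadata_by_source):
    """Compare attribute values pairwise across metadata sources.

    metadata_by_source maps a source label (hypothetical labels such as
    "irods", "header", "seqscape") to a dict of attribute name -> set of
    values. Returns a list of human-readable problem descriptions.
    """
    problems = []
    # Visit every pair of sources once, in a stable (alphabetical) order.
    for (src_a, meta_a), (src_b, meta_b) in combinations(
        sorted(metadata_by_source.items()), 2
    ):
        # Only compare attributes that both sources actually report.
        for attribute in sorted(meta_a.keys() & meta_b.keys()):
            if meta_a[attribute] != meta_b[attribute]:
                problems.append(
                    f"{attribute}: {src_a}={sorted(meta_a[attribute])} "
                    f"differs from {src_b}={sorted(meta_b[attribute])}"
                )
    return problems


# Hypothetical metadata for one file, as gathered from the three sources.
example = {
    "irods": {"sample": {"S1"}, "library": {"L1"}, "study": {"ST1"}},
    "header": {"sample": {"S1"}, "library": {"L2"}},
    "seqscape": {"sample": {"S1"}, "study": {"ST1"}},
}

for problem in find_inconsistencies(example):
    print(problem)
```

In this sketch an attribute missing from one source is not treated as an inconsistency; only values reported by both sources are compared. Whether a missing attribute should count as a problem is a design choice that depends on the metadata system in use.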
This is a tool specifically designed to work with sequencing data from Sanger's 'seq' iRODS zone. It is likely to be useful elsewhere only with some developer effort to customise it for other metadata systems. It runs on a single machine.
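
As a possible starting point for such customisation, the completeness check described above amounts to counting how often each attribute occurs in a file's metadata and comparing those counts with the expected frequencies. The sketch below is illustrative only: `check_completeness` is a hypothetical name, the expected frequencies are written as a plain dictionary standing in for the config file (whose real format is not described here), and the attribute names are examples.

```python
from collections import Counter


def check_completeness(avus, expected_frequencies):
    """Compare observed attribute counts with the expected ones.

    avus is a list of (attribute, value) pairs from a file's metadata;
    expected_frequencies maps attribute name -> expected number of values.
    Returns a list of problem descriptions (empty if the metadata is complete).
    """
    observed = Counter(attribute for attribute, _ in avus)
    problems = []
    for attribute, expected in sorted(expected_frequencies.items()):
        actual = observed.get(attribute, 0)
        if actual != expected:
            problems.append(
                f"{attribute}: expected {expected} value(s), found {actual}"
            )
    return problems


# Hypothetical expectations, standing in for the contents of the config file.
expected = {"sample": 1, "study": 1, "md5": 1, "reference": 1}

# Hypothetical metadata of one file: two md5 values and no reference.
avus = [("sample", "S1"), ("study", "ST1"), ("md5", "abc"), ("md5", "def")]

for problem in check_completeness(avus, expected):
    print(problem)
```

Attributes that are present in the metadata but absent from the expectations are ignored in this sketch; a stricter variant could report them as unexpected.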