March 21, 2017, by Stefan Rennick-Egglestone
Why subject-specific research data repositories are important and challenging
Continuing innovation in scientific instrument design is driving the production of ever-larger volumes of digital research data, and allowing new research questions to be addressed. Instruments that can produce 100 gigabytes per day are becoming more common, and the PromethION genetic sequencer is an extreme example of an instrument which can produce 4 terabytes per hour at maximum capacity.
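To put these rates side by side, here is a back-of-the-envelope sketch using the figures above. The assumption that the instrument sustains maximum throughput around the clock is mine, purely for illustration — real runs rarely do.

```python
# Back-of-the-envelope data rates for the instruments mentioned above.
# Assumes sustained maximum throughput, which real runs rarely achieve.

TB = 1000  # gigabytes per terabyte (decimal units, as vendors quote them)

typical_instrument_gb_per_day = 100
promethion_tb_per_hour = 4

promethion_tb_per_day = promethion_tb_per_hour * 24
ratio = promethion_tb_per_day * TB / typical_instrument_gb_per_day

print(f"PromethION at capacity: {promethion_tb_per_day} TB/day")
print(f"That is {ratio:.0f}x a 100 GB/day instrument")
```

Even allowing for idle time, the gap between a "common" instrument and an extreme one spans several orders of magnitude, which is what makes storage planning so difficult.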
All of this digital data has to be stored somewhere, and it has long been common for scientific communities to collaboratively design, implement and maintain data repositories with features tailored to their needs. For the optical imaging community, an important example is Omero, a centralised repository which can be installed by an institution. To give an idea of scale, the current installation of Omero used by the School of Life Sciences Imaging centre sits on top of 17 terabytes of historic research data, which grows with every experiment.
Optical microscopes typically generate images in proprietary formats, and the Omero repository can be used to convert these to something more standard whilst preserving metadata – making it much more likely that images will remain accessible over the long term (a form of data preservation). Omero can also be used to collect, organise and share research data by project, by experiment or by research group, giving it a management role. Because time on an instrument is often limited, researchers will typically return to metadata stored in Omero to identify and re-use settings that worked for previous experiments. As such, long-term access to data and metadata is important for the scientific process, and metadata collected five years ago is still routinely referred to in current work.
In the context of centralised IT provision, subject-specific repositories pose an interesting challenge. They may depend on centralised IT services such as networking or storage to function, but often also require maintenance by local experts. As community projects, they may have evolved over time through informal development by large groups of people. They may be poorly documented, especially in comparison to commercial products, and they can be difficult to install and keep operational.
A particular challenge exists around long-term archiving of research data, and we need a better understanding of how this will work. A variety of cloud-based services (Arkivum, Azure cool storage) offer cheap, reliable archiving of research data, and policies such as the RCUK Common Principles on Data Policy mandate the long-term maintenance of research data and encourage sharing with others. Given the volumes of digital data produced by Life Sciences communities such as optical imaging or sequencing, long-term, on-premises preservation may not be viable.
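To make the viability question concrete, here is a rough cost sketch. The per-terabyte price and the growth rate are illustrative assumptions of mine, not quotes from any provider; only the 17-terabyte starting figure comes from the Omero installation described above.

```python
# Rough monthly archiving bill for a growing data holding.
# price_per_tb_month and growth_tb_per_year are hypothetical figures;
# real archive-tier prices vary by provider, tier and retrieval profile.

current_holding_tb = 17      # e.g. the Omero installation described above
growth_tb_per_year = 5       # assumed growth rate, for illustration
price_per_tb_month = 2.0     # assumed archive-tier price in GBP per TB-month

def monthly_cost(years_from_now):
    """Archive bill after a given number of years of growth."""
    holding = current_holding_tb + growth_tb_per_year * years_from_now
    return holding * price_per_tb_month

print(f"Now: £{monthly_cost(0):.2f}/month")
print(f"In 5 years: £{monthly_cost(5):.2f}/month")
```

Whatever the real numbers turn out to be, the shape of the calculation is the same: archiving cost scales with total holdings, not with current activity, so a growing repository commits an institution to a steadily rising bill.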
However, as community projects targeted at immediate needs, subject-specific repositories may not be cloud-enabled by default, making the effort required to transfer data to cloud archiving uncertain. This may change as the cloud becomes more integrated into work processes, but the effort required to integrate the cloud will need to be assessed for each subject-specific repository. This includes an assessment of how challenging it is to restore archived data into a repository if it becomes needed for an ongoing project.
For an institution looking to implement cloud archiving, this could therefore represent a substantial volume of work to be planned around as part of an archiving implementation project.
Next blog in series: Data management planning for a signature research centre