The ‘Triage and Handover’ session (session 3B) of the JISC Managing Research Data programme progress and DCC institutional engagements workshop (24 – 25 October 2012) differed in structure from the other sessions: it was less about project experiences and more about sharing expertise from people working specifically in this area, with discussion from the attending projects in response.
For each session, we note-takers were tasked with establishing: a) what is working; b) challenges and lessons learned; and c) what the MRD programme or the DCC can do to help. Whilst the structure of this session didn’t lend itself to this task as well as some others did, I hope this summary will supply the salient points.
Angus Whyte (DCC) began this session by acknowledging the difficulties of the area. Because there is no way of knowing which digital objects will be useful in the future, there is no one foolproof way to decide which data should be retained for handover at project end to institutional data management services, and which can be disposed of.
‘Triage’ here is used in the business sense rather than the medical sense: it is meant to imply the existence of a process of decision-making which can determine resource allocation. ‘Selection’ suggests an either/or decision, which is useful to consider, but Angus makes the point that for institutions the greater need is to define a range of decisions. One of these will be disposal. Others might range from showcasing high-value data online to keeping low-value data on tape back-up.
As a co-author of the DCC ‘How to’ guide on appraisal and selection of data for curation, Angus has spent some time considering various models that are used by data centres and archives to guide their decision-making. He described the basic records management approach to this:
1. Define a policy, i.e. criteria and range of decisions
2. Archive management applies criteria: select the significant, dispose of the rest
However, he argues, there are a few complications for this model when it comes to dealing with research data, i.e.:
- Research processes may be more complex (need more explanation) than administrative processes
- Data purpose may change
- Data may need more effort to make re-usable
- Complex relationships and rich contexts
- Originators should be engaged but may not have capacity to be
- Others may need to be involved too
- More than keep / dispose choice – need to prioritise attention and effort to make data fit for re-use.
So, for research data:
- First, characterise. What is this data? What are the relationships within it and what are the significant aspects of the context in which it was created?
- Appraisal criteria should establish: who has the duty of care? How accessible is the data? What is its re-use value, and what costs are involved?
- Categorise the responses to these criteria or questions, i.e. combinations of high or low ratings. These are your triage levels: levels of effort and cost attached to making data accessible and discoverable, balanced against the likely range of re-use cases and benefits.
- An important factor will be whether there are other natural homes for the data and, if so, whether there are benefits from retaining a copy with the institution.
- A tiered approach to data value could in theory map to a tiered approach to resource costs, e.g. for discoverability, access management, storage performance, preservation actions.
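As a purely illustrative sketch of the triage approach described above, the appraisal answers might be combined into a small range of decisions. All category names, fields and rules here are my own assumptions for illustration, not DCC guidance:

```python
# Illustrative triage sketch: combine high/low appraisal ratings into a
# small range of decisions, rather than a binary keep/dispose choice.
# Field names and rules are hypothetical examples, not DCC policy.

from dataclasses import dataclass

@dataclass
class Appraisal:
    """High/low answers to the appraisal questions for one dataset."""
    duty_of_care: bool      # does the institution have a duty to keep it?
    reuse_value_high: bool  # is the likely re-use value high?
    cost_low: bool          # is the cost of making it re-usable low?
    has_other_home: bool    # is there a natural home elsewhere (e.g. a data centre)?

def triage(a: Appraisal) -> str:
    """Map a combination of ratings to a triage level (illustrative only)."""
    if a.has_other_home:
        # Deposit elsewhere; the institution may still retain a copy.
        return "refer to external data centre"
    if a.reuse_value_high and a.cost_low:
        return "showcase online"
    if a.reuse_value_high or a.duty_of_care:
        return "curate in institutional repository"
    if a.cost_low:
        return "retain on tape back-up"
    return "dispose"
```

In practice these decisions involve human judgement and richer context than boolean flags, but the sketch shows how a small set of criteria can map to a tiered range of outcomes and resource costs.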
Clearly, some effort is required here. This may make senior management, as well as the researchers themselves, ask, ‘why not just keep it all?’ Well, in the arguments for selection, costs are a significant issue. The digital storage required has grown exponentially in recent years: this covers many types of digital content, including research data, and of course other types of digital material can also be useful in the research process.
David Rosenthal estimated in his frequently-mentioned blogpost of 14 May 2012 how much it would cost to ‘keep everything forever in the cloud’. He speculated that, based on current cost trajectories, keeping 2018’s data in S3 (Amazon’s cloud storage service) will ‘consume more than the entire GWP [Gross World Product] for the year’. Whilst the DC/DP/RDM community may argue around the specifics of Rosenthal’s position here, his argument does help to demonstrate that whilst storage costs – never mind those for curation – have long been largely invisible to researchers, they are real, and clarity here can help us to price curation (including storage) realistically and responsibly.
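The arithmetic behind this kind of argument can be sketched in a few lines: if data volumes grow faster than the unit price of storage falls, the cost of storing each year’s data rises without bound. The growth and decline rates below are illustrative assumptions, not Rosenthal’s actual figures:

```python
# Back-of-envelope sketch: the cost of "keeping everything" when data
# volume grows faster than storage prices fall. All rates are
# illustrative assumptions, not Rosenthal's figures.

def total_storage_cost(years: int,
                       initial_volume: float = 1.0,   # arbitrary units
                       initial_price: float = 1.0,    # cost per unit stored
                       volume_growth: float = 0.60,   # +60%/year (assumed)
                       price_decline: float = 0.20):  # -20%/year (assumed)
    """Cost of storing the given year's data, `years` from now."""
    volume, price = initial_volume, initial_price
    for _ in range(years):
        volume *= 1 + volume_growth
        price *= 1 - price_decline
    return volume * price
```

Under these assumed rates the yearly bill compounds at about 28% per year (1.6 × 0.8 = 1.28), so it eventually outgrows any fixed budget; that is the shape of the argument, whatever the exact figures.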
Selection presumes description. You can’t value what you don’t know about. Angus argued researchers can’t afford not to spend effort on minimal metadata description and organisation, because costs of retention will be much higher if they don’t. Description makes data affordable – is citation potential a concrete enough reward?
To summarise, we must identify what datasets are created and where they are, and differentiate priorities.
Marie-Therese Gramstadt then outlined the activity of the JISC MRD KAPTUR project relating to selection and retention. KAPTUR is aware of previous JISC MRD work in training. One of the main questions addressed by KAPTUR is how to select and appraise research data. In their approach, they have referred to the DCC paper on this topic, and held an event earlier this year to further explore the issues. The event discussed the following aspects of research data in the creative arts and how to select it for management:
- Value and context, including scientific and historical value;
- Value creation;
- Ethics and legal issues;
- Enabling use and reuse;
- Enabling long-term access.
(More information on this KAPTUR event, including the presentations, is available at http://kapturmrd01.eventbrite.co.uk/.)
Veerle Van den Eynden of the UKDA then presented a data centre view of the issue, as opposed to an institution-level view. She described the current process that applies to deposit in the ESRC-funded UK Data Service, including the data review form, the work of the acquisitions committee which evaluates applications for deposit, and the acceptance criteria they apply.
The acquisitions committee will give one of three decisions about a dataset offered:
- accept data into the main ESDS collection for curation and longer-term preservation (with the processing level determined: either A, B or C);
- accept data into the self-archive system, the ESRC data store, for short-term management and access; or,
- unable to accept data.
This is a useful reminder that selection for management (including preservation) need not be a binary matter of yes / no but can consist of a range of possible management solutions.
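The shape of that decision, combined with the acceptance criteria Veerle went on to describe, might be sketched as follows. The field names and rules are my own assumptions for illustration; the real acquisitions committee applies human judgement, not a script:

```python
# Illustrative sketch of a three-way acquisition decision of the kind
# described above. Fields and rules are hypothetical, not UKDS policy.

from dataclasses import dataclass

@dataclass
class Offer:
    in_scope: bool                   # within the collection's scope
    long_term_value: bool            # long-term value and re-use potential
    esrc_funded: bool                # data from ESRC-funded research
    viable_for_preservation: bool    # acceptable format, well documented
    legal_or_ethical_obstacles: bool # e.g. copyright, consent problems

def acquisition_decision(o: Offer) -> str:
    if o.legal_or_ethical_obstacles or not o.in_scope:
        return "unable to accept"
    if o.long_term_value and o.viable_for_preservation:
        return "main ESDS collection (long-term preservation)"
    if o.esrc_funded:
        return "ESRC data store (short-term management and access)"
    return "unable to accept"
```

Again, the point is the structure rather than the rules: a range of outcomes, each attaching a different level of curation effort to the data.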
Acceptance criteria include:
- Within scope
- Long-term value and re-use potential
- Data requested (by ESDS advisory committee, users)
- Data from ESRC-funded research
- Viable for preservation (acceptable file format, well documented)
Common reasons for non-acceptance:
- Value of data in publications
- Legal obstacles (copyright, IPR)
- Ethical constraints (consent, anonymisation)
- Depositor requires unnecessarily stringent access conditions
Currently, about 5–10% of data offered falls into these categories of non-acceptance.
There are currently some draft categories for the data collections accepted by UKDS.
- Data collections selected for long term curation
- Data collections selected for ‘short term’ management
- Data collections selected for ‘delivery’ only
- Data collections selected for ‘discovery’ only.
The Data Service has a Collections Development Policy, currently in draft, which addresses factors such as:
- Scientific or historical value
- Replication data and resources (materials required for replicating research)
Even if other projects and services don’t have the same levels of experience and capacity as the UK Data Service, these aspects of Data Service policy and structure provide an example of a functional approach to ‘triage’ and selection of research data.
Veerle also mentioned the repository engagement project, which supports institutional data management and repository managers in their local role as ESRC data curators. Through this, the Data Service aims to provide IR staff with guidance and training in appraising social science research data, and in other good practice. This is helpful in the current environment, where funders increasingly expect institutions to take more responsibility for archiving data. You can see Veerle’s presentation here.
Marie-Therese then briefly showed material from Sam Peplar of NERC, who was unable to attend at short notice. This described the development of the NERC data value checklist, which aims to make selection better, more consistent and more objective. It emerged from consultancy in the research sector and has been modified in response to user feedback.
NERC funding requires an outline DMP at proposal stage with a detailed DMP when funding is agreed. The data value checklist is intended to be useful when preparing this full DMP but, Sam’s material cautioned, the checklist should not be expected to give some authoritative or definitive response to whether the data should be retained. Rather, it supplies questions on which to reflect around aspects of the data such as storage, access, formats, origin, conditions, etc. Sam is clear that there are not neat solutions for selecting data; objective rules are not possible. He is also clear that scientists are not generally prepared to do the selection alone – this is an area of RDM which requires support.
The group feedback included various pertinent questions, and concluded that whilst there is no single methodology for discerning the future value of data, it is important for institutions to understand where they fit in the current landscape in terms of their responsibility to assist researchers in responsible selection and deposit of data. Veerle confirmed that funders expect data to go to the IR where available, and to a data centre if not. In either case, it is massively helpful if acceptance criteria are public: this can help researchers and research support staff to discern the most appropriate data for selection.
What are your main challenges in selecting and disposing of research data? What could the JISC MRD programme or the DCC do to help? Tell us in the comments.