Information Management

Paper: Research Abstract

Quantifying Biomedical Data Reuse: Do Citations Tell the Whole Story?

Monday, May 6
5:20 PM - 5:35 PM
Room: Columbus GH (East Tower, Ballroom/Gold Level)

Objectives: Many funders and journals now require researchers to share their final research data. Understanding how these data are reused could strengthen sharing policies, inform decision-making about curation, and facilitate the development of metrics to reward sharing. However, tracking reuse remains challenging. This study explores the extent to which article citations to datasets accurately reflect their reuse.
Methods: This study measures the correlation between data reuse and citation, and characterizes the types of reuse underlying data citations, by analyzing use requests for and citations to datasets from three biomedical repositories: two that collect clinical data and one that collects genomic data. Comparing use requests, which serve as a proxy for reuse, with citations provides insight into how accurately data citations reflect reuse. Citing articles were analyzed to understand how datasets are reused (for example, for original studies, meta-analyses, or methods validation) and how authors cited the datasets they had reused. Finally, semantic similarity was used to compare the MeSH terms for the articles with the terms assigned to their corresponding datasets. This analysis provided a quantitative measure of whether data were being reused in contexts similar to those for which they had been collected or in novel topics.
Results: While use requests and citations for datasets in this study are correlated, the average dataset had only about one citation for every eight requests. Articles citing data represented many types of reuse, with different patterns of reuse for clinical versus genomic data. While most articles reused datasets in a context similar to that for which the dataset was collected, 10% of the article/dataset pairs had a semantic similarity score of 0, meaning the data were reused in a very different context. Citations themselves lacked consistency, with authors indicating that they had reused datasets in a range of locations within the article.
Conclusions: The large disparity between citations and use requests suggests that citations do not adequately capture the extent or characteristics of data reuse. These results have implications for how data reuse is measured and evaluated and, therefore, for how the impact of datasets can be assessed to reward researchers who share their data. These findings could provide guidance to journals, funders, repositories, and researchers who share data about how to increase the visibility of dataset reuse.
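
The comparison of use requests with citations described in the Methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which correlation statistic was used or how the per-dataset average was computed, so this sketch uses a Pearson correlation and a mean of per-dataset request-to-citation ratios. The dataset names and counts are hypothetical and are not data from the study.

```python
from math import sqrt

# Hypothetical (use requests, citations) counts per dataset.
# These numbers are invented for illustration; they are NOT study data.
datasets = {
    "dataset_a": (40, 5),
    "dataset_b": (16, 2),
    "dataset_c": (80, 9),
    "dataset_d": (24, 4),
    "dataset_e": (8, 1),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_requests_per_citation(pairs):
    """Average the requests/citations ratio over datasets with >= 1 citation.

    Datasets with zero citations are skipped to avoid dividing by zero;
    whether the study handled them this way is an assumption.
    """
    ratios = [requests / citations for requests, citations in pairs if citations > 0]
    return sum(ratios) / len(ratios)


requests = [r for r, _ in datasets.values()]
citations = [c for _, c in datasets.values()]

print("correlation:", round(pearson(requests, citations), 3))
print("requests per citation:", round(mean_requests_per_citation(datasets.values()), 2))
```

A high correlation together with a request-to-citation ratio well above 1 would match the pattern reported in the Results: citations track reuse in direction while undercounting its extent.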

Lisa Federer, AHIP

Data Science and Open Science Librarian
National Library of Medicine
North Bethesda, Maryland

Lisa Federer is the Data Science and Open Science Librarian at the National Library of Medicine, where she focuses on supporting workforce development and enhancing capacity for data science and open science in the biomedical research community. Prior to joining NLM, Lisa spent five years as the Research Data Informationist at the National Institutes of Health Library, where she developed and ran the Library’s Data Services Program. An active member of the Medical Library Association, she serves on the JMLA editorial board and as chair of the Information Management Curriculum Committee, and was the editor of the Medical Library Association Guide to Data Management for Librarians. She holds a PhD in information studies from the University of Maryland and an MLIS from the University of California, Los Angeles, as well as graduate certificates in data science and data visualization.

