Paper: Research Abstract
Quantifying Biomedical Data Reuse: Do Citations Tell the Whole Story?
Monday, May 6
5:20 PM - 5:35 PM
Room: Columbus GH (East Tower, Ballroom/Gold Level)
Lisa Federer, AHIP
Data Science and Open Science Librarian
National Library of Medicine
North Bethesda, Maryland
Objectives : Many funders and journals now require researchers to share their final research data. Understanding how these data are reused could strengthen sharing policies, inform decision-making about curation, and facilitate development of metrics to reward sharing. However, tracking reuse remains challenging. This study explores the extent to which article citations to datasets accurately reflect their reuse.
Methods : This study measures the correlation between data reuse and citation and characterizes the types of reuse underlying data citations by analyzing use requests for, and citations to, datasets from three biomedical repositories: two that collect clinical data and one that collects genomic data. Comparing use requests, which serve as a proxy for reuse, with citations provides insight into how accurately data citations reflect reuse. Citing articles were analyzed to understand how datasets are reused, such as for original studies, meta-analyses, or methods validation, as well as how authors cited the dataset they had reused. Finally, semantic similarity was used to compare MeSH terms for the articles to the terms assigned to their corresponding datasets. This analysis provided a quantitative measure of whether data were being reused in contexts similar to those for which they had been collected or for novel topics.
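The abstract does not name the semantic similarity measure used, so the following is only an illustrative sketch of how an article/dataset pair could be scored from its MeSH terms. It assumes a plain Jaccard overlap between the two term sets as a stand-in; the function name and the example terms are hypothetical and are not taken from the study. Under this stand-in, a pair with no terms in common scores 0.

```python
def jaccard_similarity(article_terms, dataset_terms):
    """Share of MeSH terms common to both sets (0.0 = no overlap, 1.0 = identical).

    Assumption: Jaccard overlap is a simple stand-in, not the study's measure.
    """
    a, b = set(article_terms), set(dataset_terms)
    if not a and not b:
        # No terms on either side: treat the pair as having no measurable overlap.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical MeSH terms for one article/dataset pair.
article = ["Genomics", "Neoplasms"]
dataset = ["Genomics", "Humans"]
score = jaccard_similarity(article, dataset)  # one shared term out of three distinct terms
```

A measure that walks the MeSH hierarchy would also give partial credit to related but non-identical terms, which this set-overlap sketch does not.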
Results : While use requests and citations for datasets in this study are correlated, the average dataset had only one citation for about every 8 requests. Articles citing data represented many types of reuse, with different patterns of reuse for clinical versus genomic data. While most articles reused datasets in a context similar to that for which the dataset was collected, 10% of the article/dataset pairs had a semantic similarity score of 0, meaning the datasets were reused in a very different context. Citations themselves lacked consistency, with authors indicating they had reused datasets in a range of locations within the article.
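As a rough illustration of the comparison between use requests and citations, the sketch below computes a rank correlation and an overall requests-per-citation ratio. The abstract does not state which correlation statistic was used or how the ratio was averaged, so Spearman's rank correlation and a pooled ratio are assumptions here, and the per-dataset counts are made-up numbers, not the study's data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def requests_per_citation(requests, citations):
    """Pooled ratio of all use requests to all citations (one possible averaging)."""
    return sum(requests) / sum(citations)


# Hypothetical counts for four datasets (illustrative only).
requests = [40, 16, 8, 24]
citations = [5, 2, 1, 3]
rho = spearman(requests, citations)
ratio = requests_per_citation(requests, citations)
```

A high correlation together with a ratio well above 1 would match the pattern reported here: datasets that are requested more are cited more, yet most requests never surface as citations.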
Conclusions : The large disparity between citations and use requests suggests that citations do not adequately capture the extent or characteristics of data reuse. These results have implications for how data reuse is measured and evaluated and, therefore, for how the impact of datasets can be assessed to reward researchers who share their data. These findings could provide guidance to journals, funders, repositories, and researchers who share data about how to increase the visibility of dataset reuse.