Fang Liri, Salami Malik Oyewale, Weber Griffin M, Torvik Vetle I
School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, United States.
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.
Data Brief. 2025 Apr 2;60:111535. doi: 10.1016/j.dib.2025.111535. eCollection 2025 Jun.
There has been a recent push to make public, aggregate, and increase coverage of bibliographic citation data. Here we describe uCite, a citation dataset containing 564 million PubMed citation pairs aggregated from the following nine sources: PubMed Central, iCite, OpenCitations, Dimensions, Microsoft Academic Graph, Aminer, Semantic Scholar, Lens, and OpCitance. Of these, 51 million (9%) were labeled unreliable, as determined by patterns of source discrepancies explained by ambiguous metadata, crosswalk, and typographical errors, citations to future publications, and multi-paper documents. Each source contributes to improved coverage and reliability, but the sources vary dramatically in precision and recall, estimates of which are contrasted with the Web of Science and Scopus herein.
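To make the aggregation and the precision/recall comparison concrete, here is a minimal sketch on toy data. It is not the uCite pipeline: the source names, the PMIDs, and the rule "a pair is treated as reference-quality if at least two sources report it" are all assumptions made for illustration, and the dataset's actual reliability labels rest on the discrepancy patterns described above.

```python
from collections import defaultdict


def aggregate(sources):
    """Map each (citing_pmid, cited_pmid) pair to the set of sources reporting it."""
    support = defaultdict(set)
    for name, pairs in sources.items():
        for pair in pairs:
            support[pair].add(name)
    return dict(support)


def precision_recall(source_pairs, reference_pairs):
    """Precision and recall of one source's pairs against a reference set."""
    true_pos = len(source_pairs & reference_pairs)
    precision = true_pos / len(source_pairs) if source_pairs else 0.0
    recall = true_pos / len(reference_pairs) if reference_pairs else 0.0
    return precision, recall


# Hypothetical toy sources; real inputs would be PubMed citation pairs.
sources = {
    "source_a": {(101, 55), (101, 56), (102, 55)},
    "source_b": {(101, 55), (102, 55), (103, 999)},
    "source_c": {(101, 55), (101, 56)},
}

support = aggregate(sources)

# Assumed stand-in for a gold standard: pairs reported by two or more sources.
reference = {pair for pair, names in support.items() if len(names) >= 2}

for name, pairs in sorted(sources.items()):
    p, r = precision_recall(pairs, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy setup the pair (103, 999) is reported only by `source_b`, so it falls outside the assumed reference set and lowers that source's precision, while `source_c` is fully precise but misses one reference pair.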