National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States.
Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, United States.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae672.
Over 55% of author names in PubMed are ambiguous: the same name is shared by different researchers. This poses a significant challenge for precise literature retrieval by author name, a common type of query in biomedical literature search. In response, we present a comprehensive dataset of disambiguated authors. Specifically, we complement the automatic PubMed Computed Authors algorithm with the latest ORCID data for improved accuracy. The enhanced algorithm achieves high performance in author name disambiguation, and the resulting dataset contains more than 21 million disambiguated authors for over 35 million PubMed articles and is updated incrementally every week. More importantly, we make the dataset publicly available so that the community can use it in a wide variety of applications beyond assisting PubMed's author name queries. Finally, we propose a set of best-practice guidelines for authors on the use of their names.
The PubMed Computed Authors dataset is publicly available for bulk download at: https://ftp.ncbi.nlm.nih.gov/pub/lu/ComputedAuthors/. It can also be queried through a web API at: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/authors/.
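To make the notion of an ambiguous name concrete, the following minimal sketch computes the share of distinct names that map to more than one disambiguated author. It is an illustration only: the `(name, author_id)` record layout, the function name `ambiguous_name_share`, and the toy records are all hypothetical and do not reflect the actual file format of the dataset or the exact definition used to obtain the 55% figure.

```python
from collections import defaultdict


def ambiguous_name_share(records):
    """Return the fraction of distinct names shared by more than one author ID.

    `records` is an iterable of (name, author_id) pairs. This layout is a
    hypothetical stand-in, not the real PubMed Computed Authors format.
    """
    ids_by_name = defaultdict(set)
    for name, author_id in records:
        ids_by_name[name].add(author_id)
    if not ids_by_name:
        return 0.0
    # A name counts as ambiguous when it resolves to two or more authors.
    ambiguous = sum(1 for ids in ids_by_name.values() if len(ids) > 1)
    return ambiguous / len(ids_by_name)


# Toy data (invented for illustration): one row per authorship instance.
toy_records = [
    ("Smith J", "A1"),
    ("Smith J", "A2"),   # same name, different researcher
    ("Wang L", "A3"),
    ("Wang L", "A3"),    # same researcher on two articles
    ("Garcia M", "A4"),
    ("Lee K", "A5"),
    ("Lee K", "A6"),     # same name, different researcher
]

share = ambiguous_name_share(toy_records)
print(f"{share:.2f}")  # 2 of 4 distinct names are ambiguous -> 0.50
```

Counting distinct identifiers per name, rather than rows per name, keeps a prolific single author (here "Wang L") from being mistaken for an ambiguous name.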