A new approach and gold standard toward author disambiguation in MEDLINE.

Affiliations

Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center, Basel, Switzerland.

Institute of Computational Linguistics, University of Zurich, Switzerland.

Publication information

J Am Med Inform Assoc. 2019 Oct 1;26(10):1037-1045. doi: 10.1093/jamia/ocz028.

Abstract

OBJECTIVE

Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study we used MEDLINE to build the first unbiased gold standard in a reference database and improve over the existing state of the art in author disambiguation.

MATERIALS AND METHODS

Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked to create a full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms that would not have been discoverable with previously existing gold standards.
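The pair-sampling step can be illustrated with a minimal sketch. The abstract does not say how candidate pairs were formed, so everything here is an assumption: the `name_key` blocking rule (last name plus first initial), the `(pmid, last_name, first_name)` record layout, and the function names are all hypothetical and are not taken from the authors' repository.

```python
import random
from collections import defaultdict
from itertools import combinations


def name_key(last_name, first_name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    return f"{last_name.strip().lower()}_{first_name.strip().lower()[:1]}"


def sample_pairs(records, n_pairs, seed=0):
    """Draw up to n_pairs distinct publication pairs that share a name key.

    records: iterable of (pmid, last_name, first_name) tuples (assumed layout).
    Returns a list of (key, pmid_a, pmid_b) tuples.
    """
    blocks = defaultdict(list)
    for pmid, last, first in records:
        blocks[name_key(last, first)].append(pmid)
    # Enumerate every candidate pair; fine for a toy set,
    # too costly to do naively over all of MEDLINE.
    candidates = [
        (key, a, b)
        for key, pmids in blocks.items()
        for a, b in combinations(sorted(set(pmids)), 2)
    ]
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(candidates, min(n_pairs, len(candidates)))
```

Each sampled pair would then go to annotators, who decide whether the two publications belong to the same person. Sampling uniformly over candidate pairs, rather than hand-picking well-known ambiguous names, is one plausible way to keep the corpus representative of the database.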

RESULTS

We created a gold standard based on 1900 publication pairs that shows close similarity to MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art and demonstrates the necessity of realistic gold standards to further develop effective author disambiguation algorithms.
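One way to quantify how closely a sampled corpus follows the chronological distribution of its source database is the total variation distance between the two publication-year distributions. The abstract does not state which similarity measure the authors used, so this metric and the helper names below are assumptions chosen for illustration.

```python
from collections import Counter


def year_distribution(years):
    """Return the relative frequency of each publication year."""
    counts = Counter(years)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("years must not be empty")
    return {year: count / total for year, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

A value near 0 for the gold standard against the full database would support the claim of close chronological similarity; the same comparison could be repeated for any categorical field, such as whether an affiliation is present.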

DISCUSSION AND CONCLUSION

An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/.
