Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center, Basel, Switzerland.
Institute of Computational Linguistics, University of Zurich, Switzerland.
J Am Med Inform Assoc. 2019 Oct 1;26(10):1037-1045. doi: 10.1093/jamia/ocz028.
Author-centric analyses of fast-growing biomedical reference databases are challenging due to author ambiguity. This problem has been mainly addressed through author disambiguation using supervised machine-learning algorithms. Such algorithms, however, require adequately designed gold standards that reflect the reference database properly. In this study, we used MEDLINE to build the first unbiased gold standard for a reference database and to improve on the existing state of the art in author disambiguation.
Following a new corpus design method, publication pairs randomly picked from MEDLINE were evaluated by both crowdsourcing and expert curators. Because the latter showed higher accuracy than crowdsourcing, expert curators were tasked with creating the full corpus. The corpus was then used to explore new features that could improve state-of-the-art author disambiguation algorithms, features that would not have been discoverable with previously existing gold standards.
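The random selection of publication pairs described above could be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the `(pmid, name_key)` record layout, the function name `sample_publication_pairs`, and the restriction to pairs sharing a normalized author name are all assumptions made for the example.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_publication_pairs(records, n_pairs, seed=0):
    """Draw a uniform random sample of publication pairs that share a name key.

    records: iterable of (pmid, name_key) tuples (hypothetical layout, where
             name_key is some normalized author name such as "smith j").
    Returns a list of (name_key, pmid_a, pmid_b) tuples with pmid_a < pmid_b.
    """
    # Group publications by the (assumed) normalized author name.
    by_name = defaultdict(set)
    for pmid, name_key in records:
        by_name[name_key].add(pmid)

    # Enumerate every candidate pair within each name group.
    candidates = [
        (name, a, b)
        for name, pmids in sorted(by_name.items())
        for a, b in combinations(sorted(pmids), 2)
    ]

    # Sample without replacement; a fixed seed keeps the draw reproducible.
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_pairs, len(candidates)))
```

Enumerating all candidate pairs is only practical for small inputs; at MEDLINE scale a sketch like this would need a sampling scheme that avoids materializing every pair.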
We created a gold standard based on 1900 publication pairs that closely resembles MEDLINE in terms of chronological distribution and information completeness. A machine-learning algorithm that includes new features related to the ethnic origin of authors showed significant improvements over the current state of the art, demonstrating the necessity of realistic gold standards to further develop effective author disambiguation algorithms.
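One way to check how closely a corpus follows the chronological distribution of its source database is to compare the two distributions of publication years. The sketch below uses total variation distance, which is a choice made here for illustration and not a metric stated in the abstract; the function names are likewise hypothetical.

```python
from collections import Counter


def year_distribution(years):
    """Return the relative frequency of each publication year."""
    counts = Counter(years)
    total = sum(counts.values())
    return {year: count / total for year, count in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0.0 means the distributions are identical; 1.0 means they share no years.
    """
    years = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in years)
```

A value near zero would indicate that the sampled corpus and the database spread their publications across years in similar proportions.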
An unbiased gold standard can give a more accurate picture of the status of author disambiguation research and help in the discovery of new features for machine learning. The principles and methods shown here can be applied to other reference databases beyond MEDLINE. The gold standard and code used for this study are available at the following repository: https://github.com/amorgani/AND/.