School of Information Management, Wuhan University, Wuhan, China.
J Am Med Inform Assoc. 2021 Aug 13;28(9):1919-1927. doi: 10.1093/jamia/ocab095.
PubMed has long suffered from author name ambiguity. Existing studies on author name disambiguation (AND) for PubMed have relied solely on PubMed's internal metadata. However, some of this metadata is incomplete (eg, many author names appear only in abbreviated form, with no full name available) or weakly discriminative. To this end, we present a new disambiguation method, AggAND, which aggregates information from external databases.
We address this issue by drawing on Microsoft Academic Graph, Semantic Scholar, and PubMed Knowledge Graph to enhance the built-in name metadata and to extend the internal metadata with external, more discriminative metadata.
Experiments with the enhanced name metadata show performance comparable to that of 3 author identifier systems and superior to that of the original name metadata. More importantly, our method, AggAND, which incorporates both the enhanced name metadata and the extended metadata, achieves F1 scores of 95.80% and 93.71% on 2 datasets, outperforming the state-of-the-art method by a large margin (3.61% and 6.55%, respectively).
The feasibility and good performance of our method not only clarify the importance of external databases for disambiguation, but also point to a promising direction for future AND studies: aggregating information from multiple bibliographic databases can effectively improve disambiguation performance. The methodology shown here can be generalized to bibliographic databases beyond PubMed. Our code and data are available online (https://github.com/carmanzhang/PubMed-AND-method).
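To make the idea of enhancing abbreviated name metadata with external sources concrete, the following is a minimal Python sketch. It is not the AggAND implementation: the function names, the record format (surname and given-name pairs assumed to be already matched to the same paper and author position), and the majority-vote rule are all illustrative assumptions.

```python
from collections import Counter


def is_abbreviated(given: str) -> bool:
    """Return True when the given name holds only initials, e.g. 'L' or 'L.J.'."""
    tokens = given.replace(".", " ").replace("-", " ").split()
    return all(len(tok) == 1 for tok in tokens)


def enhance_given_name(surname: str, given: str, external_records) -> str:
    """Replace an abbreviated given name with a full form seen in external data.

    external_records is a hypothetical list of (surname, given) pairs taken
    from external databases for the same paper and author position.
    """
    if not is_abbreviated(given):
        return given  # already a full name, nothing to enhance

    initial = given[:1].lower()
    candidates = [
        ext_given.strip()
        for ext_surname, ext_given in external_records
        if ext_surname.strip().lower() == surname.strip().lower()
        and ext_given.strip()[:1].lower() == initial
        and not is_abbreviated(ext_given)
    ]
    if not candidates:
        return given  # no compatible full name found, keep the original

    # Assumed rule: take the full form that most external sources agree on.
    return Counter(candidates).most_common(1)[0][0]
```

For example, calling `enhance_given_name("Zhang", "L", records)` with records from three hypothetical sources would return the full given name that most of them report, and would leave the name unchanged if no source offers a compatible full form.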