Author Name Disambiguation in MEDLINE

Authors

Torvik Vetle I, Smalheiser Neil R

Affiliation

University of Illinois at Chicago.

Publication

ACM Trans Knowl Discov Data. 2009 Jul 1;3(3). doi: 10.1145/1552303.1552304.

DOI:10.1145/1552303.1552304

PMID:20072710

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2805000/

Abstract

BACKGROUND

We recently described "Author-ity," a model for estimating the probability that two articles in MEDLINE, sharing the same author name, were written by the same individual. Features include shared title words, journal name, coauthors, medical subject headings, language, affiliations, and author name features (middle initial, suffix, and prevalence in MEDLINE). Here we test the hypothesis that the Author-ity model will suffice to disambiguate author names for the vast majority of articles in MEDLINE.

METHODS

Enhancements include: (a) incorporating first names and their variants, email addresses, and correlations between specific last names and affiliation words; (b) new methods of generating large unbiased training sets; (c) new methods for estimating the prior probability; (d) a weighted least squares algorithm for correcting transitivity violations; and (e) a maximum likelihood based agglomerative algorithm for computing clusters of articles that represent inferred author-individuals.

RESULTS

Pairwise comparisons were computed for all author names on all 15.3 million articles in MEDLINE (2006 baseline), that share last name and first initial, to create Author-ity 2006, a database that has each name on each article assigned to one of 6.7 million inferred author-individual clusters. Recall is estimated at ~98.8%. Lumping (putting two different individuals into the same cluster) affects ~0.5% of clusters, whereas splitting (assigning articles written by the same individual to >1 cluster) affects ~2% of articles.

IMPACT

The Author-ity model can be applied generally to other bibliographic databases. Author name disambiguation allows information retrieval and data integration to become person-centered, not just document-centered, setting the stage for new data mining and social network tools that will facilitate the analysis of scholarly publishing and collaboration behavior.

AVAILABILITY

The Author-ity 2006 database is available for nonprofit academic research, and can be freely queried via http://arrowsmith.psych.uic.edu.
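To make the pairwise-to-cluster step concrete, the following is a minimal sketch of how pairwise match probabilities could be turned into author clusters. It is not the authors' implementation. It assumes pairwise probabilities are already available, treats pairs as independent, and greedily merges the two clusters whose cross-pair log-odds sum is largest and positive, which is one simple way to read "maximum likelihood based agglomerative". The transitivity check uses the inequality p_ij + p_jk - 1 > p_ik, which is an assumed formulation; the abstract does not state one. The names `log_odds`, `violates_transitivity`, `cluster_articles`, and `example_probs` are hypothetical.

```python
import math
from itertools import combinations


def log_odds(p, eps=1e-9):
    """Log-odds of a match probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def violates_transitivity(p_ij, p_jk, p_ik):
    """Assumed check: i~j and j~k are both likely, yet i~k is too unlikely."""
    return p_ij + p_jk - 1 > p_ik


def cluster_articles(n_articles, pair_prob, default=0.5):
    """Greedy agglomerative clustering over pairwise match probabilities.

    pair_prob maps (i, j) with i < j to the estimated probability that
    articles i and j share an author. Missing pairs fall back to `default`,
    which contributes zero log-odds when it is 0.5.
    """
    clusters = [{i} for i in range(n_articles)]

    def merge_gain(a, b):
        # Sum of cross-pair log-odds: positive means merging is favored
        # under the (assumed) independence of pairwise evidence.
        return sum(
            log_odds(pair_prob.get((min(i, j), max(i, j)), default))
            for i in a
            for j in b
        )

    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for x, y in combinations(range(len(clusters)), 2):
            gain = merge_gain(clusters[x], clusters[y])
            if gain > best_gain:
                best_gain, best_pair = gain, (x, y)
        if best_pair is None:  # no merge improves the score; stop
            break
        x, y = best_pair  # x < y, so deleting y leaves index x valid
        clusters[x] |= clusters[y]
        del clusters[y]
    return clusters


# Toy input (invented for illustration): four articles with the same name.
example_probs = {
    (0, 1): 0.95, (0, 2): 0.90, (1, 2): 0.30,
    (0, 3): 0.02, (1, 3): 0.10, (2, 3): 0.05,
}
result = cluster_articles(4, example_probs)
```

In this toy input, articles 0, 1, and 2 end up in one cluster and article 3 stays alone, even though the pair (1, 2) has a low probability on its own: the strong evidence linking both to article 0 outweighs it. The triple (0.95, 0.90, 0.30) is also flagged by `violates_transitivity`, the kind of inconsistency that step (d) is meant to correct before clustering.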
