Jonnalagadda Siddhartha Reddy, Topham Philip
Lnx Research LLC, 750 The City Drive Suite 490, Orange, CA 92868, USA.
J Biomed Discov Collab. 2010 Oct 4;5:50-75.
Today, more than 18 million articles related to biomedical research are indexed in MEDLINE, and information derived from them could be used to save much of the time and resources that government agencies spend on understanding the scientific landscape, including key opinion leaders and centers of excellence. Associating biomedical articles with organization names could significantly benefit the pharmaceutical marketing industry, health care funding agencies and public health officials, and could be useful to other scientists in normalizing author names, automatically creating citations, indexing articles and identifying potential resources or collaborators. The large amount of extracted information helps in disambiguating organization names using machine-learning algorithms.
We propose NEMO, a system for extracting organization names from the affiliation and normalizing them to a canonical organization name. Our parsing process involves multi-layered rule matching with multiple dictionaries. The system achieves an F-score of more than 98% in extracting organization names. Our normalization process involves clustering based on local sequence alignment metrics and local learning based on finding connected components. High precision was also observed in normalization.
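The normalization idea described above (clustering by a local sequence alignment metric, then grouping by connected components) can be illustrated with a minimal sketch. This is not NEMO's implementation: the Smith-Waterman scoring parameters, the normalization by the shorter string, the 0.8 threshold, and the function names `local_alignment_score`, `similarity` and `cluster_names` are all assumptions chosen for illustration.

```python
from itertools import combinations


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score (scoring values are assumed)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity(a, b, match=2):
    """Alignment score divided by the best score the shorter string could get."""
    a, b = a.lower(), b.lower()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return local_alignment_score(a, b, match=match) / (match * shortest)


def cluster_names(names, threshold=0.8):
    """Link names whose similarity meets the (assumed) threshold, then
    return the connected components of that graph using union-find."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    sample = [
        "Mayo Clinic",
        "Mayo Clinic Rochester",
        "Arizona State University",
        "Arizona State Univ",
        "Lnx Research LLC",
    ]
    for group in cluster_names(sample):
        print(group)
```

Because the score is normalized by the shorter string, a short generic name that is fully contained in a longer one would be linked to it. A real system would likely need additional safeguards against that, such as the dictionaries and rules applied during extraction.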
NEMO is the missing link in associating each biomedical paper and its authors with an organization name in its canonical form and with the geopolitical location of the organization. This research could potentially help in analyzing large social networks of organizations to map the landscape of a particular topic, improving the performance of author disambiguation, adding weak links to the co-author network of authors, augmenting NLM's MARS system for correcting errors in the OCR output of the affiliation field, and automatically indexing PubMed citations with the normalized organization name and country. Our system is provided as a graphical user interface, available for download along with this paper.