National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, MD 20894, USA and Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd, Scottsdale, AZ 85259, USA.
Bioinformatics. 2013 Nov 15;29(22):2909-17. doi: 10.1093/bioinformatics/btt474. Epub 2013 Aug 21.
Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization, DNorm) than for other normalization tasks in biomedical text mining research.
In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval.
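The idea of learning mention-to-name similarity by pairwise ranking can be illustrated with a minimal sketch. It is not the DNorm implementation: the bilinear score `m^T W n` over normalized bag-of-words vectors, the margin of 1, the learning rate, and the three-concept toy lexicon are all assumptions made for illustration, and every function name is hypothetical.

```python
import random

# Toy lexicon and training pairs (hypothetical, for illustration only).
NAMES = ["breast neoplasms", "colorectal neoplasms", "myocardial infarction"]
PAIRS = [
    ("breast cancer", "breast neoplasms"),
    ("colorectal cancer", "colorectal neoplasms"),
    ("heart attack", "myocardial infarction"),
]


def build_vocab(texts):
    """Assign an integer index to every distinct lowercase token."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Sparse, L2-normalized bag-of-words vector as {index: weight}."""
    counts = {}
    for tok in text.lower().split():
        if tok in vocab:
            idx = vocab[tok]
            counts[idx] = counts.get(idx, 0.0) + 1.0
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {i: v / norm for i, v in counts.items()} if norm else {}


def score(W, m, n):
    """Assumed bilinear similarity: m^T W n."""
    return sum(mv * W[i][j] * nv for i, mv in m.items() for j, nv in n.items())


def train(pairs, names, vocab, epochs=20, lr=0.1, seed=0):
    """Pairwise ranking: push the correct name above every other name."""
    rng = random.Random(seed)
    size = len(vocab)
    # Start from the identity, so the initial score is plain token overlap.
    W = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    name_vecs = {name: vectorize(name, vocab) for name in names}
    order = list(pairs)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(order)
        for mention, correct in order:
            m = vectorize(mention, vocab)
            pos = name_vecs[correct]
            for wrong, neg in name_vecs.items():
                if wrong == correct:
                    continue
                # Margin ranking loss: update only when the margin is violated.
                if 1.0 - score(W, m, pos) + score(W, m, neg) > 0.0:
                    for i, mv in m.items():
                        for j, nv in pos.items():
                            W[i][j] += lr * mv * nv
                        for j, nv in neg.items():
                            W[i][j] -= lr * mv * nv
    return W


def normalize(mention, names, W, vocab):
    """Return the highest-scoring concept name for a mention."""
    m = vectorize(mention, vocab)
    return max(names, key=lambda name: score(W, m, vectorize(name, vocab)))


VOCAB = build_vocab(NAMES + [mention for mention, _ in PAIRS])
W = train(PAIRS, NAMES, VOCAB)
print(normalize("heart attack", NAMES, W, VOCAB))
```

In this toy setting the mention "heart attack" shares no tokens with "myocardial infarction", so the identity-initialized score is zero for every name; only the learned off-diagonal weights let the correct name rank first.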
We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively.
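The difference between the two reported averages can be sketched as follows. The abstract does not define them, so this sketch assumes that micro-averaging pools true positives, false positives and false negatives over all documents, while macro-averaging takes the mean of per-document F-measures. The concept identifiers and the function names are hypothetical placeholders.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f(gold, predicted):
    """Return (micro F, macro F) over per-document sets of concept IDs."""
    total_tp = total_fp = total_fn = 0
    per_doc = []
    for doc_id, gold_ids in gold.items():
        pred_ids = predicted.get(doc_id, set())
        tp = len(gold_ids & pred_ids)
        fp = len(pred_ids - gold_ids)
        fn = len(gold_ids - pred_ids)
        total_tp += tp
        total_fp += fp
        total_fn += fn
        per_doc.append(f_measure(tp, fp, fn))
    micro = f_measure(total_tp, total_fp, total_fn)
    macro = sum(per_doc) / len(per_doc) if per_doc else 0.0
    return micro, macro


# Hypothetical placeholder IDs: one document fully correct, one fully wrong.
GOLD = {"doc1": {"C1", "C2", "C3"}, "doc2": {"C4"}}
PRED = {"doc1": {"C1", "C2", "C3"}, "doc2": {"C5"}}

micro, macro = micro_macro_f(GOLD, PRED)
print(f"micro F = {micro:.3f}, macro F = {macro:.3f}")
```

Under these assumptions, a document with many concepts weighs more heavily in the micro average, whereas every document counts equally in the macro average, which is why the two figures can differ on the same output.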
The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator .