Pakhomov Serguei V S, Finley Greg, McEwan Reed, Wang Yan, Melton Genevieve B
College of Pharmacy, University of Minnesota, Minneapolis, MN 55455, USA.
Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, USA.
Bioinformatics. 2016 Dec 1;32(23):3635-3644. doi: 10.1093/bioinformatics/btw529. Epub 2016 Aug 16.
Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for clinical text in representing the semantics of clinical terms remains to be demonstrated.
We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (clinical notes, PubMed Central articles and Wikipedia) were compared using the benchmark as the reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network-based relatedness measures for query expansion in a clinical document retrieval task and a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms and that distributional semantic methods are useful for clinical and biomedical natural language processing applications.
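The kind of evaluation described above can be sketched as follows: score each term pair by the cosine similarity of its two term vectors, then compare those scores with human ratings using Spearman's rank correlation (rho). This is a minimal illustration only, not the study's software. The three-dimensional vectors, the term pairs and the ratings below are invented toy values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks of the values; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy term vectors (assumed values; real embeddings would be learned from a corpus).
vectors = {
    "diabetes": [0.9, 0.1, 0.2],
    "insulin": [0.8, 0.2, 0.3],
    "fracture": [0.1, 0.9, 0.1],
    "cast": [0.2, 0.8, 0.3],
    "aspirin": [0.4, 0.3, 0.8],
}

# Toy benchmark: (term 1, term 2, human rating); the ratings are invented.
benchmark = [
    ("diabetes", "insulin", 3.8),
    ("fracture", "cast", 3.5),
    ("diabetes", "fracture", 1.2),
    ("insulin", "aspirin", 2.0),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in benchmark]
human_scores = [rating for _, _, rating in benchmark]
print("Spearman rho:", round(spearman(model_scores, human_scores), 3))
```

Running the same comparison with vectors trained on different corpora (for example clinical notes versus biomedical articles) would give one rho per corpus, which is the form of the figures reported above.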
The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article.
Contact: pakh0002@umn.edu
Supplementary information: Supplementary data are available at Bioinformatics online.