Ning Wenxin, Yu Ming, Kong Dehua
Health Care Services Research Center, Department of Industrial Engineering, Tsinghua University, Beijing 100084, China.
J Biomed Inform. 2016 Dec;64:273-287. doi: 10.1016/j.jbi.2016.10.017. Epub 2016 Nov 1.
Semantic similarity estimation significantly promotes the understanding of natural language resources and supports medical decision making. Previous studies have investigated semantic similarity and relatedness estimation between biomedical terms using English-language resources such as SNOMED-CT or UMLS. However, very few studies have focused on the Chinese language, and technology for natural language processing and text mining of medical documents in China is urgently needed. Because no complete, publicly available biomedical ontology exists in China, we only have access to several modest-sized ontologies with no overlaps. Although these ontologies together do not cover biomedicine completely, each covers its own domain acceptably. In this paper, semantic similarity estimation between Chinese biomedical terms using these multiple non-overlapping ontologies was explored as an initial study.
Typical path-based and information content (IC)-based similarity measures were applied to these ontologies. Analysis of the computed similarity scores revealed heterogeneity in the statistical distributions of scores derived from different ontologies. This heterogeneity hampers the comparability of scores and the overall accuracy of similarity estimation. The problem was addressed with a novel language-independent method that combines semantic similarity estimation and score normalization. A reference standard was also created in this study.
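To make the two families of measures concrete, here is a minimal sketch on a toy is-a taxonomy. Everything in it is an assumption for illustration: the taxonomy, the term names, the inverse-path-length formula, and the intrinsic (structure-based) IC used with a Resnik-style measure. The abstract does not state which specific path-based or IC-based measures the authors used.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "disease": None,
    "infection": "disease",
    "pneumonia": "infection",
    "influenza": "infection",
    "diabetes": "disease",
}

def ancestors(term):
    """Return the chain from term up to the root, term itself included."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain

def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    anc_b = set(ancestors(b))
    for node in ancestors(a):
        if node in anc_b:
            return node
    raise ValueError("terms share no ancestor")

def path_similarity(a, b):
    """Path-based measure: inverse of (1 + number of edges between a and b)."""
    c = lcs(a, b)
    dist = ancestors(a).index(c) + ancestors(b).index(c)
    return 1.0 / (1.0 + dist)

def information_content(term):
    """Intrinsic IC: log of (all concepts / concepts subsumed by term)."""
    subsumed = sum(1 for t in PARENT if term in ancestors(t))
    return math.log(len(PARENT) / subsumed)

def resnik_similarity(a, b):
    """IC-based measure (Resnik-style): IC of the least common subsumer."""
    return information_content(lcs(a, b))

if __name__ == "__main__":
    for pair in [("pneumonia", "influenza"), ("pneumonia", "diabetes")]:
        print(pair, path_similarity(*pair), resnik_similarity(*pair))
```

Note that the two measures live on different scales: the path-based score is bounded in (0, 1], while the IC-based score depends on the size and shape of the ontology. That scale dependence is one plausible source of the cross-ontology heterogeneity described above.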
Compared with the existing task-independent normalization methods, the newly developed method exhibited superior performance on most IC-based similarity measures. The accuracy of semantic similarity estimation was enhanced through score normalization. This enhancement resulted from the mitigation of heterogeneity in the similarity scores derived from multiple ontologies.
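As an illustration of what task-independent score normalization looks like, the sketch below applies two generic techniques, min-max scaling and z-score standardization, separately to the scores from each ontology. These are assumed examples of the baseline category only; the abstract does not name the baselines, and this is not the authors' newly developed method. The ontology names and score values are hypothetical.

```python
from statistics import mean, pstdev

def min_max(scores):
    """Rescale scores linearly to the [0, 1] interval."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: all scores equal
    return [(s - lo) / (hi - lo) for s in scores]

def z_score(scores):
    """Standardize scores to zero mean and unit (population) variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # degenerate case: no spread
    return [(s - mu) / sigma for s in scores]

# Hypothetical raw similarity scores from two non-overlapping ontologies.
# The raw ranges differ, so the scores are not directly comparable.
scores_by_ontology = {
    "ontology_a": [0.25, 0.5, 0.75],
    "ontology_b": [2.0, 5.0, 8.0],
}

# Normalize within each ontology so scores share a common scale.
normalized = {name: min_max(vals) for name, vals in scores_by_ontology.items()}
standardized = {name: z_score(vals) for name, vals in scores_by_ontology.items()}

if __name__ == "__main__":
    print(normalized)
    print(standardized)
```

After per-ontology normalization, the middle score of each ontology maps to the same value, which is the sense in which normalization makes scores from different ontologies comparable.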
We demonstrated that score normalization may be necessary when estimating semantic similarity using ontology-based measures. The results of this study can also be extended to other languages to implement semantic similarity estimation in biomedicine.