Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
J Biomed Inform. 2022 Jul;131:104118. doi: 10.1016/j.jbi.2022.104118. Epub 2022 Jun 9.
To propose a new vector-based relatedness metric that derives word vectors from the intrinsic structure of biomedical ontologies, without consulting external resources such as large-scale biomedical corpora.
SNOMED CT on the mapping layer of UMLS was used as the testbed ontology. Vectors were created for every concept at the end of all semantic relations (attribute-value relations and descendants, as well as the is_a relation) of the defining concept. The cosine similarity between the averages of those vectors for each defining concept was computed to produce the final semantic relatedness score.
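The averaging-and-cosine step described above can be illustrated with a minimal sketch. The abstract does not specify how the individual concept vectors were constructed, so this example assumes, purely for illustration, that each related concept is a one-hot indicator vector; the toy ontology, concept names, and function names are all hypothetical and are not taken from SNOMED CT or from the study.

```python
import math
from collections import Counter

# Hypothetical toy ontology: defining concept -> list of (relation, target concept).
# Names are illustrative only; they are not real SNOMED CT content.
TOY_ONTOLOGY = {
    "pneumonia": [
        ("is_a", "lung_disease"),
        ("finding_site", "lung"),
        ("associated_morphology", "inflammation"),
    ],
    "bronchitis": [
        ("is_a", "lung_disease"),
        ("finding_site", "bronchus"),
        ("associated_morphology", "inflammation"),
    ],
    "fracture_of_femur": [
        ("is_a", "bone_injury"),
        ("finding_site", "femur"),
        ("associated_morphology", "fracture"),
    ],
}


def averaged_vector(concept, ontology):
    """Average the vectors of all concepts related to `concept`.

    Assumption: every related concept is a one-hot vector, so the average
    is stored sparsely as {target concept: share of relations}.
    """
    targets = [target for _relation, target in ontology[concept]]
    if not targets:
        return {}
    counts = Counter(targets)
    return {target: count / len(targets) for target, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(concept_a, concept_b, ontology=TOY_ONTOLOGY):
    """Relatedness as the cosine between the two averaged vectors."""
    return cosine(
        averaged_vector(concept_a, ontology),
        averaged_vector(concept_b, ontology),
    )


if __name__ == "__main__":
    print(relatedness("pneumonia", "bronchitis"))
    print(relatedness("pneumonia", "fracture_of_femur"))
```

In this toy setup, two concepts that share a parent and a morphology score higher than two concepts with no shared relation targets, which is the intuition behind combining attribute relations with the is_a hierarchy.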
Two benchmark sets comprising a total of 62 biomedical term pairs were used for evaluation. Spearman's rank correlation coefficient between the current method and the relatedness ratings of physicians, coders, and medical experts was 0.655, 0.744, and 0.742, respectively. The proposed method was comparable to a word-embedding method and outperformed path-based, information content-based, and another multiple relation-based relatedness metric.
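For readers unfamiliar with the evaluation statistic, the following is a minimal sketch of Spearman's rank correlation: both score lists are converted to ranks (ties receive the average rank), and the Pearson correlation of the ranks is returned. The function names and the sample scores are hypothetical; they are not the benchmark data or the code used in the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for five term pairs (not the study's data).
    method_scores = [0.10, 0.35, 0.50, 0.72, 0.90]
    human_ratings = [1.5, 1.0, 3.0, 2.5, 4.0]
    print(spearman(method_scores, human_ratings))
```

Because the statistic depends only on rank order, it compares how the method and the human raters order the term pairs rather than the absolute values of their scores.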
The current study demonstrated that adding attribute relations to the is_a hierarchy of SNOMED CT conforms better to the human sense of relatedness than models based on taxonomic relations. The approach also proved robust to design inconsistencies in ontologies.
Unlike the previous vector-based approach, the current study exploited the intrinsic semantic structure of an ontology, removing the need for external textual resources to obtain context information for defining terms. Future research should validate the current method with other biomedical ontologies.