School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA.
Department of Neurology, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA.
BMC Med Inform Decis Mak. 2023 May 9;23(Suppl 1):87. doi: 10.1186/s12911-023-02183-7.
Biomedical ontologies are representations of biomedical knowledge that provide terms with precisely defined meanings. They play a vital role in facilitating cross-disciplinary biomedical research. Quality issues in biomedical ontologies hinder their effective use. One such quality issue is missing concepts. In this study, we introduce a logical definition-based approach to identify potential missing concepts in SNOMED CT. A unique contribution of our approach is that it obtains both logical definitions and fully specified names for potential missing concepts.
The logical definitions of unrelated pairs of fully defined concepts in non-lattice subgraphs that indicate quality issues are intersected to generate the logical definitions of potential missing concepts. A text summarization model (called PEGASUS) is fine-tuned to predict the fully specified names of the potential missing concepts from their generated logical definitions. Furthermore, the identified potential missing concepts are validated using external resources including the Unified Medical Language System (UMLS), biomedical literature in PubMed, and a newer version of SNOMED CT.
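The intersection step can be illustrated with a minimal sketch. It assumes a logical definition is flattened into a set of (attribute, value) pairs; the function name `intersect_definitions` and the two example concepts are hypothetical illustrations, not actual SNOMED CT content, and the sketch ignores role groups and hierarchical subsumption between attribute values, which the actual method may handle.

```python
from typing import FrozenSet, Tuple

# Assumption: a logical definition is flattened into (attribute, value) pairs.
Relationship = Tuple[str, str]
LogicalDefinition = FrozenSet[Relationship]


def intersect_definitions(a: LogicalDefinition, b: LogicalDefinition) -> LogicalDefinition:
    """Return the relationships shared by both logical definitions."""
    return a & b


# Hypothetical pair of unrelated, fully defined concepts (illustrative only).
concept_a: LogicalDefinition = frozenset({
    ("Is a", "Disorder of lung"),
    ("Finding site", "Lung structure"),
    ("Associated morphology", "Inflammation"),
})
concept_b: LogicalDefinition = frozenset({
    ("Is a", "Disorder of lung"),
    ("Finding site", "Lung structure"),
    ("Associated morphology", "Abscess"),
})

# The shared relationships form the logical definition of a candidate concept.
candidate = intersect_definitions(concept_a, concept_b)
for attribute, value in sorted(candidate):
    print(f"{attribute}: {value}")
```

In this toy case the candidate keeps only the relationships common to both concepts; a naming model would then be asked to produce a fully specified name for that shared definition.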
From the March 2021 US Edition of SNOMED CT, we obtained a total of 30,313 unique logical definitions for potential missing concepts through the intersecting process. We fine-tuned a PEGASUS summarization model with 289,169 training instances and tested it on 36,146 instances. The model achieved a ROUGE-1 of 72.83, a ROUGE-2 of 51.06, and a ROUGE-L of 71.76 on the test dataset. The model correctly predicted 11,549 of the 36,146 (31.95%) fully specified names in the test dataset. Applying the fine-tuned model to the 30,313 unique logical definitions, we identified a total of 23,031 potential missing concepts. Of these, 2,312 (10.04%) were automatically validated by at least one of the three resources.
The results suggest that our logical definition-based approach for identifying potential missing concepts in SNOMED CT is promising. Nevertheless, there is still room for improving the performance of naming concepts based on logical definitions.
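To make the reported figures easier to interpret, the following sketch recomputes the two proportions from the counts above and shows a simplified unigram-overlap ROUGE-1 F1. The function `rouge1_f1` is an illustrative assumption that uses plain whitespace tokenization; it is not the evaluation code used in the study, and standard ROUGE packages typically apply additional normalization, so scores may differ.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each unigram counts at most as often as in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Proportions recomputed from the counts reported in the abstract.
exact_match_pct = 100 * 11549 / 36146   # correctly predicted names
validated_pct = 100 * 2312 / 23031      # automatically validated concepts

print(f"Exact-match rate: {exact_match_pct:.2f}%")
print(f"Validated share:  {validated_pct:.2f}%")

# Hypothetical name pair (illustrative only, not taken from the study).
score = rouge1_f1("inflammation of lung", "disorder of lung")
print(f"ROUGE-1 F1 on the toy pair: {score:.3f}")
```

The gap between the ROUGE-1 score and the exact-match rate reflects that a predicted name can share most of its words with the reference while still differing from it.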