Ontology enrichment using a large language model: Applying lexical, semantic, and knowledge network-based similarity for concept placement.

Authors

Kollapally Navya Martin, Geller James, Keloth Vipina Kuttichi, He Zhe, Xu Julia

Affiliations

Kean University, United States.

New Jersey Institute of Technology, United States.

Publication information

J Biomed Inform. 2025 Aug;168:104865. doi: 10.1016/j.jbi.2025.104865. Epub 2025 Jun 19.

Abstract

OBJECTIVE

Ontologies are essential for representing the knowledge of a domain. To be useful, an ontology must cover its domain comprehensively. Ontology enrichment therefore requires discovering new concepts to add, either because they were missed when the ontology was first built or because the state of the art has advanced and new real-world concepts have emerged. Our goal is to develop an automatic enrichment pipeline that uses a seed ontology, a Large Language Model (LLM), and a source of text. The pipeline is applied to the domain of Social Determinants of Health (SDoH), using PubMed as the source of concepts. In this work, the applicability and effectiveness of the enrichment pipeline are demonstrated by extending an SDoH ontology called SOHOv1; however, our methodology could be used in other domains as well.

METHODS

We first retrieved PubMed abstracts of candidate articles, using existing SOHOv1 concepts as search terms. Next, we used GPT-4-1201 to extract semantic triples from the abstracts. We then identified concepts in these triples using lexical, semantic, and knowledge network-based filtering. We also compared the granularity of the semantic triples extracted with our method to that of the triples in SemMedDB (Semantic MEDLINE Database). The results were evaluated by human experts and by standard ontology tools that check consistency and semantic correctness.
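
As a rough illustration of the lexical filtering step only, the sketch below keeps terms from extracted triples that are not already in the seed ontology but share tokens with a seed concept. The abstract does not say which similarity measures or thresholds were used, so the token-level Jaccard score, the 0.3 threshold, and the names `normalize`, `jaccard`, and `candidate_concepts` are all assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch: lexical filtering of triple terms against seed concepts.
# The similarity measure (token Jaccard) and the threshold are assumptions.

Triple = tuple[str, str, str]  # (subject, predicate, object)


def normalize(term: str) -> str:
    """Lowercase a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two terms."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def candidate_concepts(
    triples: list[Triple],
    seed_concepts: list[str],
    threshold: float = 0.3,  # hypothetical cut-off
) -> dict[str, tuple[str, float]]:
    """Map each new candidate term to its closest seed concept and score."""
    seed = sorted({normalize(c) for c in seed_concepts})
    candidates: dict[str, tuple[str, float]] = {}
    for subj, _pred, obj in triples:
        for term in (subj, obj):
            t = normalize(term)
            if not t or t in seed or not seed:
                continue  # skip empty terms and concepts already in the seed
            best = max(seed, key=lambda s: jaccard(t, s))
            score = jaccard(t, best)
            if score >= threshold:
                candidates[t] = (best, score)
    return candidates


# Made-up example data, not taken from the paper.
seed = ["Food insecurity", "Housing instability", "Education"]
triples = [
    ("Household food insecurity", "associated with", "depression"),
    ("housing instability", "increases", "Emergency department visits"),
]
print(candidate_concepts(triples, seed))
```

In a fuller pipeline, the closest seed concept returned here would only be a hint about where a candidate might be placed; the semantic and knowledge network-based filters described above, and the expert review, would still be needed.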

RESULTS

We expanded SOHOv1, which contained 173 concepts and 585 axioms (including 207 logical axioms), to SOHOv2, which contains 572 concepts and 1,542 axioms (including 725 logical axioms). Our methods identified more concepts than those extracted from SemMedDB for the same task. While we have shown the feasibility of our approach for an SDoH ontology, the methodology is generalizable to other ontologies, given an existing seed ontology and a text corpus.

CONCLUSIONS

The contributions of this work are: (1) extracting semantic triples from PubMed abstracts with GPT-4-1201 using prompt chaining; (2) showing the superiority of triples from GPT-4-1201 over triples from SemMedDB for SDoH; (3) combining lexical and semantic similarity search techniques with knowledge network-based search to identify the concepts to be added to the ontology; and (4) confirming the quality of the new concepts with human experts.
