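An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study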

Cao Lang, Sun Jimeng, Cross Adam

Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, United States.

Department of Pediatrics, University of Illinois College of Medicine Peoria, Peoria, IL, United States.

JMIR Med Inform. 2024 Dec 18;12:e60665. doi: 10.2196/60665.

BACKGROUND

Rare diseases collectively affect millions of people worldwide, but individual conditions can receive limited research attention because of their low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Revision (ICD-9) or Tenth Revision (ICD-10) codes and therefore cannot be reliably extracted from granular fields like "Diagnosis" and "Problem List" entries. This complicates tasks that require identifying patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks.

OBJECTIVE

Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease-related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD's performance, aiming to surpass common LLMs and traditional methods.

METHODS

AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system's performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology.
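The five stages named above can be illustrated with a minimal Python sketch. This is not the AutoRD implementation: the LLM calls are replaced by simple dictionary lookups, and every name here (`CANDIDATE_TERMS`, `TOY_RARE_DISEASES`, the `has_symptom` relation label, and all function names) is a hypothetical stand-in chosen for illustration. It only shows how the stages might hand data to one another, assuming calibration means refining coarse entity types against ontology-derived knowledge.

```python
import re

# Hypothetical stand-ins; a real system would query an LLM and ontology-derived graphs.
CANDIDATE_TERMS = {"cystic fibrosis": "disease", "asthma": "disease", "chronic cough": "symptom"}
TOY_RARE_DISEASES = {"cystic fibrosis"}  # toy substitute for an ontology lookup


def preprocess(text):
    """Step 1: collapse whitespace and lowercase the raw text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_entities(text):
    """Step 2 (stub for an LLM call): return (mention, coarse_type) pairs found in the text."""
    return [(term, etype) for term, etype in CANDIDATE_TERMS.items() if term in text]


def extract_relations(entities):
    """Step 3 (stub for an LLM call): link every disease to every symptom in the passage."""
    diseases = [m for m, t in entities if t == "disease"]
    symptoms = [m for m, t in entities if t == "symptom"]
    return [(d, "has_symptom", s) for d in diseases for s in symptoms]


def calibrate_entities(entities):
    """Step 4: refine coarse types using the toy ontology lookup."""
    return {m: "rare_disease" if m in TOY_RARE_DISEASES else t for m, t in entities}


def build_graph(text):
    """Step 5: run all stages and assemble nodes (with types) and edges (triples)."""
    clean = preprocess(text)
    entities = extract_entities(clean)
    edges = extract_relations(entities)
    nodes = calibrate_entities(entities)
    return {"nodes": nodes, "edges": edges}


graph = build_graph("Cystic fibrosis   often causes\nchronic cough.")
print(graph)
```

In this sketch the calibration step is where the ontology knowledge enters: the stubbed extractor only knows "disease", and the lookup promotes the mention to "rare_disease".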

RESULTS

On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information.
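For readers unfamiliar with the metric, the following sketch shows how an entity-level F1-score can be computed from gold and predicted annotations. It assumes exact matching on (mention, type) pairs; the abstract does not state which matching criterion the authors used, and the example annotations are invented for illustration only.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over sets of (mention, type) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # items that match exactly
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented toy annotations, not taken from RareDis2023.
gold = {
    ("cystic fibrosis", "rare_disease"),
    ("chronic cough", "symptom"),
    ("fatigue", "symptom"),
    ("asthma", "disease"),
}
predicted = {
    ("cystic fibrosis", "rare_disease"),
    ("chronic cough", "symptom"),
    ("fever", "symptom"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are recovered (recall 1/2), so F1 is their harmonic mean, 4/7.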

CONCLUSIONS

AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts.



