Cao Lang, Sun Jimeng, Cross Adam
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, United States.
Department of Pediatrics, University of Illinois College of Medicine Peoria, Peoria, IL, United States.
JMIR Med Inform. 2024 Dec 18;12:e60665. doi: 10.2196/60665.
Rare diseases affect millions worldwide, but individual diseases sometimes receive limited research attention because of their low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Edition (ICD-9) or Tenth Edition (ICD-10) codes and therefore cannot be reliably extracted from granular fields such as "Diagnosis" and "Problem List" entries. This complicates tasks that require identifying patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks.
Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease-related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD's performance, aiming to surpass common LLMs and traditional methods.
AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system's performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology.
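The stage order described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `fake_llm` stub stands in for GPT-4, `TOY_RARE_DISEASES` stands in for lookups against the Orphanet and Human Phenotype ontologies, and the prompt wording, the `text|label` output format, the entity labels, and the `produces` relation name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    label: str  # hypothetical labels: "rare_disease", "disease", "symptom", "sign"


Relation = Tuple[str, str, str]  # (head, relation, tail)

# Assumption: a toy set standing in for an ontology of rare disease names.
TOY_RARE_DISEASES: Set[str] = {"cystic fibrosis"}


def preprocess(text: str) -> str:
    """Collapse whitespace; real preprocessing would likely do more."""
    return " ".join(text.split())


def extract_entities(text: str, llm: Callable[[str], str]) -> List[Entity]:
    """Ask the model for 'text|label' lines and parse them."""
    raw = llm(f"List entities as 'text|label' lines:\n{text}")
    entities = []
    for line in raw.splitlines():
        if "|" in line:
            name, label = line.split("|", 1)
            entities.append(Entity(name.strip().lower(), label.strip()))
    return entities


def extract_relations(text: str, entities: List[Entity],
                      llm: Callable[[str], str]) -> List[Relation]:
    """Ask for 'head|relation|tail' lines; keep only known entities."""
    names = {e.text for e in entities}
    raw = llm(f"List relations as 'head|relation|tail' lines:\n{text}")
    relations = []
    for line in raw.splitlines():
        parts = [p.strip().lower() for p in line.split("|")]
        if len(parts) == 3 and parts[0] in names and parts[2] in names:
            relations.append((parts[0], parts[1], parts[2]))
    return relations


def calibrate(entities: List[Entity]) -> List[Entity]:
    """Correct disease vs rare_disease labels against the toy ontology."""
    fixed = []
    for e in entities:
        if e.label in ("disease", "rare_disease"):
            label = "rare_disease" if e.text in TOY_RARE_DISEASES else "disease"
            fixed.append(Entity(e.text, label))
        else:
            fixed.append(e)
    return fixed


def build_graph(relations: List[Relation]) -> Dict[str, List[Tuple[str, str]]]:
    """Store the graph as head -> list of (relation, tail) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for head, rel, tail in relations:
        graph.setdefault(head, []).append((rel, tail))
    return graph


def run_pipeline(text: str, llm: Callable[[str], str]):
    clean = preprocess(text)
    entities = extract_entities(clean, llm)
    relations = extract_relations(clean, entities, llm)
    entities = calibrate(entities)
    return entities, build_graph(relations)


def fake_llm(prompt: str) -> str:
    """Canned responses so the sketch runs without any API access."""
    if prompt.startswith("List entities"):
        return "Cystic fibrosis|disease\nchronic cough|symptom\nasthma|rare_disease"
    return "cystic fibrosis|produces|chronic cough"


entities, graph = run_pipeline(
    "Cystic  fibrosis often causes chronic cough; asthma is common.", fake_llm
)
```

In this toy run the calibration step relabels "cystic fibrosis" as a rare disease and demotes "asthma" to an ordinary disease, which is the kind of correction an ontology lookup can make after the model's first pass.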
On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information.
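For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for entity extraction under an assumed exact-match criterion on (mention, type) pairs; the matching rule actually used in the evaluation is not stated here, and the example entities are invented.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (mention text, entity type)


def precision_recall_f1(gold: Set[Entity],
                        pred: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match scoring: a prediction counts only if text and type both match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example annotations, for illustration only.
gold = {("cystic fibrosis", "rare_disease"), ("cough", "symptom"), ("fatigue", "sign")}
pred = {("cystic fibrosis", "rare_disease"), ("cough", "symptom"), ("fever", "symptom")}

precision, recall, f1 = precision_recall_f1(gold, pred)
```

Here two of three predictions are correct and two of three gold entities are found, so precision, recall, and F1 all equal 2/3.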
AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts.