Wu Guanchen, Ling Chen, Graetz Ilana, Zhao Liang
Department of Computer Science, Emory University, Atlanta, GA, United States.
Rollins School of Public Health, Emory University, Atlanta, GA, United States.
Front Big Data. 2024 Oct 7;7:1463543. doi: 10.3389/fdata.2024.1463543. eCollection 2024.
An ontology is a structured framework that categorizes entities, concepts, and relationships within a domain to facilitate shared understanding, and it is important in computational linguistics and knowledge representation. In this paper, we propose a novel framework to automatically extend an existing ontology from streaming data in a zero-shot manner. Specifically, the zero-shot ontology extension framework uses online and hierarchical clustering to integrate new knowledge into existing ontologies without substantial annotated data or domain-specific expertise. Focusing on the medical field, this approach leverages Large Language Models (LLMs) for two key tasks: Symptom Typing and Symptom Taxonomy among breast and bladder cancer survivors. Symptom Typing involves identifying and classifying medical symptoms from unstructured online patient forum data, while Symptom Taxonomy organizes and integrates these symptoms into an existing ontology. The combined use of online and hierarchical clustering enables real-time and structured categorization and integration of symptoms. The dual-phase model employs multiple LLMs to ensure accurate classification and seamless integration of new symptoms with minimal human oversight. The paper details the framework's development, experiments, quantitative analyses, and data visualizations, demonstrating its effectiveness in enhancing medical ontologies and advancing knowledge-based systems in healthcare.
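The combination of online clustering and ontology integration described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the algorithm, so the cosine-similarity threshold, the `OnlineClusterer` class, the `attach_to_ontology` helper, and the hand-written three-dimensional vectors (standing in for LLM-derived symptom embeddings) are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class OnlineClusterer:
    """Hypothetical single-pass clusterer: each incoming symptom joins the
    most similar existing cluster, or starts a new one if none is close."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # running mean vector per cluster
        self.counts = []            # number of members per cluster
        self.members = []           # symptom mentions per cluster

    def add(self, label, vec):
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Update the running mean of the matched cluster.
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[best], vec)
            ]
            self.counts[best] = n + 1
            self.members[best].append(label)
            return best
        # No cluster is similar enough: open a new one.
        self.centroids.append(list(vec))
        self.counts.append(1)
        self.members.append([label])
        return len(self.centroids) - 1


def attach_to_ontology(centroid, ontology_nodes):
    """Return the name of the existing ontology node closest to a centroid."""
    return max(ontology_nodes, key=lambda name: cosine(centroid, ontology_nodes[name]))


# Toy stream of (symptom mention, embedding); the vectors are made up.
stream = [
    ("nausea", [1.0, 0.1, 0.0]),
    ("queasiness", [0.9, 0.2, 0.0]),
    ("hot flashes", [0.0, 0.1, 1.0]),
]
# Toy existing ontology nodes with made-up embeddings.
ontology_nodes = {
    "gastrointestinal": [1.0, 0.0, 0.0],
    "vasomotor": [0.0, 0.0, 1.0],
}

clusterer = OnlineClusterer(threshold=0.8)
assignments = [clusterer.add(label, vec) for label, vec in stream]
parents = [attach_to_ontology(c, ontology_nodes) for c in clusterer.centroids]

for group, parent in zip(clusterer.members, parents):
    print(parent, "<-", group)
```

In this sketch, each symptom mention is processed once as it arrives, which is what makes the clustering "online"; attaching each cluster centroid to its nearest existing node is a simplified stand-in for the hierarchical integration step.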