Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA.
McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
J Biomed Semantics. 2024 Aug 10;15(1):14. doi: 10.1186/s13326-024-00318-x.
Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects.
ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, which creates challenges for automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.
In this study, we developed a cascaded framework that draws on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to improve the performance of domain-specific language models in automatically mapping vaccine names in clinical trials to VO concepts. The VO is a community-based ontology developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We first continued pre-training of the PubMedBERT model on clinical trial data, producing CTPubMedBERT. We then enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement came from fine-tuning on the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we combined a collection of pre-trained models through a weighted rule-based ensemble approach to normalize the vaccine corpus and improve accuracy. Ranking in concept normalization means ordering candidate concepts by priority so that the most suitable match for a given context can be identified. We ranked the top 10 candidate concepts for each mention, and our experimental results show that the proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy.
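The weighted rule-based ensemble and the top-k evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the model names, weights, concept IDs (`C1`, `C2`, `C3`), the exact-match rule, and the bonus value are all hypothetical, and the similarity scores are hard-coded rather than produced by real BERT encoders.

```python
from collections import defaultdict


def ensemble_rank(model_scores, weights, mention, labels, k=10, exact_bonus=1.0):
    """Rank candidate concepts by a weighted sum of per-model similarity scores.

    model_scores: {model_name: {concept_id: similarity}} (hypothetical inputs)
    weights:      {model_name: weight}
    labels:       {concept_id: preferred label}
    """
    combined = defaultdict(float)
    for model, scores in model_scores.items():
        for concept_id, score in scores.items():
            combined[concept_id] += weights[model] * score

    # Illustrative rule (an assumption): promote a candidate whose label
    # matches the mention exactly, ignoring case.
    for concept_id in combined:
        if labels.get(concept_id, "").lower() == mention.lower():
            combined[concept_id] += exact_bonus

    # Sort by descending score; break ties by concept ID for determinism.
    ranked = sorted(combined, key=lambda c: (-combined[c], c))
    return ranked[:k]


def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of mentions whose gold concept appears in the top k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy scores standing in for two encoders' similarity outputs.
model_scores = {
    "ctpubmedbert": {"C1": 0.80, "C2": 0.70},
    "sapbert": {"C1": 0.60, "C2": 0.90, "C3": 0.50},
}
weights = {"ctpubmedbert": 0.5, "sapbert": 0.5}
labels = {
    "C1": "influenza vaccine",
    "C2": "influenza virus vaccine",
    "C3": "measles vaccine",
}

ranking = ensemble_rank(model_scores, weights, "Influenza Vaccine", labels)
print(ranking)
print(top_k_accuracy([ranking], ["C1"], k=1))
```

In this toy case the weighted sum alone would place `C2` first; the exact-match rule moves `C1` to the top. Top-1 and top-10 accuracy as reported in the abstract correspond to calling `top_k_accuracy` with `k=1` and `k=10` over all annotated mentions.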
This study provides detailed insight into a cascaded framework of fine-tuned domain-specific language models that improves the mapping of vaccine names in clinical trials to VO concepts. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance this mapping.