Mapping Vaccine Names in Clinical Trials to Vaccine Ontology using Cascaded Fine-Tuned Domain-Specific Language Models.

Li Jianfu, Li Yiming, Pan Yuanyi, Guo Jinjing, Sun Zenan, Li Fang, He Yongqun, Tao Cui

The University of Texas Health Science Center at Houston.

University of Michigan Medical School.

Res Sq. 2023 Sep 27:rs.3.rs-3362256. doi: 10.21203/rs.3.rs-3362256/v1.

BACKGROUND

Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells that defend against the targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, which creates challenges for automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance.

RESULTS

In this study, we developed a cascaded framework that draws on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to improve the performance of domain-specific language models in automatically mapping vaccine names in clinical trials to VO. VO is a community-based ontology developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then further pre-trained the PubMedBERT model, yielding CTPubMedBERT. Next, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Fine-tuning on the Vaccine Ontology corpus and vaccine data from clinical trials then yielded the CTPubMedBERT + SAPBERT + VO model. Finally, we combined a collection of pre-trained models through a weighted rule-based ensemble approach to normalize the vaccine corpus and improve accuracy. In concept normalization, ranking orders the candidate concepts so that the most suitable match for a given context can be identified. We ranked the top 10 candidate concepts, and our experimental results show that the proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy.
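As an illustration of the ranking and evaluation steps described above, the following is a minimal sketch, not the authors' implementation. It assumes each model returns a similarity score per candidate concept and that the ensemble is a plain weighted sum of those scores; the abstract does not specify the actual ensemble rules. The function names `ensemble_rank` and `top_k_accuracy`, the weights, the scores, and the concept IDs (`VO_A`, `VO_B`, `VO_C`) are all hypothetical placeholders.

```python
from collections import defaultdict


def ensemble_rank(model_scores, weights, k=10):
    """Rank candidate concepts by a weighted sum of per-model scores.

    model_scores: {model_name: {concept_id: score}} (hypothetical inputs)
    weights:      {model_name: weight}; models without a weight count as 0
    Returns the k highest-scoring concept IDs, best first.
    """
    combined = defaultdict(float)
    for model, scores in model_scores.items():
        w = weights.get(model, 0.0)
        for concept_id, score in scores.items():
            combined[concept_id] += w * score
    # Sort by descending combined score; break ties by concept ID
    # so the ordering is deterministic.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept_id for concept_id, _ in ranked[:k]]


def top_k_accuracy(ranked_lists, gold, k):
    """Fraction of mentions whose gold concept appears in the top k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)


# Made-up scores for one vaccine mention; weights are assumptions.
scores = {
    "CTPubMedBERT": {"VO_A": 0.9, "VO_B": 0.6},
    "CTPubMedBERT+SAPBERT": {"VO_B": 0.8, "VO_C": 0.5},
    "CTPubMedBERT+SAPBERT+VO": {"VO_B": 0.7, "VO_A": 0.4},
}
weights = {
    "CTPubMedBERT": 0.2,
    "CTPubMedBERT+SAPBERT": 0.3,
    "CTPubMedBERT+SAPBERT+VO": 0.5,
}
ranking = ensemble_rank(scores, weights, k=10)
print(ranking)
```

Under these made-up inputs, a concept that several models score moderately well can outrank one that a single low-weight model scores highly. Top-1 and top-10 accuracy, as reported in the abstract, correspond to calling `top_k_accuracy` with `k=1` and `k=10` over all annotated mentions.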

CONCLUSION

This study presents a cascaded framework of fine-tuned domain-specific language models for mapping vaccine names in clinical trials to VO. By leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, the framework can significantly improve this mapping.
