Enhancing human phenotype ontology term extraction through synthetic case reports and embedding-based retrieval: A novel approach for improved biomedical data annotation.

Authors

Albayrak Abdulkadir, Xiao Yao, Mukherjee Piyush, Barnett Sarah S, Marcou Cherisse A, Hart Steven N

Affiliations

Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, United States of America.

Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States of America.

Publication information

J Pathol Inform. 2024 Nov 16;16:100409. doi: 10.1016/j.jpi.2024.100409. eCollection 2025 Jan.

DOI: 10.1016/j.jpi.2024.100409

PMID: 39720417

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11667693/

Abstract

With the increasing use of exome and genome sequencing in clinical and research genetics, accurate and automated extraction of Human Phenotype Ontology (HPO) terms from clinical texts has become imperative. Traditional methods for HPO term extraction, such as PhenoTagger, often face limitations in coverage and precision. In this study, we propose a novel approach that uses large language models (LLMs) to generate synthetic sentences with clinical context, which are then semantically encoded into vector embeddings. These embeddings are linked to HPO terms, creating a robust knowledge base that supports precise information retrieval. Our method circumvents the known issue of LLM hallucinations by storing and querying these embeddings within an actual database, ensuring accurate context matching without the need for a predictive model. We evaluated the performance of three different embedding models, all of which demonstrated substantial improvements over PhenoTagger. The top recall (sensitivity), precision (positive predictive value, PPV), and F1 were each 0.64, which were 31%, 10%, and 21% better than PhenoTagger, respectively. Furthermore, optimal performance was achieved when we combined the best-performing embedding model with PhenoTagger (the "Fused model"), resulting in recall (sensitivity), precision (PPV), and F1 values of 0.7 each, all 10% better than the best embedding model. Our findings underscore the potential of this integrated approach to enhance the precision and reliability of HPO term extraction, offering a scalable and effective solution for biomedical data annotation.
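The retrieval idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: `embed` is a toy bag-of-words stand-in for a neural embedding model, the knowledge-base sentences are hand-written placeholders for LLM-generated synthetic sentences, an in-memory list stands in for the database, the similarity threshold is arbitrary, and `fuse` assumes a simple set union because the abstract does not say how the embedding model and PhenoTagger outputs were combined.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for the toy embedding; not from the paper.
STOPWORDS = {"a", "an", "and", "the", "had", "has", "was", "is",
             "of", "with", "she", "he", "in", "at"}

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder "synthetic sentences", each linked to an HPO term ID.
KNOWLEDGE_BASE = [
    ("The boy had a seizure at school.", "HP:0001250"),
    ("She has a history of seizure episodes.", "HP:0001250"),
    ("Examination showed microcephaly in the newborn.", "HP:0000252"),
    ("Testing confirmed intellectual disability.", "HP:0001249"),
]
# Precompute embeddings once; a real system would store them in a database.
INDEX = [(embed(sentence), hpo_id) for sentence, hpo_id in KNOWLEDGE_BASE]

def retrieve(text, threshold=0.25):
    """Return HPO IDs whose stored sentences are similar enough to the query."""
    query = embed(text)
    return {hpo_id for vec, hpo_id in INDEX if cosine(query, vec) >= threshold}

def fuse(embedding_terms, tagger_terms):
    """Assumed fusion rule: union of both term sets."""
    return set(embedding_terms) | set(tagger_terms)

def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one annotated text."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    found = retrieve("The patient had a seizure and microcephaly.")
    # The second argument is a made-up tagger output, for illustration only.
    fused = fuse(found, {"HP:0001249"})
    print(sorted(found), sorted(fused))
```

Because the lookup returns only HPO IDs that were stored alongside the indexed sentences, it cannot produce a term that is absent from the knowledge base, which is the property the abstract relies on to avoid hallucinated output.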

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/94ff/11667693/26b58db295e5/gr1.jpg
