DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations.

Affiliations

Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Boulevard du Triomphe, CP 263, Brussels 1050, Belgium.

Machine Learning Group, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, Brussels 1050, Belgium.

Publication information

Database (Oxford). 2024 May 28;2024. doi: 10.1093/database/baae039.

Abstract

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools that help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to find the most informative subset of samples to label. After pre-annotating with PubTator 85 full-text articles from the Oligogenic Diseases Database (OLIDA) that contain the relevant relations, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset of 8442 fragments, 794 of them positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
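The abstract does not state which query strategy was used to pick the "most informative" fragments. As a minimal sketch, assuming uncertainty sampling for a binary classifier (one common AL strategy, not necessarily the one used in ALAMBIC), a selection step could look like the following. The function name and the scores are hypothetical.

```python
def select_most_uncertain(probabilities, batch_size):
    """Return indices of the unlabelled fragments to annotate next.

    Assumption: `probabilities` holds a model's predicted probability of the
    positive class for each unlabelled fragment. A fragment is treated as more
    informative the closer its score is to 0.5 (uncertainty sampling).
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda index: abs(probabilities[index] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical scores for six unlabelled text fragments.
pool_scores = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]
queried = select_most_uncertain(pool_scores, batch_size=2)
print(queried)  # indices of the two fragments closest to 0.5
```

In a full loop, the queried fragments would be labelled by an expert, added to the training set, and the model retrained before the next selection round.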
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
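To illustrate what a 4-ary gene-variant-gene-variant candidate is, and why the number of candidate fragments can grow so quickly, here is a hedged sketch that enumerates candidates from the gene and variant mentions found in one piece of text. The input format, the function name and the placeholder mentions are assumptions made for illustration; they do not reproduce the PubTator output format or the authors' extraction pipeline.

```python
from itertools import combinations, permutations


def candidate_combinations(genes, variants):
    """Enumerate candidate (gene, variant, gene, variant) tuples.

    Assumption: every pair of distinct genes combined with every ordered pair
    of distinct variants is a candidate, and an annotator or model later
    decides whether the candidate is a true digenic combination.
    """
    unique_genes = sorted(set(genes))
    unique_variants = sorted(set(variants))
    candidates = []
    for gene_a, gene_b in combinations(unique_genes, 2):
        for variant_a, variant_b in permutations(unique_variants, 2):
            candidates.append((gene_a, variant_a, gene_b, variant_b))
    return candidates


# Placeholder mentions, not taken from the corpus.
genes = ["GENE1", "GENE2", "GENE3"]
variants = ["c.100A>G", "p.Arg50Trp"]
candidates = candidate_combinations(genes, variants)
print(len(candidates))  # 3 gene pairs x 2 ordered variant pairs
```

Under this assumption the candidate count scales with the number of gene pairs times the number of ordered variant pairs, which is consistent with a large pool of fragments in which only a small share are positive instances.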


