

A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature.

Affiliations

BGI Research, Shenzhen, 518083, China.

Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.

Publication Information

Interdiscip Sci. 2024 Jun;16(2):333-344. doi: 10.1007/s12539-024-00605-2. Epub 2024 Feb 10.

Abstract

We report a combined manual annotation and deep-learning natural language processing study to achieve accurate entity extraction in hereditary disease related biomedical literature. A total of 400 full-text articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types: gene, variant, disease and species. Both a BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested. Due to the limited size of the manually annotated corpus, these NER models were fine-tuned in two phases. The F1-scores of the BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of the DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the variant entity type has been extracted by a large language model for the first time, achieving an F1-score comparable to that of the state-of-the-art variant extraction tool tmVar.
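As a minimal illustration of the annotation-agreement metric mentioned above, the sketch below computes a Jaccard index between two sets of entity annotations. It models each annotation as a (document, start, end, type) span and uses exact span matching; this representation and matching criterion are assumptions made for illustration only, since the abstract does not specify how matches between annotation sets were counted.

```python
from typing import Iterable, Set, Tuple

# Hypothetical annotation representation: (doc_id, start_offset, end_offset, entity_type).
Span = Tuple[str, int, int, str]

def jaccard_index(a: Iterable[Span], b: Iterable[Span]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over two sets of entity spans (exact matching)."""
    set_a: Set[Span] = set(a)
    set_b: Set[Span] = set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Toy example: our re-annotation vs. a publicly available annotation of the same article.
ours = [("PMID:1", 0, 5, "gene"), ("PMID:1", 20, 28, "disease")]
public = [("PMID:1", 0, 5, "gene"), ("PMID:1", 20, 30, "disease")]
print(f"Jaccard = {jaccard_index(ours, public):.3f}")  # 0.333: only the gene span matches exactly
```

In the toy example only one of three distinct spans agrees under exact matching, giving 1/3; the study reports an overall Jaccard index of 0.866 across the four entity types.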

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/52cd/11289304/0c9474a6f4cd/12539_2024_605_Fig1_HTML.jpg
