BGI Research, Shenzhen, 518083, China.
Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.
Interdiscip Sci. 2024 Jun;16(2):333-344. doi: 10.1007/s12539-024-00605-2. Epub 2024 Feb 10.
We report a combined manual annotation and deep-learning natural language processing study aimed at accurate entity extraction from hereditary disease-related biomedical literature. A total of 400 full-text articles were manually annotated according to published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The quality of our manual annotations was assessed by comparing the re-annotated results with the publicly available annotations. The overall Jaccard index was calculated to be 0.866 for the four entity types (gene, variant, disease and species). Both a BERT-based large named entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested. Owing to the limited size of the manually annotated corpus, these NER models were fine-tuned in two phases. The F1-scores of the BERT-based NER model for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of the DistilBERT-based model are 95.14%, 86.26%, 91.37% and 89.92%. Most importantly, the variant entity type has been extracted by a large language model for the first time, achieving an F1-score comparable to that of the state-of-the-art variant extraction tool tmVar.
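The paper does not include code in this record; the following is a minimal illustrative sketch of BERT-based token classification (NER) fine-tuning with the Hugging Face transformers library, covering the four entity types named in the abstract. The model checkpoint, BIO label scheme, toy training example and hyperparameters are assumptions for illustration only and are not taken from the paper, and the authors' two-phase fine-tuning procedure is not reproduced here.

```python
# Minimal sketch (not the authors' released code) of fine-tuning a BERT-style
# encoder for gene / variant / disease / species NER with Hugging Face
# transformers. Checkpoint, labels, data and hyperparameters are assumptions.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

# BIO tags for the four entity types annotated in the corpus.
LABELS = ["O",
          "B-Gene", "I-Gene",
          "B-Variant", "I-Variant",
          "B-Disease", "I-Disease",
          "B-Species", "I-Species"]
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

MODEL_NAME = "bert-base-cased"  # assumption; a biomedical checkpoint could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS), id2label=id2label, label2id=label2id)

def tokenize_and_align(example):
    """Tokenize pre-split words and align word-level BIO tags to sub-tokens."""
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    word_ids = enc.word_ids()
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)          # special tokens are ignored by the loss
        elif wid != prev:
            labels.append(label2id[example["ner_tags"][wid]])
        else:
            labels.append(-100)          # score only the first sub-token of a word
        prev = wid
    enc["labels"] = labels
    return enc

# Toy example standing in for the manually annotated full-text corpus.
train = Dataset.from_dict({
    "tokens": [["BRCA1", "c.68_69delAG", "causes", "breast", "cancer", "in", "humans"]],
    "ner_tags": [["B-Gene", "B-Variant", "O", "B-Disease", "I-Disease", "O", "B-Species"]],
}).map(tokenize_and_align, remove_columns=["tokens", "ner_tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()  # a single fine-tuning phase; the paper describes two phases
```

The same setup would apply to the DistilBERT-based simplified model by swapping in a DistilBERT checkpoint; evaluation against a held-out test split would then yield per-entity-type F1-scores of the kind reported in the abstract.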