tmVar 3.0: an improved variant concept recognition and normalization tool.

Author information

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.

Publication information

Bioinformatics. 2022 Sep 15;38(18):4449-4451. doi: 10.1093/bioinformatics/btac537.

DOI:10.1093/bioinformatics/btac537

PMID:35904569

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9477515/

Abstract

MOTIVATION

Previous studies have shown that automated text-mining tools are becoming increasingly important for successfully unlocking variant information in scientific literature at large scale. Despite multiple attempts in the past, existing tools are still of limited recognition scope and precision.

RESULTS

We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g. allele and copy number variants), and groups together different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download.
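The grouping of variant mentions and the F-measure reported above can be illustrated with a minimal sketch. This is not the tmVar 3.0 implementation: the regular expression, the partial amino acid map, and the helper names `normalize`, `group_mentions`, and `f_measure` are all hypothetical, and the sketch only handles simple protein substitutions rather than true genomic positions.

```python
import re
from collections import defaultdict

# Partial three-letter to one-letter amino acid map (illustrative subset only).
THREE_TO_ONE = {"Val": "V", "Glu": "E", "Gly": "G", "Arg": "R", "Leu": "L"}

# Hypothetical pattern for simple protein substitutions such as
# "V600E", "p.V600E" or "p.Val600Glu". Not taken from tmVar 3.0.
SUBSTITUTION = re.compile(r"^(?:p\.)?([A-Za-z]{1,3})(\d+)([A-Za-z]{1,3})$")


def normalize(mention):
    """Return a (ref, position, alt) key, or None if the mention is not recognized."""
    match = SUBSTITUTION.match(mention.strip())
    if match is None:
        return None
    ref, pos, alt = match.groups()
    # Map three-letter codes to one-letter codes; otherwise upper-case as given.
    ref = THREE_TO_ONE.get(ref.capitalize(), ref.upper())
    alt = THREE_TO_ONE.get(alt.capitalize(), alt.upper())
    if len(ref) != 1 or len(alt) != 1:
        return None
    return (ref, int(pos), alt)


def group_mentions(mentions):
    """Group surface forms from one article that share the same normalized key."""
    groups = defaultdict(list)
    for mention in mentions:
        key = normalize(mention)
        if key is not None:
            groups[key].append(mention)
    return dict(groups)


def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    if predicted == 0 or expected == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / expected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    article_mentions = ["p.V600E", "Val600Glu", "V600E", "G12R", "rs113488022"]
    for key, forms in group_mentions(article_mentions).items():
        print(key, forms)
    print(f_measure(90, 10, 10))
```

In this sketch, mentions that do not match the simple substitution pattern (for example an rsID) are skipped rather than guessed at. Grouping across DNA-level and protein-level mentions would need transcript mapping, which is outside its scope.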

AVAILABILITY AND IMPLEMENTATION

https://github.com/ncbi/tmVar3.
