National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA.
Bioinformatics. 2022 Sep 15;38(18):4449-4451. doi: 10.1093/bioinformatics/btac537.
Previous studies have shown that automated text-mining tools are becoming increasingly important for unlocking variant information in the scientific literature at large scale. Despite multiple past attempts, existing tools remain limited in recognition scope and precision.
We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g. allele and copy number variants), and groups together different variant mentions in an article that refer to the same genomic sequence position, for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance, with an F-measure above 90% for variant recognition and normalization when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download.
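To make the idea of grouping variant mentions concrete, the following is a minimal illustrative sketch. It is not tmVar 3.0's implementation. It assumes only simple protein-level substitutions written in one-letter or three-letter amino acid notation, and the names `normalize_mention` and `group_mentions` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import defaultdict

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumption: only substitutions such as "V600E", "Val600Glu" or
# "p.Val600Glu" are handled; real variant mentions are far more varied.
MENTION_RE = re.compile(
    r"^(?:p\.)?(?P<ref>[A-Z][a-z]{2}|[A-Z])"
    r"(?P<pos>\d+)"
    r"(?P<alt>[A-Z][a-z]{2}|[A-Z])$"
)


def _to_one_letter(code):
    """Return the one-letter code, or None for an unknown three-letter code."""
    return code if len(code) == 1 else THREE_TO_ONE.get(code)


def normalize_mention(mention):
    """Map a mention to a (ref, position, alt) key, or None if unparseable."""
    match = MENTION_RE.match(mention.strip())
    if match is None:
        return None
    ref = _to_one_letter(match.group("ref"))
    alt = _to_one_letter(match.group("alt"))
    if ref is None or alt is None:
        return None
    return (ref, int(match.group("pos")), alt)


def group_mentions(mentions):
    """Group surface forms that normalize to the same key, keeping input order."""
    groups = defaultdict(list)
    for mention in mentions:
        key = normalize_mention(mention)
        if key is not None:  # mentions outside the toy grammar are skipped
            groups[key].append(mention)
    return dict(groups)


if __name__ == "__main__":
    example = ["V600E", "p.Val600Glu", "Val600Glu", "G12D", "rs113488022"]
    for key, forms in group_mentions(example).items():
        print(key, forms)
```

In this sketch, three different spellings of the same substitution fall into one group, while identifiers outside the toy grammar (such as an rsID) are simply skipped. A real system would also need to reconcile DNA-level, protein-level and identifier-based mentions.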