tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine.

Affiliation

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA.

Publication information

Bioinformatics. 2018 Jan 1;34(1):80-87. doi: 10.1093/bioinformatics/btx541.

DOI:10.1093/bioinformatics/btx541

PMID:28968638

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5860583/

Abstract

MOTIVATION

Despite significant efforts in expert curation, clinical relevance about most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the variant biological function/disease impact is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques but are of limited use because their mutation extraction results are not standardized or integrated with curated data.

RESULTS

We propose an automatic method to extract and normalize variant mentions to unique identifiers (dbSNP RSIDs). Our method, in benchmarking results, demonstrates a high F-measure of ∼90% and compared favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research.
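The integration step described above can be illustrated with a minimal sketch. It is not the tmVar 2.0 implementation: the regular expression only catches explicit RSID mentions (tmVar also normalizes free-text variant mentions, which this does not attempt), and the function names, the toy identifiers, and the MAF values are all hypothetical. It also assumes the rare-variant set is drawn from the variants absent from ClinVar.

```python
import re

# Assumption: only explicit dbSNP identifiers such as "rs123" are matched.
# Free-text mentions (e.g. protein-level changes) are out of scope here.
RSID_PATTERN = re.compile(r"\brs(\d+)\b", re.IGNORECASE)


def extract_rsids(text):
    """Return the set of RSIDs mentioned in text, normalized to lowercase 'rs'."""
    return {"rs" + match.group(1) for match in RSID_PATTERN.finditer(text)}


def novel_and_rare(text_mined, clinvar_rsids, maf_by_rsid, maf_cutoff=0.01):
    """Split text-mined RSIDs into those absent from ClinVar and the rare subset.

    text_mined    -- set of RSIDs found in the literature
    clinvar_rsids -- set of RSIDs already curated in ClinVar (hypothetical input)
    maf_by_rsid   -- dict mapping RSID to minor allele frequency (hypothetical input)
    maf_cutoff    -- variants with MAF <= this value count as rare
    """
    novel = text_mined - clinvar_rsids
    # Variants without a known MAF are left out of the rare set.
    rare = {
        rsid
        for rsid in novel
        if rsid in maf_by_rsid and maf_by_rsid[rsid] <= maf_cutoff
    }
    return novel, rare


if __name__ == "__main__":
    # Toy sentence with made-up identifiers, for illustration only.
    sentence = "Variants rs111 and RS222 were found; rs333 too."
    mined = extract_rsids(sentence)
    novel, rare = novel_and_rare(
        mined,
        clinvar_rsids={"rs111"},
        maf_by_rsid={"rs222": 0.005, "rs333": 0.2},
    )
    print(sorted(novel))
    print(sorted(rare))
```

With real data, the ClinVar set and the MAF table would come from the respective database exports; how those are loaded is left open here.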

AVAILABILITY AND IMPLEMENTATION

The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

CONTACT

zhiyong.lu@nih.gov.
