National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA.
Bioinformatics. 2018 Jan 1;34(1):80-87. doi: 10.1093/bioinformatics/btx541.
Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are neither standardized nor integrated with curated data.
We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably with the state of the art. Next, we applied our approach to the entire PubMed and validated the results in two ways: by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and are not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating/prioritizing variants in genomic research.
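The filtering logic described above (normalize mentions to RSIDs, drop those already in ClinVar, keep rare variants with MAF ≤ 0.01) can be illustrated with a minimal Python sketch. This is not the tmVar 2.0 implementation: the regular expression only catches explicit "rs" identifiers, whereas tmVar also normalizes free-text mutation mentions. The function names, the placeholder RSIDs and the MAF values are all hypothetical, chosen only for illustration.

```python
import re

# Assumption: only explicit "rs<digits>" mentions are matched here.
# tmVar 2.0 itself handles far more varied mention forms.
RSID_PATTERN = re.compile(r"\brs(\d+)\b", re.IGNORECASE)


def extract_rsids(text):
    """Return the set of RSIDs in text, normalized to lowercase 'rs' + digits."""
    return {"rs" + match.group(1) for match in RSID_PATTERN.finditer(text)}


def novel_rare_variants(mined, clinvar_rsids, maf_by_rsid, maf_cutoff=0.01):
    """Keep mined RSIDs that are absent from ClinVar and have a known MAF <= cutoff."""
    return {
        rsid
        for rsid in mined
        if rsid not in clinvar_rsids
        and rsid in maf_by_rsid
        and maf_by_rsid[rsid] <= maf_cutoff
    }


# Toy inputs: placeholder identifiers and made-up frequencies, not real data.
sentence = "We observed RS111 and rs222 in patients; rs333 was seen in controls."
mined = extract_rsids(sentence)
clinvar = {"rs222"}                                   # hypothetical ClinVar subset
maf = {"rs111": 0.0005, "rs222": 0.002, "rs333": 0.2}  # hypothetical MAF table

rare_novel = novel_rare_variants(mined, clinvar, maf)
print(sorted(rare_novel))
```

In this toy run, rs222 is excluded because it is already in the hypothetical ClinVar set, and rs333 is excluded because its made-up MAF exceeds the cutoff. Variants with no MAF entry are dropped rather than assumed rare, which is a design choice of this sketch and not something the abstract specifies.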
The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.