tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine.

Affiliation

National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA.

Publication information

Bioinformatics. 2018 Jan 1;34(1):80-87. doi: 10.1093/bioinformatics/btx541.

Abstract

MOTIVATION

Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are neither standardized nor integrated with curated data.

RESULTS

We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably with the state of the art. Next, we applied our approach to all of PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and are not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.
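The two filtering steps described above (keeping text-mined RSIDs that are absent from ClinVar, then keeping rare variants with MAF ≤ 0.01) can be illustrated with a minimal Python sketch. This is not the tmVar 2.0 implementation: the `Variant` record, the function names, the RSIDs, the gene names, and the MAF values are all made up for illustration, and applying the rare-variant filter to the novel set is an assumption, since the abstract does not state which set it was applied to.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one normalized, text-mined variant."""
    rsid: str              # dbSNP identifier (placeholder values below)
    gene: str              # gene associated with the variant
    maf: Optional[float]   # minor allele frequency; None when unknown


def novel_variants(text_mined: Iterable[Variant], clinvar_rsids: Set[str]) -> List[Variant]:
    """Keep variants whose RSID does not appear in the ClinVar RSID set."""
    return [v for v in text_mined if v.rsid not in clinvar_rsids]


def rare_variants(variants: Iterable[Variant], maf_cutoff: float = 0.01) -> List[Variant]:
    """Keep variants with a known MAF at or below the cutoff."""
    return [v for v in variants if v.maf is not None and v.maf <= maf_cutoff]


# Toy data; none of these identifiers or frequencies come from the paper.
text_mined = [
    Variant("rs1", "GENE_A", 0.002),
    Variant("rs2", "GENE_B", 0.25),
    Variant("rs3", "GENE_A", None),
    Variant("rs4", "GENE_C", 0.01),
]
clinvar_rsids = {"rs2"}

novel = novel_variants(text_mined, clinvar_rsids)
rare = rare_variants(novel)

print(len(novel), "novel RSIDs in", len({v.gene for v in novel}), "genes")
print(len(rare), "rare variants in", len({v.gene for v in rare}), "genes")
```

Variants with an unknown MAF are excluded from the rare set here rather than treated as rare; that is a design choice of this sketch, not something the abstract specifies.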

AVAILABILITY AND IMPLEMENTATION

The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

CONTACT

zhiyong.lu@nih.gov.
