Using 2k + 2 bubble searches to find single nucleotide polymorphisms in k-mer graphs.

Author information

Younsi Reda, MacLean Dan

Affiliation

The Sainsbury Laboratory, Norwich Research Park, Norwich NR4 7UH, UK.

Publication information

Bioinformatics. 2015 Mar 1;31(5):642-6. doi: 10.1093/bioinformatics/btu706. Epub 2014 Oct 24.

Abstract

MOTIVATION

Single nucleotide polymorphism (SNP) discovery is an important preliminary step in understanding genetic variation. With current sequencing methods, we can sample genomes comprehensively. SNPs are typically found by aligning sequence reads against longer assembled references. De Bruijn graphs are efficient data structures that can handle the vast amount of data produced by modern sequencing technologies. Recent work has shown that the topology of these graphs captures enough information to allow the detection and characterization of genetic variants, offering an alternative to alignment-based methods. Such graph-based methods rely on depth-first walks of the graph to identify closing bifurcations. They tend either to be conservative or to generate many false-positive results, particularly when traversing highly inter-connected (complex) regions of the graph or regions of very high coverage.

RESULTS

We devised an algorithm that calls SNPs in converted De Bruijn graphs by enumerating 2k + 2 cycles. We evaluated the accuracy of predicted SNPs by comparison with SNP lists from alignment-based methods. We tested the accuracy of SNP calling using sequence data from 16 ecotypes of Arabidopsis thaliana and found that accuracy was high. We found that SNP calls were evenly distributed across the genome and across genomic feature types. Using sequence-based attributes of the graph to train a decision tree allowed us to increase the accuracy of SNP calls further. Together, these results indicate that our algorithm is capable of finding SNPs accurately in complex sub-graphs and potentially comprehensively from whole genome graphs.
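To illustrate where the figure 2k + 2 comes from, the following is a minimal Python sketch, not the authors' C++ implementation. It assumes a graph whose nodes are k-mers and whose edges join k-mers overlapping by k - 1 bases, and two equal-length sequences that differ at one position lying at least k bases from either end. Under those assumptions, each allele contributes k k-mers of its own, and together with the shared k-mer on either side they form a closed bubble of 2k + 2 nodes. The function names `kmers` and `snp_bubble` and the example sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def snp_bubble(ref, alt, k):
    """Return (source, ref_branch, alt_branch, sink) for two equal-length
    sequences that differ at a single position.

    source and sink are the shared k-mers flanking the variant; the two
    branches hold the k-mers unique to each allele.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length")
    ref_kmers, alt_kmers = kmers(ref, k), kmers(alt, k)
    # Positions at which the two k-mer paths disagree.
    diff = [i for i, (a, b) in enumerate(zip(ref_kmers, alt_kmers)) if a != b]
    if not diff:
        raise ValueError("sequences are identical")
    start, end = diff[0] - 1, diff[-1] + 1
    if start < 0 or end >= len(ref_kmers):
        raise ValueError("variant is too close to a sequence end")
    source, sink = ref_kmers[start], ref_kmers[end]
    return source, ref_kmers[start + 1:end], alt_kmers[start + 1:end], sink


# Toy example (illustrative data only): a C/T difference at index 5.
k = 4
source, ref_branch, alt_branch, sink = snp_bubble("TTGACCATGGCA",
                                                  "TTGACTATGGCA", k)
bubble_size = 1 + len(ref_branch) + len(alt_branch) + 1
print(source, ref_branch, alt_branch, sink, bubble_size)
```

In this toy case each branch holds k = 4 k-mers, so the bubble contains 2k + 2 = 10 nodes. Real read data add sequencing errors, repeats and coverage variation, which this sketch deliberately ignores.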

AVAILABILITY AND IMPLEMENTATION

The source code for a C++ implementation of our algorithm is available under the GNU General Public Licence v3 at: https://github.com/danmaclean/2kplus2. The datasets used in this study are available at the European Nucleotide Archive, reference ERP000565, http://www.ebi.ac.uk/ena/data/view/ERP000565.
