Mapping-free variant calling using haplotype reconstruction from k-mer frequencies.

Author affiliation

School of Biology, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Publication information

Bioinformatics. 2018 May 15;34(10):1659-1665. doi: 10.1093/bioinformatics/btx753.

Abstract

MOTIVATION

The standard protocol for detecting variation in DNA is to map millions of short sequence reads to a known reference and find loci that differ. While this approach works well, it cannot be applied where the sample contains dense variants or is too distant from known references. De novo assembly or hybrid methods can recover genomic variation, but the cost of computation is often much higher. We developed a novel k-mer algorithm and software implementation, Kestrel, capable of characterizing densely packed SNPs and large indels without mapping, assembly or de Bruijn graphs.
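The abstract does not describe Kestrel's algorithm in detail, so the following is only a minimal sketch of the general kind of signal a k-mer frequency approach can work from, not Kestrel's Java implementation. It counts k-mers in the reads, looks up the count of each reference k-mer, and reports stretches where those counts fall below a threshold, which is where the sample likely differs from the reference. The function names (`count_kmers`, `reference_kmer_depth`, `low_depth_spans`), the threshold `min_count`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer that occurs in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def reference_kmer_depth(reference, counts, k):
    """Look up the read-derived count of each k-mer along the reference."""
    return [counts[reference[i:i + k]] for i in range(len(reference) - k + 1)]


def low_depth_spans(depths, min_count):
    """Return (start, end) k-mer index ranges, end exclusive, with depth < min_count."""
    spans, start = [], None
    for i, depth in enumerate(depths):
        if depth < min_count and start is None:
            start = i                      # a low-count stretch begins
        elif depth >= min_count and start is not None:
            spans.append((start, i))       # the stretch ended before index i
            start = None
    if start is not None:                  # stretch runs to the end of the reference
        spans.append((start, len(depths)))
    return spans


# Toy data (hypothetical): the sample carries one SNP relative to the reference.
reference = "ACGTTGCAAGCT"
sample = "ACGTTGTAAGCT"          # C -> T at 0-based position 6
reads = [sample] * 3             # three identical error-free "reads"

k = 4
depths = reference_kmer_depth(reference, count_kmers(reads, k), k)
print(depths)                    # [3, 3, 3, 0, 0, 0, 0, 3, 3]
print(low_depth_spans(depths, min_count=2))   # [(3, 7)]
```

In this toy case the four reference k-mers that overlap the substituted base are absent from the reads, so their counts drop to zero while the flanking k-mers keep full depth. A real tool would then have to reconstruct the sample's sequence across such a span and align it to the reference to call the variant; that step is not shown here.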

RESULTS

When applied to mosaic penicillin binding protein (PBP) genes in Streptococcus pneumoniae, we found near perfect concordance with assembled contigs at a fraction of the CPU time. Multilocus sequence typing (MLST) with this approach was able to bypass de novo assemblies. Kestrel has a very low false-positive rate when applied to the whole genome, and while Kestrel identified many variants missed by other methods, limitations of a purely k-mer based approach affect overall sensitivity.

AVAILABILITY AND IMPLEMENTATION

Source code and documentation for a Java implementation of Kestrel can be found at https://github.com/paudano/kestrel. All test code for this publication is located at https://github.com/paudano/kescases.

CONTACT

paudano@gatech.edu or fredrik.vannberg@biology.gatech.edu.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
