Fast and SNP-tolerant detection of complex variants and splicing in short reads.

Affiliation

Department of Bioinformatics, Genentech, Inc., 1 DNA Way, South San Francisco, CA, USA.

Publication information

Bioinformatics. 2010 Apr 1;26(7):873-81. doi: 10.1093/bioinformatics/btq057. Epub 2010 Feb 10.

Abstract

MOTIVATION

Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.
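As a rough illustration of what "merging and filtering position lists from a genomic index" can look like, the Python sketch below builds a k-mer index, merges the position lists for every k-mer in a read into votes for candidate start positions, and then filters the candidates by mismatch count. It is a simplified seed-voting scheme written for this summary, not GSNAP's actual algorithm; the function names, the toy genome, and the parameters `k`, `min_hits`, and `max_mismatches` are all assumptions.

```python
from collections import Counter, defaultdict


def build_index(genome, k):
    """Map every k-mer to the list of genome positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_starts(read, index, k, min_hits):
    """Merge position lists: each k-mer hit votes for an implied read start."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    # Filter: keep starts supported by at least min_hits k-mers.
    return sorted(s for s, n in votes.items() if n >= min_hits and s >= 0)


def align(read, genome, k=4, min_hits=2, max_mismatches=2):
    """Return (start, mismatches) for candidates within the mismatch limit."""
    index = build_index(genome, k)
    hits = []
    for start in candidate_starts(read, index, k, min_hits):
        segment = genome[start:start + len(read)]
        if len(segment) < len(read):
            continue  # candidate runs off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, segment))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "ACGTTGCAAGGCTTAACCGGTATCGA"  # toy sequence, not real data
read = "GCAAGGGTTAAC"  # genome[5:17] with one substituted base
print(align(read, genome))  # prints [(5, 1)]
```

The k-mers that overlap the substituted base find no hits, but the remaining k-mers still agree on start position 5, which is why a voting threshold lower than the total number of k-mers tolerates mismatches.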

RESULTS

In comparison testing, GSNAP has speeds comparable to existing programs, especially in reads of ≥70 nt, and is fastest in detecting complex variants with four or more mismatches or insertions of 1-9 nt and deletions of 1-30 nt. Although SNP tolerance does not increase alignment yield substantially, it affects alignment results in 7-8% of transcriptional reads, typically by revealing alternate genomic mappings for a read. Simulations of bisulfite-converted DNA show a decrease in identifying genomic positions uniquely in 6% of 36 nt reads and 3% of 70 nt reads.
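To make the effect of SNP tolerance concrete, here is a minimal Python sketch in which a read base counts as a match if it equals either the reference (major) allele or a known minor allele at that position. The function `count_mismatches`, the dictionary representation of SNPs, and the toy sequences are assumptions for illustration; the abstract does not describe how GSNAP represents alleles internally.

```python
def count_mismatches(read, reference, start, minor_alleles=None):
    """Count mismatches of `read` placed at `start` on `reference`.

    `minor_alleles` maps a reference position to its known minor allele
    (hypothetical representation); a read base matching it is not penalized.
    """
    minor_alleles = minor_alleles or {}
    count = 0
    for i, base in enumerate(read):
        pos = start + i
        if base != reference[pos] and base != minor_alleles.get(pos):
            count += 1
    return count


reference = "ACGTTGCAAGGC"  # toy sequence, not real data
snps = {4: "C"}             # hypothetical SNP: major T, minor C at position 4
read = "GTCGCA"             # reference[2:8] is "GTTGCA"

print(count_mismatches(read, reference, 2))        # prints 1
print(count_mismatches(read, reference, 2, snps))  # prints 0
```

Under a fixed mismatch limit, a placement that fails against the reference alone can pass once minor alleles are accepted, which is one way additional genomic mappings for a read could appear.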

AVAILABILITY

Source code in C and utility programs in Perl are freely available for download as part of the GMAP package at http://share.gene.com/gmap.
