Department of Microbiology, University of Massachusetts, Amherst, Massachusetts, United States of America.
PLoS One. 2009 Dec 14;4(12):e8230. doi: 10.1371/journal.pone.0008230.
As the scope of microbial surveys expands with the parallel growth in sequencing capacity, a significant bottleneck in data analysis is the ability to generate a biologically meaningful multiple sequence alignment. The most commonly used aligners vary in alignment quality and speed, tend to depend on a specific reference alignment, or lack a complete description of the underlying algorithm. The purpose of this study was to create and validate an aligner that quickly generates a high-quality alignment and has the flexibility to use any reference alignment. Using the simple nearest alignment space termination algorithm, the resulting aligner operates in linear time, has a small memory footprint, and generates a high-quality alignment. In addition, the alignments generated for variable regions were as high in quality as the alignments of full-length sequences. As implemented, the method was able to align 18 full-length 16S rRNA gene sequences and 58 V2 region sequences per second to the 50,000-column SILVA reference alignment. Most importantly, the resulting alignments were of a quality equal to SILVA-generated alignments. The aligner described in this study will enable scientists to rapidly generate robust multiple sequence alignments that are implicitly based upon the predicted secondary structure of the 16S rRNA molecule. Furthermore, because the implementation is not connected to a specific database, it is easy to generalize the method to reference alignments for any DNA sequence.
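Because the abstract reports linear-time behavior together with per-second throughput, run time for a larger dataset can be roughly extrapolated. The following is a minimal sketch, assuming the reported rates (18 full-length and 58 V2 region sequences per second) hold constant as the dataset grows; actual rates depend on hardware and on the reference alignment used. The function and constant names are hypothetical and are not part of the published software.

```python
# Throughput figures reported in the abstract (sequences aligned per second).
# Assumption: these rates stay constant for any dataset size (linear time).
FULL_LENGTH_PER_SEC = 18
V2_REGION_PER_SEC = 58


def estimated_seconds(n_sequences: int, seqs_per_second: float) -> float:
    """Estimate alignment time in seconds under a linear-time assumption."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if seqs_per_second <= 0:
        raise ValueError("seqs_per_second must be positive")
    return n_sequences / seqs_per_second


if __name__ == "__main__":
    # Print rough estimates for a few hypothetical dataset sizes.
    for n in (1_000, 100_000, 1_000_000):
        full_hours = estimated_seconds(n, FULL_LENGTH_PER_SEC) / 3600
        v2_hours = estimated_seconds(n, V2_REGION_PER_SEC) / 3600
        print(f"{n:>9} sequences: full-length ~{full_hours:.2f} h, V2 ~{v2_hours:.2f} h")
```

Under this assumption, doubling the number of sequences doubles the estimated time, which is what linear scaling means in practice for planning large surveys.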