Woodcroft Ben J, Boyd Joel A, Tyson Gene W
Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD 4072, Australia.
Bioinformatics. 2016 Sep 1;32(17):2702-3. doi: 10.1093/bioinformatics/btw241. Epub 2016 May 3.
Finding and translating stretches of DNA lacking stop codons is a common task in the analysis of sequence data. However, the computational tools for finding open reading frames are slow enough that they are becoming a bottleneck as the volume of sequence data grows. This bottleneck is especially problematic in metagenomics, when searching unassembled reads or screening assembled contigs for genes of interest. Here, we present OrfM, a tool that rapidly identifies open reading frames (ORFs) in sequence data by applying the Aho-Corasick algorithm to find regions uninterrupted by stop codons. Benchmarking revealed that OrfM finds the same ORFs as similar tools ('GetOrf' and 'Translate') but is four to five times faster. While OrfM is sequencing platform-agnostic, it is best suited to large, high-quality datasets such as those produced by Illumina sequencers.
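The idea described above, locating stop codons with Aho-Corasick and reporting the stretches between them, can be illustrated with a minimal Python sketch. This is not OrfM's C implementation: the function names (`build_automaton`, `find_stops`, `find_orfs`) and the `min_length` parameter are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and only the three forward-strand frames are handled (a full six-frame search would also need the reverse complement).

```python
from collections import deque

# Assumption: the standard stop codons; other genetic codes differ.
STOP_CODONS = ("TAA", "TAG", "TGA")


def build_automaton(patterns):
    """Build an Aho-Corasick automaton: goto table, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on reaching each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0,
    # deeper states inherit from their parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_stops(seq, automaton):
    """Return start positions of every stop codon in seq, in one pass."""
    goto, fail, out = automaton
    state = 0
    stops = []
    for i, ch in enumerate(seq):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            stops.append(i - len(pat) + 1)
    return stops


def find_orfs(seq, min_length=9):
    """Return (start, nucleotides) for forward-frame stretches without stops.

    min_length is a hypothetical cutoff in nucleotides, not OrfM's default.
    """
    seq = seq.upper()
    stops = find_stops(seq, build_automaton(STOP_CODONS))
    orfs = []
    for frame in range(3):
        start = frame
        # Only stops that are in this reading frame interrupt it.
        frame_stops = [p for p in stops if p % 3 == frame]
        # Sentinel: end of the last complete codon in this frame.
        frame_end = frame + (len(seq) - frame) // 3 * 3
        for stop in frame_stops + [frame_end]:
            if stop - start >= min_length:
                orfs.append((start, seq[start:stop]))
            start = stop + 3
    return orfs
```

Calling `find_orfs("ATGAAATAGCCCGGGTTTAAA", min_length=6)` returns the stop-free stretches in each forward frame together with their start offsets. The automaton is built once and the sequence is scanned once, however many patterns are searched, which is the property that makes Aho-Corasick attractive for this task.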
Source code and binaries are freely available under the LGPL 3+ license, for download at http://github.com/wwood/OrfM or through GNU Guix. OrfM is implemented in C and supported on GNU/Linux and OS X.
Supplementary data are available at Bioinformatics online.