Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA.
Bioinformatics. 2010 May 1;26(9):1145-51. doi: 10.1093/bioinformatics/btq102. Epub 2010 Mar 5.
Comparative genomics relies heavily on alignments of large and often complex DNA sequences. From an engineering perspective, the challenge is to provide maximum sensitivity (to find all there is to find), specificity (to find only real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes).
Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous 'battleship'-like search that makes it possible to align two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine.
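To illustrate strategy (i), the following is a minimal sketch of cross-correlating two DNA sequences with a fast Fourier transform. It is not Satsuma's code: the one-indicator-signal-per-base encoding, the function names `fft` and `match_counts`, and the example sequences are all assumptions made for illustration. The sketch relies on the correlation theorem, which gives the number of matching bases at every relative shift in a single pass instead of one comparison per shift.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two.

    With invert=True the exponent sign is flipped, but the result is NOT
    divided by n; the caller applies that scaling.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(query, target):
    """Return {shift: matches}, where shift s aligns query[i] with target[i + s].

    Assumed encoding (hypothetical, for illustration): one 0/1 indicator
    signal per base, correlated separately and then summed.
    """
    # Zero-pad to a power of two that is long enough to avoid wrap-around.
    size = 1
    while size < len(query) + len(target):
        size *= 2
    totals = [0.0] * size
    for base in "ACGT":
        q = [1.0 if c == base else 0.0 for c in query] + [0.0] * (size - len(query))
        t = [1.0 if c == base else 0.0 for c in target] + [0.0] * (size - len(target))
        fq, ft = fft(q), fft(t)
        # Correlation theorem: corr = IFFT(conj(FFT(q)) * FFT(t)).
        product = [a.conjugate() * b for a, b in zip(fq, ft)]
        corr = fft(product, invert=True)
        for i, value in enumerate(corr):
            totals[i] += value.real / size
    # Negative shifts are stored at the end of the circular result.
    return {
        shift: round(totals[shift % size])
        for shift in range(-(len(query) - 1), len(target))
    }


counts = match_counts("ACGT", "TTACGTTT")
best_shift = max(counts, key=counts.get)
print(best_shift, counts[best_shift])
```

For the toy input above, the highest count is 4 at shift 2, the position where `ACGT` occurs in the target. A production aligner would use an optimized FFT library and would still need a separate step to decide which peaks are significant, which is the role of the scoring scheme in (ii).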
Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/.