González Alvaro J, Liao Li
Laboratory of Bioinformatics, Computer and Information Sciences Department, University of Delaware, 421 Smith Hall, Newark, DE 19716, USA.
BMC Bioinformatics. 2008 Feb 18;9:102. doi: 10.1186/1471-2105-9-102.
At intermediate stages of a genome assembly project, once a number of contigs have been generated and need to be validated, it is desirable to align them to a reference genome if one is available. The goal is not a detailed base-level alignment between a contig and the reference genome, but a rough estimate of where the contig aligns to the reference, specifically the starting and ending positions of that region. This information is very useful for ordering the contigs, and it facilitates post-assembly analysis such as gap closure and repeat resolution. Programs such as BLAST and MUMmer can quickly align two sequences and identify high-similarity segments between them. When seen in a dot plot, these segments tend to agglomerate along a diagonal, but they can also be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. Visually inspecting dot plots to identify the regions covered by the large number of contigs from a sequence assembly project is tedious and practically impossible. A forced global alignment between a contig and the reference is not only time-consuming but often meaningless.
We have developed an algorithm that uses the coordinates of all the exact matches or high similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest.
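The idea described above can be illustrated with a minimal Python sketch. It is not the published implementation. The seed format `(contig_start, ref_start, length)`, the use of match length as the regression weight, the weighted-median step for selecting the dominant diagonal, and the `max_offset` tolerance are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical seed format: (contig_start, ref_start, length) of one exact
# match or high-similarity local alignment, e.g. parsed from aligner output.
Match = Tuple[int, int, int]


def weighted_median(values: List[float], weights: List[float]) -> float:
    """Return the value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value
    return pairs[-1][0]


def weighted_fit(matches: List[Match]) -> Tuple[float, float]:
    """Weighted least-squares line ref = slope * contig + intercept.

    Each seed is weighted by its length (an assumption of this sketch).
    """
    w_sum = sum(m[2] for m in matches)
    x_mean = sum(m[0] * m[2] for m in matches) / w_sum
    y_mean = sum(m[1] * m[2] for m in matches) / w_sum
    sxx = sum(m[2] * (m[0] - x_mean) ** 2 for m in matches)
    sxy = sum(m[2] * (m[0] - x_mean) * (m[1] - y_mean) for m in matches)
    slope = sxy / sxx if sxx > 0 else 1.0  # a single seed defines no slope
    return slope, y_mean - slope * x_mean


def locate_contig(matches: List[Match], contig_len: int,
                  max_offset: int = 50) -> Tuple[int, int]:
    """Estimate the reference start and end coordinates covered by a contig."""
    # 1. Find the dominant diagonal: weighted median of (ref - contig) offsets.
    offsets = [float(m[1] - m[0]) for m in matches]
    weights = [float(m[2]) for m in matches]
    center = weighted_median(offsets, weights)
    # 2. Keep only seeds close to that diagonal; the rest are treated as noise.
    kept = [m for m, d in zip(matches, offsets) if abs(d - center) <= max_offset]
    # 3. Fit the weighted regression line through the retained seeds.
    slope, intercept = weighted_fit(kept)
    # 4. Project contig positions 0 and contig_len onto the reference.
    return round(intercept), round(slope * contig_len + intercept)


if __name__ == "__main__":
    seeds = [(0, 1000, 100), (200, 1200, 150), (500, 1500, 120),
             (800, 1800, 200), (300, 9000, 20)]  # last seed is a spurious hit
    print(locate_contig(seeds, contig_len=1000))
```

Because no seed is extended, the cost is dominated by sorting the seeds. A fixed `max_offset` is the simplest possible filter. A contig with many insertions or deletions drifts away from a single diagonal, so it would need a looser tolerance or an iterative refit.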
This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression on the alignment seeds. Experiments showed that the gain in execution time can be substantial, with no loss of accuracy. The method should be of great utility to sequence assembly and genome comparison projects.