Tammi Martti T, Arner Erik, Britton Tom, Andersson Björn
Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Uppsala, Sweden.
Bioinformatics. 2002 Mar;18(3):379-88. doi: 10.1093/bioinformatics/18.3.379.
An increasingly important problem in genome sequencing is the failure of the commonly used shotgun assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions, or of regions containing repeats considerably shorter than the average read length, is in practice straightforward, while longer repeats have remained a difficult problem. Here we present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1%, while the sequencing error rate may be up to ten times higher. The method is based on the realization that comparing only a part of all overlapping sequences in a data set at a time does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions (DNPs), which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, in which the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can resolve repeats that differ in only two positions within a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.
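The idea of separating true repeat differences from sequencing errors by looking at combinations of columns in a multi-alignment can be illustrated with a small Python sketch. This is not the authors' statistic, which the abstract does not specify. It is a simplified stand-in that assumes independent errors at a uniform per-base rate and uses a binomial tail probability. The function names (`binom_tail`, `deviating_reads`, `candidate_dnp_pairs`), the default `error_rate` of 0.1, the threshold `alpha`, and the toy reads are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb


def binom_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def deviating_reads(alignment, col):
    """Return the indices of reads that differ from the column's most common base."""
    column = [read[col] for read in alignment]
    consensus = Counter(column).most_common(1)[0][0]
    return {i for i, base in enumerate(column) if base != consensus}


def candidate_dnp_pairs(alignment, error_rate=0.1, alpha=1e-3):
    """Flag column pairs whose shared deviations are unlikely to be errors alone.

    Assumption (not from the paper): errors are independent, so a read
    deviates at two given columns by chance with probability error_rate**2.
    """
    n_reads = len(alignment)
    length = len(alignment[0])
    deviating = {c: deviating_reads(alignment, c) for c in range(length)}
    hits = []
    for a, b in combinations(range(length), 2):
        shared = deviating[a] & deviating[b]
        if len(shared) < 2:
            continue  # a single read deviating twice is weak evidence
        p_value = binom_tail(n_reads, len(shared), error_rate**2)
        if p_value < alpha:
            hits.append((a, b, sorted(shared), p_value))
    return hits


# Toy multi-alignment: five reads from one repeat copy (one with an
# isolated error at column 4) and three reads from a second copy that
# differs at columns 2 and 7.
alignment = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTCCGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACTTACGAAC",
    "ACTTACGAAC",
    "ACTTACGAAC",
]

for a, b, reads, p_value in candidate_dnp_pairs(alignment):
    print(f"columns {a} and {b}: reads {reads}, p = {p_value:.2e}")
```

In this toy input the isolated error at column 4 is never flagged, because no other column deviates in the same read, whereas columns 2 and 7 deviate together in the same three reads. Testing pairs of columns rather than single columns mirrors the abstract's point that two differing positions within a read length are the minimum needed to separate repeats.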