Department of Evolutionary Genetics, Albert-Ludwigs University, Freiburg, Germany.
Bioinformatics. 2011 Feb 15;27(4):449-55. doi: 10.1093/bioinformatics/btq689. Epub 2010 Dec 14.
Sequencing capacity is currently growing more rapidly than CPU speed, leading to an analysis bottleneck in many genome projects. Alignment-free sequence analysis methods tend to be more efficient than their alignment-based counterparts. They may, therefore, be important in the long run for keeping sequence analysis abreast of sequencing.
We derive and implement an alignment-free estimator of the number of pairwise mismatches. Our implementation of this estimator, pim, is based on an enhanced suffix array and inherits the superior time and memory efficiency of this data structure. Simulations demonstrate that the estimator is accurate if mutations are distributed randomly along the chromosome. While real data often deviate from this ideal, the estimator remains useful for identifying regions of low genetic diversity using a sliding window approach. We demonstrate this by applying it to the complete genomes of 37 strains of Drosophila melanogaster, and to the genomes of two closely related Drosophila species, D. simulans and D. sechellia. In both cases, we detect the diversity minimum and discuss its biological implications.
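The sliding window scan described in the abstract can be illustrated with a minimal sketch. It is not the alignment-free estimator and not the pim program: it counts mismatches directly between two sequences that are assumed to be already aligned and of equal length, purely to show how a windowed profile exposes a diversity minimum. The function names, window size and toy sequences are hypothetical.

```python
from typing import List, Tuple


def window_mismatches(seq_a: str, seq_b: str,
                      window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, mismatches per site) for each full window.

    Assumption: seq_a and seq_b are pre-aligned and of equal length.
    This is an alignment-based illustration, not the pim estimator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    profile = []
    # Slide over every window that fits entirely inside the sequences.
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        # Count positions at which the two sequences differ.
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        profile.append((start, mismatches / window))
    return profile


def diversity_minimum(profile: List[Tuple[int, float]]) -> Tuple[int, float]:
    """Return the first window with the lowest mismatches per site."""
    if not profile:
        raise ValueError("profile is empty")
    return min(profile, key=lambda item: item[1])


# Toy data (hypothetical): three non-overlapping windows of four sites.
seq_a = "ACGTACGTACGT"
seq_b = "TCGAACGTACTT"
profile = window_mismatches(seq_a, seq_b, window=4, step=4)
lowest = diversity_minimum(profile)
```

On the toy data the middle window carries no mismatches, so it is reported as the diversity minimum. A real analysis would use much larger windows and, following the abstract, replace the direct mismatch count with the alignment-free estimate.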