Michael C. Wendl, W. Brad Barbazuk
Genome Sequencing Center, Washington University, St. Louis, MO 63108, USA.
BMC Bioinformatics. 2005 Oct 10;6:245. doi: 10.1186/1471-2105-6-245.
The degree to which conventional DNA sequencing techniques will be successful for highly repetitive genomes is unclear. Investigators are therefore considering various filtering methods to select against high-copy sequence in DNA clone libraries. The standard model for random sequencing, Lander-Waterman theory, does not account for two important issues in such libraries: discontinuities and position-based sampling biases (the so-called "edge effect"). We report an extension of the theory for analyzing such configurations.
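For context, the "standard theory" invoked here is the classical Lander-Waterman model. Its two best-known expectations can be sketched as follows, in conventional notation (N reads of length L on a contiguous target of size G); these symbols and formulas are supplied for orientation and are not taken from this abstract:

% Classical Lander--Waterman expectations, assuming no minimum
% detectable overlap between reads (overlap threshold theta = 0):
\[
  \rho = \frac{NL}{G}, \qquad
  \mathbb{E}[\text{covered fraction}] = 1 - e^{-\rho}, \qquad
  \mathbb{E}[\text{islands}] = N\,e^{-\rho},
\]

where \rho is the sequence redundancy (fold coverage). The extension reported here adjusts such expectations for targets broken into discontinuous islands, where a read cannot span a gap.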
The edge effect cannot be neglected in most cases. Specifically, rates of coverage and gap reduction are appreciably lower than standard theory predicts for conventional libraries. Performance decreases as read length increases relative to island size; although this is the opposite of what happens in a conventional library, the apparent paradox is readily explained by the edge effect. The model agrees well with prototype gene-tagging experiments for Zea mays and Sorghum bicolor. Moreover, the associated density function suggests well-defined probabilistic milestones for the number of reads required to capture a given fraction of the gene space. An exception, in which standard theory remains applicable, arises when sequence redundancy is less than about 1-fold: there, evolution of the random quantities is essentially independent of library gaps and edge effects. This observation effectively validates the practice of using standard theory to estimate the genic enrichment of a library from light shotgun sequencing.
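The qualitative effect can be reproduced with a minimal Monte Carlo sketch (Python). The island layout, read length, and read count below are arbitrary toy parameters, not the paper's data or model; the sketch only illustrates why confining reads to islands depresses coverage at fixed redundancy:

import random

def covered_fraction(island_sizes, read_len, n_reads, seed=0):
    """Monte Carlo coverage estimate when each read of length
    read_len must lie wholly within a single island."""
    rng = random.Random(seed)
    # An island of size s admits s - read_len + 1 start positions;
    # islands shorter than read_len admit no reads at all, yet still
    # count toward the denominator (one facet of the edge effect).
    usable = [s for s in island_sizes if s >= read_len]
    weights = [s - read_len + 1 for s in usable]
    masks = [bytearray(s) for s in usable]
    for _ in range(n_reads):
        i = rng.choices(range(len(usable)), weights=weights)[0]
        start = rng.randrange(weights[i])
        for p in range(start, start + read_len):
            masks[i][p] = 1
    return sum(sum(m) for m in masks) / sum(island_sizes)

# Equal total target size and redundancy (~1.5-fold), contiguous
# versus fragmented into 500 short islands:
G, L, N = 1_000_000, 600, 2_500
print(covered_fraction([G], L, N))            # conventional library
print(covered_fraction([2_000] * 500, L, N))  # filtered library

At these settings the contiguous case lands near the Lander-Waterman value 1 - e^{-1.5} ≈ 0.78, while the fragmented case falls noticeably below it; the deficit comes from under-sampled positions near island edges, which fewer valid read placements can reach.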
Coverage performance for a filtered library is significantly lower than that for an equivalently sized conventional library, suggesting that directed methods may be more critical in the former case. The proposed model should be useful for analyzing future projects.