Wendl Michael C, Wilson Richard K
The Genome Center and Department of Genetics, Washington University, St Louis, MO 63108, USA.
BMC Genomics. 2009 Aug 5;10:359. doi: 10.1186/1471-2164-10-359.
Structural variations in the form of DNA insertions and deletions are an important aspect of human genetics and especially relevant to medical disorders. Investigations have shown that such events can be detected via tell-tale discrepancies in the aligned lengths of paired-end DNA sequencing reads. Quantitative aspects underlying this method remain poorly understood, despite its importance and conceptual simplicity. We report the statistical theory characterizing the length-discrepancy scheme for Gaussian libraries, including coverage-related effects that preceding models are unable to account for.
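The length-discrepancy idea can be illustrated with a minimal sketch. This is a plain two-sided z-test on the mean mapped span of the read pairs covering one locus, not the statistical theory reported in the paper. It assumes a Gaussian library with known mean `mu` and standard deviation `sigma`; the function name, parameter names, and numerical values are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def length_discrepancy_test(mapped_spans, mu, sigma, alpha=0.01):
    """Two-sided z-test on the mean mapped span of read pairs at one locus.

    mapped_spans: distances (bp) between paired ends as aligned to the reference.
    mu, sigma:    assumed mean and standard deviation of library insert length.
    alpha:        accepted false-positive risk.
    Returns (z, p_value, call).
    """
    n = len(mapped_spans)
    if n == 0:
        raise ValueError("at least one spanning read pair is required")
    # Under the null hypothesis (no variant) the sample mean is N(mu, sigma^2 / n).
    z = (fmean(mapped_spans) - mu) / (sigma / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        call = "none"
    elif z > 0:
        # Ends map farther apart than expected: the sample lacks reference sequence.
        call = "deletion"
    else:
        # Ends map closer together than expected: the sample carries extra sequence.
        call = "insertion"
    return z, p_value, call


if __name__ == "__main__":
    # Hypothetical library: 3000 bp inserts with a 300 bp standard deviation.
    print(length_discrepancy_test([3400, 3500, 3450, 3550], mu=3000, sigma=300))
```

The number of pairs spanning the locus enters through the `sigma / sqrt(n)` term, which is one simple way to see why detection depends on physical coverage.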
Deletion and insertion statistics both depend heavily on physical coverage, but otherwise differ dramatically, refuting a commonly held doctrine of symmetry. Specifically, coverage restrictions render insertions much more difficult to capture. Increased read length has the counterintuitive effect of worsening insertion detection characteristics of short inserts. Variance in library insert length is also a critical factor here and should be minimized to the greatest degree possible. Conversely, no significant improvement would be realized in lowering fosmid variances beyond current levels. Detection power is examined under a straightforward alternative hypothesis and found to be generally acceptable. We also consider the proposition of characterizing variation over the entire spectrum of variant sizes under constant risk of false-positive errors. At 1% risk, many designs will leave a significant gap in the 100 to 200 bp neighborhood, requiring unacceptably high redundancies to compensate. We show that a few modifications largely close this gap and we give a few examples of feasible spectrum-covering designs.
The theory resolves several outstanding issues and furnishes a general methodology for designing future projects from the standpoint of a spectrum-wide constant risk.
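The trade-off between library variance, redundancy, and detectable variant size can be sketched with the textbook power formula for a two-sided z-test. This is a simplification under stated assumptions, not the design methodology of the paper: it ignores the coverage restrictions on insertions, and the function names, the 95% power target, and the library parameters are hypothetical.

```python
import math
from statistics import NormalDist


def min_detectable_shift(n_pairs, sigma, alpha=0.01, power=0.95):
    """Smallest mean shift (bp) detectable with n_pairs spanning read pairs."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma / math.sqrt(n_pairs)


def pairs_needed(shift, sigma, alpha=0.01, power=0.95):
    """Number of spanning read pairs needed to detect a given mean shift (bp)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / shift) ** 2)


if __name__ == "__main__":
    # Hypothetical comparison: the same 100 bp variant, wide versus tight library.
    for sigma in (300, 30):
        print(sigma, pairs_needed(100, sigma))
```

Under these assumptions, a 100 bp variant needs on the order of 160 spanning pairs when the library standard deviation is 300 bp, and only a handful when it is 30 bp. The required redundancy grows with the square of `sigma / shift`, which is consistent with the emphasis above on minimizing variance for small variants.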