Churchill G A, Waterman M S
Biometrics Unit, Cornell University, Ithaca, New York 14853.
Genomics. 1992 Sep;14(1):89-98. doi: 10.1016/s0888-7543(05)80288-5.
In this paper we describe a method for the statistical reconstruction of a large DNA sequence from a set of sequenced fragments. We assume that the fragments have been assembled and address the problem of determining the degree to which the reconstructed sequence is free from errors, i.e., its accuracy. A consensus distribution is derived from the assembled fragment configuration based upon the rates of sequencing errors in the individual fragments. The consensus distribution can be used to find a minimally redundant consensus sequence that meets a prespecified confidence level, either base by base or across any region of the sequence. A likelihood-based procedure for the estimation of the sequencing error rates, which utilizes an iterative EM algorithm, is described. Prior knowledge of the error rates is easily incorporated into the estimation procedure. The methods are applied to a set of assembled sequence fragments from the human G6PD locus. We close the paper with a brief discussion of the relevance and practical implications of this work.
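The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes substitution errors only, a single error rate shared by all fragments, a uniform prior over the four bases, and independent aligned columns. The function names (`column_posterior`, `minimal_call`, `estimate_error_rate`) are hypothetical. The sketch computes a per-column consensus distribution, selects the smallest set of bases that reaches a chosen confidence level, and estimates the error rate with a simple EM iteration.

```python
from collections import Counter

BASES = "ACGT"

def column_posterior(column, error_rate, prior=None):
    """Posterior over the true base given the reads in one aligned column.

    Assumed model: each read equals the true base with probability
    1 - error_rate, and each of the three other bases with error_rate / 3.
    """
    prior = prior or {b: 0.25 for b in BASES}
    weights = {}
    for b in BASES:
        w = prior[b]
        for x in column:
            w *= (1.0 - error_rate) if x == b else error_rate / 3.0
        weights[b] = w
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def minimal_call(posterior, confidence):
    """Smallest set of bases whose combined posterior mass reaches `confidence`."""
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        mass += p
        if mass >= confidence:
            break
    return "".join(sorted(chosen)), mass

def estimate_error_rate(columns, init=0.05, iterations=50):
    """EM estimate of the shared substitution error rate.

    E-step: posterior over the true base in every column.
    M-step: expected number of miscalled reads divided by total reads.
    """
    e = init
    total_reads = sum(len(c) for c in columns)
    for _ in range(iterations):
        expected_errors = 0.0
        for col in columns:
            post = column_posterior(col, e)
            counts = Counter(col)
            for b in BASES:
                # If b is the true base, every read that is not b is an error.
                expected_errors += post[b] * (len(col) - counts[b])
        # Keep the rate strictly positive so later posteriors stay well defined.
        e = max(expected_errors / total_reads, 1e-12)
    return e

if __name__ == "__main__":
    columns = ["AAAA"] * 9 + ["AAAC"]
    rate = estimate_error_rate(columns)
    print("estimated error rate:", rate)
    print("call for 'AC' at 90%:", minimal_call(column_posterior("AC", 0.1), 0.9))
```

In this toy setting a column where two reads disagree yields a two-base ambiguous call at the 90% level, while a column of four agreeing reads is called as a single base. Under the independence assumption, a region-wide confidence could be obtained by multiplying the per-column masses. Prior knowledge of the error rate, which the abstract says is easily incorporated, would enter the M-step as pseudo-counts; this is omitted here.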