States D J, Botstein D
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.
Proc Natl Acad Sci U S A. 1991 Jul 1;88(13):5518-22. doi: 10.1073/pnas.88.13.5518.
Molecular sequences, like all experimental data, have finite error rates. The impact of errors on the information content of molecular sequence data depends on the analytic paradigm used to interpret the data. We studied the impact of nucleic acid sequence errors on the ability to align predicted amino acid sequences with the sequences of related proteins. We found that, with a simultaneous translation and alignment algorithm, identification of sequence homologies is resilient to the introduction of random errors. Proteins with greater than 30% sequence identity can be reliably recognized even in the presence of 1% frameshift (insertion or deletion) error rates and 5% base substitution rates. Incorporating prior knowledge about the location and characteristics of errors improves the error tolerance of amino acid sequence alignments. Similarly, including prior knowledge of biased codon utilization by yeast (Saccharomyces cerevisiae) allows reliable detection of correct reading frames in yeast sequences even in the presence of 5% substitution and 1% frameshift errors.
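The error rates quoted in the abstract (5% base substitution, 1% frameshift) can be made concrete with a small simulation. The Python sketch below is an illustration only and is not the authors' procedure: the function name `introduce_errors`, the even split of frameshifts between insertions and deletions, and the assumption that errors strike each base independently are all hypothetical choices made for this example.

```python
import random

BASES = "ACGT"

def introduce_errors(seq, sub_rate=0.05, indel_rate=0.01, rng=None):
    """Return a copy of seq with random substitutions and single-base indels.

    Assumed error model (not taken from the paper): each base is independently
    deleted or preceded by an inserted base with total probability indel_rate,
    and otherwise substituted with probability sub_rate.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < indel_rate / 2:
            # Deletion: drop this base, shifting the reading frame by -1.
            continue
        if r < indel_rate:
            # Insertion: emit a random extra base, shifting the frame by +1.
            out.append(rng.choice(BASES))
            out.append(base)
            continue
        if rng.random() < sub_rate:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    original = "ATGGCTAAAGGTCCGTTAGCA" * 5
    corrupted = introduce_errors(original, rng=rng)
    print(len(original), len(corrupted))
    print(corrupted)
```

A corrupted sequence produced this way could then be passed to any frameshift-tolerant translation and alignment tool to observe how recognition degrades as the rates increase; the sketch itself only generates the errors and makes no claim about alignment performance.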