Department of Biochemistry, The University of Western Ontario, London, Canada.
PLoS One. 2010 Jun 28;5(6):e11082. doi: 10.1371/journal.pone.0011082.
BACKGROUND: There is currently no way to verify the quality of a multiple sequence alignment that is independent of the assumptions used to build it. Sequence alignments are typically evaluated by a number of established criteria: sequence conservation, the number of aligned residues, the frequency of gaps, and the likely correctness of gap placement. Covariation analysis is used to find putatively important residue pairs in a sequence alignment. Different alignments of the same protein family give different results, demonstrating that covariation depends on the quality of the sequence alignment. We thus hypothesized that current criteria are insufficient to build alignments for use with covariation analyses.
METHODOLOGY/PRINCIPAL FINDINGS: We show that current criteria are insufficient to build alignments for use with covariation analyses, as systematic sequence alignment errors are present even in hand-curated, structure-based alignment datasets such as those from the Conserved Domain Database. We show that current non-parametric covariation statistics are sensitive to sequence misalignments and that this sensitivity can be used to identify systematic alignment errors. We demonstrate that removing alignment errors due to 1) improper structure alignment, 2) the presence of paralogous sequences, and 3) partial or otherwise erroneous sequences improves contact prediction by covariation analysis. Finally, we describe two non-parametric covariation statistics that are less sensitive to sequence alignment errors than those described previously in the literature.
CONCLUSIONS/SIGNIFICANCE: Protein alignments with errors lead to false positive and false negative conclusions (incorrect assignment of covariation and conservation, respectively). Covariation analysis can provide a verification step, independent of traditional criteria, to identify systematic misalignments in protein alignments. Two non-parametric statistics are shown to be somewhat insensitive to misalignment errors, providing increased confidence in contact prediction when analyzing alignments with erroneous regions, because they emphasize pairwise covariation over group covariation.
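As a rough illustration of why a misaligned sequence can create a false covariation signal, the sketch below scores pairs of alignment columns with plain mutual information. Mutual information is used here only as a familiar stand-in; the abstract does not name the statistics the authors evaluated or proposed. The toy alignment, the function names, and the treatment of a gap as an ordinary symbol are all assumptions made for the example.

```python
import math
from collections import Counter


def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Gaps are counted as ordinary symbols, which is a simplification
    made for this sketch.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: columns 0 and 2 are fully conserved,
# columns 1 and 3 change together (K with E, R with D).
aligned = ["AKLE", "AKLE", "ARLD", "ARLD"]

# Same sequences, but the last one is shifted right by one position,
# mimicking a single misaligned sequence.
misaligned = ["AKLE", "AKLE", "ARLD", "-ARL"]

mi_conserved_ok = mutual_information(column(aligned, 0), column(aligned, 2))
mi_covarying_ok = mutual_information(column(aligned, 1), column(aligned, 3))
mi_conserved_bad = mutual_information(column(misaligned, 0), column(misaligned, 2))

print(mi_conserved_ok)   # 0.0: conserved columns carry no covariation
print(mi_covarying_ok)   # 1.0: the two columns vary together
print(mi_conserved_bad)  # greater than zero: signal created by the shift alone
```

In the correctly aligned toy data the conserved pair scores zero, while the single shifted sequence gives the same pair a non-zero score. This mirrors the false positive covariation and lost conservation described in the conclusions, though real alignments and the authors' statistics are considerably more involved.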