Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom.
Proc Natl Acad Sci U S A. 2018 Jan 23;115(4):690-695. doi: 10.1073/pnas.1711913115. Epub 2018 Jan 8.
Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters: the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence-distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with the largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
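The truncation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the abstract does not specify how the alignment is encoded, how many eigenvectors are removed, or how down-weighting is done, so the function name `truncate_top_eigenmodes`, the parameter `k`, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np

def truncate_top_eigenmodes(cov, k):
    """Return a copy of the symmetric matrix `cov` with its k
    largest-eigenvalue modes removed (hypothetical helper)."""
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    if not 0 <= k <= evals.size:
        raise ValueError("k must lie between 0 and the matrix dimension")
    keep = evals.size - k               # number of modes retained
    # Rebuild the matrix from the retained modes only: V diag(w) V^T.
    return (evecs[:, :keep] * evals[:keep]) @ evecs[:, :keep].T

# Synthetic stand-in for an alignment (assumption, not real sequence data):
# a component shared by all positions creates one dominant global mode,
# loosely mimicking a strong non-structural signal.
rng = np.random.default_rng(0)
n_seq, n_pos = 200, 30
shared = rng.normal(size=(n_seq, 1))
X = rng.normal(size=(n_seq, n_pos)) + 2.0 * shared

cov = np.cov(X, rowvar=False)           # n_pos x n_pos covariance matrix
cleaned = truncate_top_eigenmodes(cov, k=1)

top_before = np.linalg.eigvalsh(cov).max()
top_after = np.linalg.eigvalsh(cleaned).max()
print(f"largest eigenvalue before: {top_before:.2f}, after: {top_after:.2f}")
```

In this toy setting the dominant mode is removed while the remaining spectrum is left untouched. Choosing `k` for a real alignment would presumably be guided by where the power law tail ends, which the abstract does not quantify.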