University of Havana, Physics Faculty, Department of Theoretical Physics, Group of Complex Systems and Statistical Physics, Havana, Cuba.
Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative - LCQB, Paris, France.
PLoS Comput Biol. 2021 May 24;17(5):e1008957. doi: 10.1371/journal.pcbi.1008957. eCollection 2021 May.
Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings.
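As a rough illustration of the randomization idea, the sketch below permutes residues independently within each column of a toy alignment. This is only the simplest baseline and is not presented as the authors' procedure: it preserves per-column conservation and destroys couplings between columns, but it also destroys phylogenetic relations, which the strategies described in the abstract are designed to keep. The names `shuffle_columns`, `column_counts`, and `toy_msa` are hypothetical, and NumPy is assumed.

```python
import numpy as np


def shuffle_columns(msa, rng):
    """Permute residues independently within each alignment column.

    Assumption: `msa` is a 2D array of single characters (sequences x positions).
    Column-wise residue counts are unchanged; correlations between columns
    (and between sequences, i.e. phylogeny) are destroyed.
    """
    msa = np.asarray(msa)
    shuffled = np.empty_like(msa)
    for j in range(msa.shape[1]):
        shuffled[:, j] = rng.permutation(msa[:, j])
    return shuffled


def column_counts(msa, j):
    """Return residue counts for column j as a plain dict."""
    symbols, counts = np.unique(msa[:, j], return_counts=True)
    return dict(zip(symbols.tolist(), counts.tolist()))


# Toy alignment, invented purely for illustration (not real protein data).
toy_msa = np.array([list(s) for s in ["ACDA", "ACDA", "GCEA", "GTEK", "GTEK"]])
rng = np.random.default_rng(0)
resampled = shuffle_columns(toy_msa, rng)

# Conservation check: every column keeps the same residue counts.
for j in range(toy_msa.shape[1]):
    assert column_counts(toy_msa, j) == column_counts(resampled, j)
```

Running a coupling analysis on such shuffled data would show the background expected from conservation and finite sampling alone. Separating out the phylogenetic contribution requires a resampling scheme that also keeps the relations between sequences, as the abstract describes.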