The Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA.
The Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA.
Cold Spring Harb Perspect Biol. 2024 Apr 1;16(4):a041463. doi: 10.1101/cshperspect.a041463.
Homologous protein sequences are wonderfully diverse, indicating many possible evolutionary "solutions" to the encoding of function. Consequently, one can construct statistical models of protein sequence by analyzing amino acid frequency across a large multiple sequence alignment. A central premise is that covariance between amino acid positions reflects coevolution due to a shared functional or biophysical constraint. In this review, we describe the implementation and discuss the advantages, limitations, and recent progress on two coevolution-based modeling approaches: (1) Potts models of protein sequence (direct coupling analysis [DCA]-like), and (2) the statistical coupling analysis (SCA). Each approach detects interesting features of protein sequence and structure: the former emphasizes local physical contacts throughout the structure, while the latter identifies larger evolutionarily coupled networks of residues. Recent advances in large-scale gene synthesis and high-throughput functional selection now motivate additional work to benchmark model performance across quantitative function prediction and de novo design tasks.
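The central premise above, that covariance between alignment positions signals a shared constraint, can be illustrated with a minimal sketch. It computes single-site frequencies f_i(a), pairwise frequencies f_ij(a,b), and the raw covariance f_ij(a,b) - f_i(a) f_j(b) from a toy alignment. This is an illustration only, and it is neither the DCA-like Potts inference nor the SCA computation discussed in the review, both of which apply further processing to statistics of this kind. The function names, the 21-letter alphabet, the Frobenius-norm score, and the toy sequences are all assumptions made for the example; sequence reweighting and pseudocounts are omitted.

```python
import numpy as np

# Assumed alphabet: 20 amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_msa(seqs):
    """Encode an alignment as a (n_seqs, n_positions, n_states) binary array."""
    index = {aa: k for k, aa in enumerate(ALPHABET)}
    x = np.zeros((len(seqs), len(seqs[0]), len(ALPHABET)))
    for s, seq in enumerate(seqs):
        for i, aa in enumerate(seq):
            x[s, i, index[aa]] = 1.0
    return x


def frequencies_and_covariance(seqs):
    """Return f_i(a) and the covariance f_ij(a,b) - f_i(a) * f_j(b)."""
    x = one_hot_msa(seqs)
    n_seqs = x.shape[0]
    f_i = x.mean(axis=0)                               # shape (L, q)
    f_ij = np.einsum("sia,sjb->ijab", x, x) / n_seqs   # shape (L, L, q, q)
    cov = f_ij - f_i[:, None, :, None] * f_i[None, :, None, :]
    return f_i, cov


def coupling_scores(cov):
    """Collapse each (q x q) covariance block to one score per position pair."""
    return np.sqrt((cov ** 2).sum(axis=(2, 3)))


# Toy alignment (hypothetical): positions 0 and 1 co-vary perfectly
# (A pairs with K, G pairs with E); position 2 varies independently.
toy_msa = ["AKL", "AKV", "GEL", "GEV"]
f_i, cov = frequencies_and_covariance(toy_msa)
scores = coupling_scores(cov)
print(scores[0, 1], scores[0, 2])
```

In the toy alignment, the score for the co-varying pair (0, 1) is nonzero while the score for the independent pair (0, 2) vanishes. A real alignment would need many more sequences before such a contrast could be read as evidence of coevolution.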