Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States.
Langmuir. 2017 Oct 24;33(42):11511-11517. doi: 10.1021/acs.langmuir.7b02438. Epub 2017 Sep 12.
The ability to intervene in biological pathways has for decades been limited by the lack of a quantitative description of protein-protein interactions (PPIs). Herein we generate and compare millions of simple PPI models for insight into the mechanisms of specific recognition and binding. We use a coarse-grained approach whereby amino acids are counted in the interface, and these counts are used as binding affinity predictors. We perform lasso regression, a modern regression technique aimed at interpretability, with every possible amino acid combination (over 10^6 unique feature sets) to select only those amino acid predictors that provide more information than noise. This approach circumvents arbitrary binning and assumptions about the binding environment that obscure other binding affinity models. Aggregated analysis of these models trained at various interfacial cutoff distances informs the roles of specific amino acids in different binding contexts. We find that a simple amino acid count model outperforms detailed intermolecular contact and binned residue type models. We identify the prevalence of serine, glycine, and tryptophan in the interface as particularly important for predicting binding affinity across a range of distance cutoffs. Although current sample size limitations prevent a robust consensus model for binding affinity prediction, our approach underscores the relevance of a residue-based description of the protein-protein interface to increase our understanding of specific interactions.
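The counting-plus-lasso idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`count_features`, `feature_sets`, `lasso_fit`), the penalty strength `lam`, the plain coordinate-descent solver, and the toy data in the demo are all assumptions made for illustration. The abstract does not specify the solver, the regularization settings, or how interface residues are extracted at a given cutoff distance. Enumerating all non-empty subsets of the 20 standard amino acids gives 2^20 - 1 = 1,048,575 candidate feature sets.

```python
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 standard residues


def count_features(interface_seq, residues=AMINO_ACIDS):
    """Count occurrences of each chosen residue type among interface residues."""
    counts = Counter(interface_seq)
    return [float(counts[aa]) for aa in residues]


def feature_sets(residues=AMINO_ACIDS):
    """Yield every non-empty subset of residue types (2**n - 1 subsets)."""
    for k in range(1, len(residues) + 1):
        for combo in combinations(residues, k):
            yield "".join(combo)


def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + lam*||w||_1.

    Returns (b0, w); the intercept b0 is not penalized.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]  # current residuals y - b0 - Xw
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # residue never observed; weight stays at zero
            # Correlation of feature j with the residual that excludes it.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, lam) / z
            resid = [r - c * (new_w - w[j]) for c, r in zip(col, resid)]
            w[j] = new_w
        shift = sum(resid) / n  # re-center the unpenalized intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, w


if __name__ == "__main__":
    # Toy, made-up interfaces and affinities, purely to show the call pattern.
    interfaces = ["SGWWA", "SSGAL", "WGGKE", "AALKE", "SWWWG"]
    affinities = [-9.1, -7.4, -8.0, -6.2, -10.3]
    subset = "SGW"  # one of the 2**20 - 1 possible feature sets
    X = [count_features(seq, subset) for seq in interfaces]
    b0, w = lasso_fit(X, affinities, lam=0.1)
    print(b0, dict(zip(subset, w)))
```

Under these assumptions, a residue type whose weight is driven exactly to zero by the penalty is one that the model treats as contributing no more information than noise. Repeating the fit for each subset from `feature_sets()` and for each interface cutoff distance would correspond to the aggregated analysis the abstract describes.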