Toward Compilation of Balanced Protein Stability Data Sets: Flattening the ΔΔG Curve through Systematic Enrichment.

Author information

Kebabci Narod, Timucin Ahmet Can, Timucin Emel

Affiliations

Department of Biostatistics and Bioinformatics, Institute of Health Sciences, Acibadem University, Istanbul 34752, Turkey.

Department of Molecular Biology and Genetics, Faculty of Arts and Sciences, Acibadem University, Istanbul 34752, Turkey.

Publication information

J Chem Inf Model. 2022 Mar 14;62(5):1345-1355. doi: 10.1021/acs.jcim.2c00054. Epub 2022 Feb 24.

Abstract

Often studies analyzing stability data sets and/or predictors ignore neutral mutations and use a binary classification scheme labeling only destabilizing and stabilizing mutations. Recognizing that highly concentrated neutral mutations interfere with data set quality, we have explored three protein stability data sets: S2648, PON-tstab, and the symmetric Ssym that differ in size and quality. A characteristic leptokurtic shape in the ΔΔG distributions of all three data sets including the curated and symmetric ones was reported due to concentrated neutral mutations. To further investigate the impact of neutral mutations on ΔΔG predictions, we have comprehensively assessed the performance of 11 predictors on the PON-tstab data set. Correlation and error analyses showed that all of the predictors performed the best on the neutral mutations, while their performance became gradually worse as the ΔΔG of the mutations departed further from the neutral zone regardless of the direction, implying a bias toward dense mutations. To this end, after unraveling the role of concentrated neutral mutations in biases of stability data sets, we described a systematic enrichment approach to balance the ΔΔG distributions. Before enrichment, mutations were clustered based on their biochemical and/or structural features, and then three mutations were selected from every 2 kcal/mol of each cluster. Upon implementation of this approach by distinct clustering schemes, we generated five subsets varying in size and ΔΔG distributions. All subsets showed improved ΔΔG and frequency distributions. We ultimately reported that the errors toward enriched subsets were higher than those toward the parent data sets, confirming the enrichment of difficult-to-predict mutations in the subsets. In summary, we elaborated the prediction bias toward a concentrated neutral zone and also implemented a rational strategy to tackle this and other forms of biases. Ultimately, this study equipping us with an extended view of shortcomings of stability data sets is a step taken toward development of an unbiased predictor.
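
The enrichment step described in the abstract (cluster the mutations, then keep three mutations from every 2 kcal/mol window of each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: the function name `enrich`, the tuple layout `(mutation_id, cluster_label, ddg)`, window edges aligned at multiples of 2 kcal/mol, and uniform random choice of the three mutations within a window. The abstract does not state how windows are aligned or how the three mutations are picked, and cluster labels are taken as given here.

```python
import math
import random
from collections import defaultdict


def enrich(mutations, bin_width=2.0, per_bin=3, seed=0):
    """Keep at most `per_bin` mutations per (cluster, ddG window).

    mutations: iterable of (mutation_id, cluster_label, ddg) tuples,
               with ddg in kcal/mol (hypothetical layout).
    bin_width: width of each ddG window; 2 kcal/mol as in the abstract.
    per_bin:   mutations kept per window; 3 as in the abstract.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group mutations by cluster and by ddG window.
    # ASSUMPTION: windows are [k * bin_width, (k + 1) * bin_width).
    groups = defaultdict(list)
    for mut_id, cluster, ddg in mutations:
        window = math.floor(ddg / bin_width)
        groups[(cluster, window)].append((mut_id, cluster, ddg))

    # ASSUMPTION: selection inside a window is uniformly random.
    subset = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) <= per_bin:
            subset.extend(members)
        else:
            subset.extend(rng.sample(members, per_bin))
    return subset


# Toy data: a crowded neutral window and sparse tails (made-up values).
toy = (
    [(f"n{i}", "A", 0.1 * i) for i in range(10)]            # 10 in [0, 2)
    + [("d1", "A", 2.5), ("d2", "A", 3.9)]                  # 2 in [2, 4)
    + [(f"s{i}", "B", -0.2 - 0.3 * i) for i in range(5)]    # 5 in [-2, 0)
)
subset = enrich(toy)
```

Capping every window at the same count thins the dense neutral zone while leaving sparse, strongly stabilizing or destabilizing windows intact, which is the sense in which such a subset has a flatter ΔΔG distribution than its parent data set.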