Division of Medicinal Chemistry, Leiden/Amsterdam Center for Drug Research, Einsteinweg 55, 2333 CC Leiden, The Netherlands.
J Cheminform. 2013 Sep 24;5(1):42. doi: 10.1186/1758-2946-5-42.
While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants); in addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data, and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants.
The amino acid descriptor sets compared here show similar performance (<0.1 log units RMSE difference and <0.1 difference in MCC), while differences between individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last.
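The two figures of merit quoted above can be illustrated with a minimal sketch. It assumes RMSE is computed on log-unit activity values and that MCC is computed after splitting compounds into active and inactive at a cutoff. The function names and the cutoff of 6.0 log units are illustrative assumptions, not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length lists of values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def to_active(values, threshold):
    """Binarize activities: True means 'active' (value >= threshold)."""
    return [v >= threshold for v in values]


def mcc(actual, predicted):
    """Matthews correlation coefficient for two lists of binary labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the confusion
    # matrix is empty and the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative (made-up) activities in log units; the cutoff is hypothetical.
observed = [5.2, 6.8, 7.1, 4.9]
predicted = [5.5, 6.4, 6.9, 5.8]
CUTOFF = 6.0

print("RMSE:", round(rmse(observed, predicted), 3))
print("MCC:", mcc(to_active(observed, CUTOFF), to_active(predicted, CUTOFF)))
```

Because MCC depends only on which side of the cutoff each value falls, a model can have a noticeable RMSE and still a perfect MCC, which is one reason to report both.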
While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is, on average, surprisingly similar. Still, combining sets that describe complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, from both the ligand and the protein side.
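One way to picture what combining descriptor sets means is concatenation: each residue is looked up in every descriptor table and the values are joined into one feature vector. The sketch below assumes that scheme. The tables `SET_A` and `SET_B` and all numbers in them are hypothetical placeholders, not published Z-scale or ProtFP values, and the study may have combined sets differently.

```python
# Hypothetical per-residue descriptor tables. The numbers are placeholders
# for illustration only, NOT published Z-scale or ProtFP values.
SET_A = {
    "A": (0.1, -0.2, 0.3),
    "G": (0.4, 0.0, -0.1),
    "V": (-0.3, 0.2, 0.5),
}
SET_B = {
    "A": (1.0, 0.0),
    "G": (0.0, 1.0),
    "V": (0.5, 0.5),
}


def encode(sequence, tables):
    """Encode a residue sequence as one flat feature vector.

    For each residue, the descriptor values from every table in `tables`
    are appended in order, so combining sets simply lengthens the vector.
    """
    features = []
    for residue in sequence:
        for table in tables:
            if residue not in table:
                raise KeyError(f"no descriptor values for residue {residue!r}")
            features.extend(table[residue])
    return features


single = encode("AGV", [SET_A])           # 3 residues x 3 values
combined = encode("AGV", [SET_A, SET_B])  # 3 residues x (3 + 2) values
print(len(single), len(combined))
```

The combined vector carries everything the single-set vector does plus the second set's values, so any gain from combining depends on whether the added columns hold information the first set lacks.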