Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China.
National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China.
J Chem Inf Model. 2022 Nov 28;62(22):5485-5502. doi: 10.1021/acs.jcim.2c01149. Epub 2022 Oct 21.
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, the performance of all tested models decreased from Random-CV to Seq-CV to Pfam-CV, and none showed satisfactory generalization capacity. Our interpretability analysis suggested that the predictions of MLSFs on novel targets depended on buried solvent-accessible surface area (SASA)-related features of the complex structures, with greater binding affinities predicted for complexes with larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns that were shared only among structurally similar compounds in the same cluster, the random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious about the features learned by MLSFs.
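The difference between a random split and a cluster-based split can be illustrated with a minimal sketch. This is not the authors' Pfam-cluster procedure, which the abstract does not specify; it only shows the general idea of grouped cross-validation, in which every complex sharing a cluster label is kept in the same fold so that test targets are never seen during training. The cluster labels ("A" to "D"), the function names, and the greedy fold-balancing rule are all assumptions made for illustration.

```python
from collections import defaultdict


def grouped_folds(cluster_ids, n_folds):
    """Assign whole clusters to folds so that no cluster spans two folds.

    cluster_ids[i] is the (hypothetical) cluster label of complex i.
    Returns a list of n_folds lists of sample indices.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an illustrative choice): place the largest clusters
    # first, each into whichever fold currently holds the fewest samples.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Hypothetical cluster labels for eight complexes (placeholders, not real data).
clusters = ["A", "A", "A", "B", "B", "C", "C", "D"]
folds = grouped_folds(clusters, n_folds=3)

for train, test in train_test_splits(folds):
    train_clusters = {clusters[i] for i in train}
    test_clusters = {clusters[i] for i in test}
    # With a grouped split, no cluster appears on both sides.
    assert train_clusters.isdisjoint(test_clusters)
```

With a purely random split, complexes from the same cluster would typically land in both the training and test sets, which is the setting in which the abstract reports the best scores. Grouping by cluster removes that overlap.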