Department of Chemistry, New York University, New York, NY, 10003, USA.
NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai, 200062, China.
J Comput Aided Mol Des. 2019 Dec;33(12):1095-1105. doi: 10.1007/s10822-019-00247-3. Epub 2019 Nov 15.
Cathepsin S (CatS), a member of the cysteine cathepsin protease family, has been well studied because of its significant role in many pathological processes, including arthritis, cancer and cardiovascular diseases. CatS inhibitors were included in D3R-GC3 for both docking pose prediction and affinity ranking, and in D3R-GC4 for binding affinity ranking. The difficulties posed by CatS inhibitors in D3R come mainly from three aspects: large size, high flexibility and similar chemical structures. We participated in GC4; our best submitted model, which employs a similarity-based alignment docking and Vina scoring protocol, yielded a Kendall's τ of 0.23 for 459 binders in GC4. In further explorations with machine learning, we curated a CatS-specific training set and adopted a similarity-based constrained docking method together with an arm-based fragmentation strategy that describes large inhibitors in a locality-sensitive fashion; with these, our best structure-based ranking protocol achieves a Kendall's τ of 0.52 for all binders in GC4. This exploration demonstrates the importance of training data, docking approaches and fragmentation strategies in developing machine-learning protocols for inhibitor ranking.
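The ranking quality reported above is measured with Kendall's τ, which compares the ordering of predicted scores against the ordering of experimental affinities. The following is a minimal sketch of that metric, not the authors' evaluation code. It assumes the tie-corrected τ-b variant (the abstract does not state which variant was used), and the function name `kendall_tau_b` and the example values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(predicted, experimental):
    """Kendall's tau-b between two equally long sequences of numbers.

    Assumption: higher values mean stronger binding in both sequences.
    If one of them is a docking score where lower is better, negate it first.
    """
    if len(predicted) != len(experimental):
        raise ValueError("both sequences must have the same length")

    concordant = discordant = 0
    ties_pred_only = ties_exp_only = 0

    # Examine every unordered pair of compounds once.
    for (p1, e1), (p2, e2) in combinations(zip(predicted, experimental), 2):
        dp, de = p1 - p2, e1 - e2
        if dp == 0 and de == 0:
            continue  # tied in both rankings: contributes to neither term
        if dp == 0:
            ties_pred_only += 1
        elif de == 0:
            ties_exp_only += 1
        elif dp * de > 0:
            concordant += 1
        else:
            discordant += 1

    # Pairs not tied in the experimental ranking, and pairs not tied in the
    # predicted ranking, respectively.
    untied_exp = concordant + discordant + ties_pred_only
    untied_pred = concordant + discordant + ties_exp_only
    denominator = sqrt(untied_exp * untied_pred)
    if denominator == 0:
        return float("nan")  # undefined when one ranking is constant
    return (concordant - discordant) / denominator


# Hypothetical toy data: five compounds, one swapped pair.
toy_predicted = [1, 2, 3, 4, 5]
toy_experimental = [1, 3, 2, 4, 5]
toy_tau = kendall_tau_b(toy_predicted, toy_experimental)
```

In the toy data, nine of the ten compound pairs are ordered consistently and one is inverted, so τ is (9 - 1) / 10 = 0.8. A value of 1 means the predicted ranking matches the experimental one exactly, 0 means no association, and -1 means the ranking is fully reversed.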