Lanevskij Kiril, Didziapetris Remigijus, Sazonovas Andrius
VšĮ "Aukštieji algoritmai", A.Mickevičiaus 29, LT-08117, Vilnius, Lithuania.
ACD/Labs, Inc, 8 King Street East, Suite 107, M5C 1B5, Toronto, ON, Canada.
J Comput Aided Mol Des. 2022 Dec;36(12):837-849. doi: 10.1007/s10822-022-00483-0. Epub 2022 Oct 28.
In an earlier study (Didziapetris R & Lanevskij K (2016). J Comput Aided Mol Des. 30:1175-1188) we collected a database of publicly available hERG inhibition data for almost 6700 drug-like molecules and built a probabilistic Gradient Boosting classifier with a minimal set of physicochemical descriptors (log P, pKa, molecular size and topology parameters). This approach favored interpretability over statistical performance but still achieved an overall classification accuracy of 75%. In the current follow-up work we expanded the database (provided in Supplementary Information) to almost 9400 molecules and performed temporal validation of the model on a set of novel chemicals from recently published lead optimization projects. Validation results showed almost no performance degradation compared to the original study. Additionally, we rebuilt the model using the AFT (Accelerated Failure Time) learning objective in XGBoost, which accepts both the quantitative and the censored data often reported in protein inhibition studies. The new model achieved a similar level of accuracy in discerning hERG blockers from non-blockers at a 10 µM threshold, which can be regarded as close to the performance ceiling for methods aiming to describe only non-specific ligand interactions with hERG. Yet, this model outputs quantitative potency values (IC50) and is not tied to a particular classification cut-off. pIC50 from patch-clamp measurements can be predicted with R² ≈ 0.4 and MAE < 0.5, which enables ranking of ligands according to their expected potency levels. The employed approach can be valuable for quantitative modeling of various ADME and drug safety endpoints with a high prevalence of censored data.
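To illustrate how an AFT-type objective can use both exact and censored potency values, the following is a minimal sketch of a per-observation negative log-likelihood. It is not the authors' implementation. It assumes a normal error distribution on ln(IC50) and a fixed scale `sigma`; the function names `aft_nll` and `pic50_from_um` are hypothetical. Each measurement is given as an interval: an exact value has equal bounds, while a report such as "IC50 > 10 µM" has a lower bound of 10 and an infinite upper bound.

```python
import math


def _norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_nll(pred_ln_ic50, lower, upper, sigma=1.0):
    """Negative log-likelihood of one observation (assumed normal AFT model).

    pred_ln_ic50 : model prediction of ln(IC50)
    lower, upper : bounds on the measured IC50, in the same units;
                   lower == upper for an exact value,
                   upper == math.inf for a '>' (right-censored) value,
                   lower == 0 for a '<' (left-censored) value.
    """
    if lower == upper:
        # Exact measurement: use the density of the observed value.
        z = (math.log(lower) - pred_ln_ic50) / sigma
        return -math.log(_norm_pdf(z) / (sigma * lower))
    # Censored measurement: use the probability mass inside the interval.
    cdf_hi = 1.0 if math.isinf(upper) else _norm_cdf(
        (math.log(upper) - pred_ln_ic50) / sigma)
    cdf_lo = 0.0 if lower <= 0 else _norm_cdf(
        (math.log(lower) - pred_ln_ic50) / sigma)
    # Small floor avoids log(0) when the prediction is far outside the interval.
    return -math.log(max(cdf_hi - cdf_lo, 1e-12))


def pic50_from_um(ic50_um):
    """Convert IC50 in micromolar to pIC50 (negative log10 of molar IC50)."""
    return 6.0 - math.log10(ic50_um)


# A '> 10 µM' record penalizes a 1 µM prediction far more than a 30 µM one.
loss_consistent = aft_nll(math.log(30.0), 10.0, math.inf)
loss_inconsistent = aft_nll(math.log(1.0), 10.0, math.inf)
```

Under this convention the 10 µM blocker threshold corresponds to pIC50 = 5, so a quantitative prediction can be converted to a class label at any chosen cut-off after the fact.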