Department of Computer and Systems Sciences, Stockholm University, 164 07 Kista, Sweden.
Molecules. 2022 Dec 26;28(1):217. doi: 10.3390/molecules28010217.
Molecular structure-property modeling is an increasingly important tool for predicting compounds with desired properties, because drug discovery and development is expensive and resource-intensive and suffers from toxicity-related attrition in late phases. Interest in applying deep learning techniques has increased considerably in recent years. This investigation compares traditional approaches, based on physico-chemical descriptors and machine learning as well as autoencoder-generated descriptors, with two different descriptor-free, Simplified Molecular Input Line Entry System (SMILES)-based deep learning architectures of the Bidirectional Encoder Representations from Transformers (BERT) type, using the Mondrian aggregated conformal prediction method as the overarching framework. For the binary CATMoS non-toxic and very-toxic datasets, the results show that all methods perform equally well on the former, almost equally balanced, dataset, whereas on the latter dataset, with an 11-fold difference between the two classes, the MolBERT model based on a large pre-trained network performs somewhat better than the rest, with high efficiency for both classes (0.93-0.94) as well as high sensitivity, specificity and balanced accuracy (0.86-0.87). The descriptor-free, SMILES-based deep learning BERT architectures appear capable of producing well-balanced predictive models with defined applicability domains. This work also demonstrates that the class imbalance problem is handled gracefully through Mondrian conformal prediction, without over- and/or under-sampling, class weighting or cost-sensitive methods.
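The abstract credits Mondrian conformal prediction with handling class imbalance without resampling or class weighting. As a rough illustration of the idea, the sketch below implements a class-conditional (Mondrian) inductive conformal predictor around an arbitrary scikit-learn classifier; the random forest, the feature matrices and the 0.2 significance level are illustrative assumptions, not the aggregated conformal protocol, descriptors or BERT embeddings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def mondrian_icp(X_train, y_train, X_calib, y_calib, X_test, significance=0.2):
    """Class-conditional (Mondrian) inductive conformal prediction sketch."""
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, y_train)

    classes = clf.classes_
    col = {label: i for i, label in enumerate(classes)}

    # Nonconformity score: 1 - predicted probability of the true class.
    calib_proba = clf.predict_proba(X_calib)
    calib_cols = np.array([col[y] for y in y_calib])
    calib_scores = 1.0 - calib_proba[np.arange(len(y_calib)), calib_cols]

    test_proba = clf.predict_proba(X_test)
    p_values = np.zeros((len(X_test), len(classes)))

    for k, label in enumerate(classes):
        # Mondrian taxonomy: each class is calibrated only against calibration
        # examples of that same class, which preserves per-class validity even
        # under heavy class imbalance (e.g. the 11-fold skewed very-toxic set).
        scores_k = calib_scores[np.asarray(y_calib) == label]
        test_scores_k = 1.0 - test_proba[:, k]
        p_values[:, k] = (
            (scores_k[None, :] >= test_scores_k[:, None]).sum(axis=1) + 1
        ) / (len(scores_k) + 1)

    # Prediction set: every label whose p-value exceeds the significance level.
    prediction_sets = p_values > significance
    return p_values, prediction_sets
```

In this kind of framework, efficiency is commonly reported as the fraction of single-label prediction sets per class and validity as the fraction of sets containing the true label, which is why per-class calibration can keep the minority class well calibrated despite the imbalance.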