Liu Wei, Li Jianguo, Verma Chandra S, Lee Hwee Kuan
Bioinformatics Institute, Agency for Science, Technology and Research, 30 Biopolis Street, Singapore, 138671, Singapore.
Singapore Eye Research Institute, 20 College Rd, Singapore, 169856, Singapore.
J Cheminform. 2025 Aug 28;17(1):129. doi: 10.1186/s13321-025-01083-4.
Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein-protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application to permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended as a more rigorous test of generalization, yields substantially lower model generalizability than random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
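The two data-splitting strategies and the three prediction targets named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. Several details are assumptions made only for illustration: the permeability cutoff of -6.0 (in log10 units), the logistic form and `scale` parameter of the soft label, and the idea that each peptide already carries a precomputed scaffold key (for example a Bemis-Murcko scaffold string obtained from a cheminformatics toolkit). The function names are hypothetical.

```python
import math
import random
from collections import defaultdict

# Assumption: a log10 permeability cutoff of -6.0 separates "permeable" from
# "impermeable" peptides. The abstract does not state the cutoff that was used.
THRESHOLD = -6.0


def binary_label(log_perm, threshold=THRESHOLD):
    """Hard 0/1 target for binary classification."""
    return 1 if log_perm >= threshold else 0


def soft_label(log_perm, threshold=THRESHOLD, scale=0.5):
    """Soft target in (0, 1). The logistic form and `scale` are illustrative
    assumptions, not the definition used in the study."""
    return 1.0 / (1.0 + math.exp(-(log_perm - threshold) / scale))


def random_split(items, test_frac=0.2, seed=0):
    """Shuffle and cut: near-identical peptides can land on both sides."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def scaffold_split(items, scaffold_of, test_frac=0.2):
    """Group items by scaffold key and assign whole groups to one side,
    so no scaffold appears in both the training and the test set."""
    items = list(items)
    groups = defaultdict(list)
    for item in items:
        groups[scaffold_of(item)].append(item)
    # Largest groups are placed first; groups that no longer fit go to test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(items) - int(round(len(items) * test_frac))
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy records: (peptide id, scaffold key, log10 permeability). Invented values.
toy = [
    ("p1", "A", -5.2), ("p2", "A", -6.8), ("p3", "A", -5.9),
    ("p4", "B", -7.1), ("p5", "B", -4.9), ("p6", "C", -6.0),
]
train, test = scaffold_split(toy, scaffold_of=lambda r: r[1], test_frac=0.34)
regression_targets = [r[2] for r in train]              # regression task
binary_targets = [binary_label(r[2]) for r in train]    # binary classification
soft_targets = [soft_label(r[2]) for r in train]        # soft-label classification
```

Because the scaffold split keeps every scaffold on one side only, the test peptides are structurally less similar to the training peptides than under a random split, which is the setting in which the abstract reports lower generalizability.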