Dan Yuanyuan, Ruan Junhao, Zhu Zhenghua, Yu Hualong
School of Environmental and Chemical Engineering, Jiangsu University of Science and Technology, Zhenjiang 212100, China.
School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China.
Molecules. 2025 Mar 31;30(7):1548. doi: 10.3390/molecules30071548.
Predicting the toxicity of drug molecules with in silico quantitative structure-activity relationship (QSAR) approaches helps guide safe drug development and shortens the development process. Advances in machine learning have made this task easier and more accurate, but prediction quality is still degraded by two factors: the severely skewed distribution of active and inactive chemicals, and the relatively high dimensionality of the feature space. To address both issues simultaneously, this study proposes a binary ant colony optimization feature selection algorithm called BACO. Specifically, BACO divides the labeled drug molecules into a training set and a validation set multiple times; for each division, the ant colony searches for the feature group that maximizes a weighted combination of three class-imbalance performance metrics (F-measure, G-mean, and the Matthews correlation coefficient, MCC) on the validation set. After all divisions have been run, the frequency with which each feature (descriptor) appears in the optimal feature groups is calculated, and the features are ranked by frequency in descending order. Only the high-frequency features are used to train a support vector machine (SVM) and construct the structure-activity relationship (SAR) prediction model. Experimental results on the 12 datasets of the Tox21 challenge, with molecules represented by descriptors from the Mordred descriptor calculator, show that BACO significantly outperforms several traditional feature selection approaches widely used in QSAR analysis. For most datasets, BACO requires only a few to a few dozen descriptors to reach its best performance, which indicates its effectiveness and potential application value in cheminformatics.
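The two scoring steps described in the abstract, the weighted imbalance-aware fitness and the frequency ranking of selected descriptors, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `imbalance_fitness` and `rank_features_by_frequency`, the equal weights, and the descriptor labels `d1` to `d4` are all assumptions made for illustration (the abstract does not state the weights), and the ant colony search and the SVM training are omitted.

```python
import math
from collections import Counter


def imbalance_fitness(y_true, y_pred, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of F-measure, G-mean and MCC for binary labels.

    Label 1 is the minority (active) class. The equal default weights are
    an assumption; the abstract does not specify them.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0

    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    g_mean = math.sqrt(recall * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    w_f, w_g, w_m = weights
    return w_f * f_measure + w_g * g_mean + w_m * mcc


def rank_features_by_frequency(selected_subsets, top_k):
    """Count how often each feature appears across the per-split optimal
    subsets and return the top_k most frequent ones (ties broken by name)."""
    counts = Counter(f for subset in selected_subsets for f in set(subset))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_k]]


if __name__ == "__main__":
    y_true = [1, 0, 0, 0]
    # A classifier that always predicts "inactive" is 75% accurate here,
    # yet all three imbalance metrics are zero.
    print(imbalance_fitness(y_true, [0, 0, 0, 0]))

    # Hypothetical optimal subsets from three train/validation divisions.
    subsets = [["d1", "d2", "d3"], ["d2", "d3"], ["d2", "d4"]]
    print(rank_features_by_frequency(subsets, top_k=2))
```

The all-inactive case shows why plain accuracy is a poor search objective on skewed data: it rewards ignoring the minority class, whereas F-measure, G-mean and MCC all fall to zero.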