Duy Huynh Anh, Srisongkram Tarapong
Graduate School in the Program of Research and Development in Pharmaceuticals, Faculty of Pharmaceutical Sciences, Khon Kaen University, Khon Kaen 40002, Thailand.
Department of Health Sciences, College of Natural Sciences, Can Tho University, Can Tho 900000, Vietnam.
J Chem Inf Model. 2025 Oct 13;65(19):10194-10220. doi: 10.1021/acs.jcim.5c01873. Epub 2025 Sep 12.
A carcinogenicity assessment of possibly carcinogenic chemicals (International Agency for Research on Cancer: IARC class 2B) was conducted using a consensus framework constructed from three complementary machine learning models: BiLSTM with MACCS fingerprints, LightGBM with RDKit descriptors, and Random Forest (RF) with E-state features. These models were developed and rigorously evaluated on benchmark carcinogenicity data sets, with LightGBM emerging as the top performer (accuracy = 0.800, MCC = 0.615, AUROC = 0.882, sensitivity = 0.739, specificity = 0.857). Consistent and competitive performance was also observed for RF and BiLSTM, affirming the reliability of individual predictions. Notably, LightGBM maintained strong generalization ability on independent human carcinogen test sets from IARC and IRIS (accuracy = 0.753, MCC = 0.535, AUROC = 0.842). For the ISSCAN internal test set, the top three models achieved MCC values ranging from 0.564 to 0.615, with AUROC scores between 0.858 and 0.882. For the human carcinogen test set, the top three models attained MCC values from 0.335 to 0.535 and AUROC scores ranging from 0.785 to 0.842. The consensus model was subsequently applied to 47 within-domain compounds from the 2B category, classifying them into 16 potential carcinogens, 8 presumed noncarcinogens, and 23 cases with inconclusive results. To uncover structural correlates, a SHAP-based interpretation of the BiLSTM model was performed, revealing discriminative molecular features including MACCS fingerprint keys and core Bemis-Murcko scaffolds associated with predicted carcinogenicity. To support practical applications, a freely accessible web server for carcinogenicity assessment has been developed and is available at https://carcinogenicity-predictor.streamlit.app.
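The three-way outcome described above (potential carcinogen, presumed noncarcinogen, inconclusive) can be illustrated with a small voting sketch. The abstract does not state how the three model outputs are combined, so the unanimity rule, the 0.5 probability threshold, the model keys, and the example probabilities below are all assumptions made for illustration and are not the authors' method.

```python
from collections import Counter

# Hypothetical keys for the three models named in the abstract.
MODELS = ("bilstm_maccs", "lightgbm_rdkit", "rf_estate")


def consensus_call(probabilities, threshold=0.5):
    """Combine per-model carcinogenicity probabilities into one label.

    Assumed rule (not taken from the paper): a compound is labeled only
    when all three models agree; any disagreement is 'inconclusive'.
    """
    missing = [name for name in MODELS if name not in probabilities]
    if missing:
        raise ValueError(f"missing model outputs: {missing}")

    # One boolean vote per model: True means "predicted carcinogenic".
    votes = [probabilities[name] >= threshold for name in MODELS]

    if all(votes):
        return "potential carcinogen"
    if not any(votes):
        return "presumed noncarcinogen"
    return "inconclusive"


def summarize(calls):
    """Count how many compounds fall into each consensus category."""
    return Counter(calls)


if __name__ == "__main__":
    # Made-up probabilities for three placeholder compounds.
    example = {
        "compound_A": {"bilstm_maccs": 0.91, "lightgbm_rdkit": 0.84, "rf_estate": 0.77},
        "compound_B": {"bilstm_maccs": 0.12, "lightgbm_rdkit": 0.30, "rf_estate": 0.21},
        "compound_C": {"bilstm_maccs": 0.65, "lightgbm_rdkit": 0.42, "rf_estate": 0.58},
    }
    calls = {name: consensus_call(p) for name, p in example.items()}
    print(calls)
    print(summarize(calls.values()))
```

Under a unanimity rule, any split vote is reported as inconclusive rather than forced into a class. A majority-vote rule would leave no inconclusive cases with three binary votes, which is why unanimity is used in this sketch.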