Liyaqat Tanya, Ahmad Tanvir, Kashif Mohammad, Saxena Chandni
Department of Computer Engineering, Jamia Millia Islamia, New Delhi, 110025, India.
The Chinese University of Hong Kong, Hong Kong, China.
Med Biol Eng Comput. 2025 Jun 5. doi: 10.1007/s11517-025-03392-0.
Mutagenicity is concerning due to its link to genetic mutations, which can lead to cancer and other adverse effects. Early identification of mutagenic compounds in drug development is crucial to prevent unsafe candidates and reduce costs. While computational techniques, especially machine learning (ML) models, have become prevalent for mutagenicity prediction, they typically rely on a single modality. Our work introduces a novel stacked ensemble mutagenicity prediction model that integrates multiple modalities, including SMILES and molecular graphs. These modalities capture diverse molecular information such as substructural, physicochemical, geometrical, and topological features. We use SMILES for deriving substructural, geometrical, and physicochemical data, while a graph attention network (GAT) extracts topological information from molecular graphs. Our model employs a stacked ensemble of ML classifiers and SHAP (Shapley Additive Explanations) to identify the significance of classifiers and key features. Our method outperforms state-of-the-art techniques on two standard datasets, achieving an area under the curve of 95.21% on the Hansen benchmark dataset. This research is expected to interest clinicians and computational biologists in translational research.
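The stacking idea described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' model. The toy `CentroidScorer` base learners, the synthetic four-column feature matrix, and the two hypothetical "modalities" (columns 0-1 and 2-3) stand in for the real classifiers, the SMILES-derived descriptors, and the GAT embeddings, and the SHAP analysis is omitted. What the sketch does show is the generic pattern: base learners produce out-of-fold scores, a logistic-regression meta-learner combines them, and the result is evaluated by area under the ROC curve.

```python
import math
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


class CentroidScorer:
    """Toy base learner (an assumption for illustration): scores a sample by
    which class centroid is closer, using only one modality's columns."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y):
        self.centroids = {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids[c] = [sum(r[j] for r in rows) / len(rows)
                                 for j in self.cols]
        return self

    def predict_proba(self, x):
        d = {c: sum((x[j] - m) ** 2 for j, m in zip(self.cols, cen))
             for c, cen in self.centroids.items()}
        return sigmoid(d[0] - d[1])  # closer to class 1 -> score above 0.5


def fit_logistic(Z, y, lr=0.1, epochs=200):
    """Meta-learner: logistic regression trained by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - t
            gb += err
            for j, zj in enumerate(z):
                gw[j] += err * zj
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def fit_stack(X, y, modalities, k=5):
    """Train one base learner per modality and stack them.

    The meta-learner only ever sees out-of-fold base scores, so it is not
    fitted on predictions the base learners made for their own training rows.
    """
    n = len(X)
    oof = [[0.0] * len(modalities) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        for m, cols in enumerate(modalities):
            base = CentroidScorer(cols).fit([X[i] for i in train],
                                            [y[i] for i in train])
            for i in held:
                oof[i][m] = base.predict_proba(X[i])
    w, b = fit_logistic(oof, y)
    # Refit every base learner on all training data for use at prediction time.
    bases = [CentroidScorer(cols).fit(X, y) for cols in modalities]

    def predict(x):
        z = [base.predict_proba(x) for base in bases]
        return sigmoid(b + sum(wi * zi for wi, zi in zip(w, z)))

    return predict


def make_data(n, rng):
    """Synthetic data: class 1 is centred at +1, class 0 at -1, unit noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else -1.0
        X.append([rng.gauss(centre, 1.0) for _ in range(4)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(300, rng)
X_test, y_test = make_data(200, rng)
predict = fit_stack(X_train, y_train, modalities=[[0, 1], [2, 3]])
stacked_auc = auc(y_test, [predict(x) for x in X_test])
print(f"stacked AUC on toy data: {stacked_auc:.3f}")
```

In a realistic setting the base learners would be full classifiers trained on each feature family, and the learned meta-weights (or a SHAP analysis, as the abstract describes) would indicate how much each one contributes to the final prediction.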