Lu Yutong, Li Yan Yi, Sun Yan, Hu Pingzhao
Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
Department of Biochemistry, Western University, London, ON, Canada.
J Cheminform. 2025 Aug 29;17(1):133. doi: 10.1186/s13321-025-01073-6.
Chemical Language Models (CLMs) have demonstrated the ability to extract patterns and make predictions from large volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, built on different architectures, can provide distinct insights into molecular properties. To harness these complementary strengths, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are then trained on these first-level predictions and embeddings to estimate the test losses, which are unavailable at inference time. The losses and predictions are concatenated into an integrated feature matrix, which is used to train second-level meta-models that produce the final predictions. Empirical testing on five datasets demonstrates that FusionCLM outperforms the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM's potential in advancing molecular property prediction.
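The stacking flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the first-level "CLMs" are replaced by ridge regressors fitted on random stand-in embeddings, the per-sample loss is assumed to be the squared error, and the names `fit_ridge`, `predict_ridge`, and `fusion_stack` are hypothetical helpers introduced here for illustration only.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the bias column is regularized too, for brevity."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def fusion_stack(train_embs, y_train, test_embs, alpha=1.0):
    """Stack several first-level models via predictions plus (estimated) losses."""
    train_feats, test_feats = [], []
    for E_tr, E_te in zip(train_embs, test_embs):
        # First level: a ridge model stands in for one fine-tuned CLM (assumption).
        w1 = fit_ridge(E_tr, y_train, alpha)
        p_tr, p_te = predict_ridge(w1, E_tr), predict_ridge(w1, E_te)
        # Per-sample training loss; squared error is an assumption of this sketch.
        loss_tr = (y_train - p_tr) ** 2
        # Auxiliary model: [embedding, prediction] -> loss, so that a loss
        # estimate is available for test samples whose labels are unknown.
        w_aux = fit_ridge(np.column_stack([E_tr, p_tr]), loss_tr, alpha)
        loss_te_hat = predict_ridge(w_aux, np.column_stack([E_te, p_te]))
        train_feats += [p_tr, loss_tr]
        test_feats += [p_te, loss_te_hat]
    # Integrated feature matrix: one prediction and one loss column per model.
    Z_tr = np.column_stack(train_feats)
    Z_te = np.column_stack(test_feats)
    # Second level: a meta-model trained on the integrated features.
    w_meta = fit_ridge(Z_tr, y_train, alpha)
    return predict_ridge(w_meta, Z_te), Z_tr, Z_te

# Synthetic stand-ins for the SMILES embeddings of three different CLMs.
rng = np.random.default_rng(0)
n_train, n_test = 80, 20
dims = (8, 16, 4)
train_embs = [rng.normal(size=(n_train, d)) for d in dims]
test_embs = [rng.normal(size=(n_test, d)) for d in dims]
y_train = train_embs[0][:, 0] + 0.1 * rng.normal(size=n_train)

y_pred, Z_tr, Z_te = fusion_stack(train_embs, y_train, test_embs)
print(y_pred.shape, Z_tr.shape, Z_te.shape)
```

With three first-level models, the integrated feature matrix has six columns (one prediction and one loss per model). This sketch computes first-level predictions in-sample for simplicity; a practical stacking setup would more likely use held-out or out-of-fold predictions to limit leakage into the meta-model.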