Key Laboratory of Industrial Ecology and Environmental Engineering (Ministry of Education), Dalian Key Laboratory on Chemicals Risk Control and Pollution Prevention Technology, School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China.
Key Laboratory of Integrated Regulation and Resources Development of Shallow Lakes of Ministry of Education, College of Environment, Hohai University, Nanjing 210098, China.
Environ Sci Technol. 2024 Sep 3;58(35):15650-15660. doi: 10.1021/acs.est.4c02421. Epub 2024 Jul 25.
Accurate prediction of parameters related to the environmental exposure of chemicals is crucial for the sound management of chemicals. However, the lack of large data sets for training models may result in poor prediction accuracy and robustness. Herein, a strategy integrating transfer learning (TL) and multitask learning (MTL) was proposed for constructing a graph neural network (GNN) model (abbreviated as the TL-MTL-GNN model), using n-octanol/water partition coefficients as the source domain. The TL-MTL-GNN model was trained to predict three bioaccumulation parameters on enlarged data sets covering 2496 compounds, each with at least one bioaccumulation parameter. Results show that the TL-MTL-GNN model outperformed single-task GNN models with and without TL, as well as conventional machine learning models trained with molecular descriptors or fingerprints. Applicability domains (ADs) were characterized by a state-of-the-art methodology based on structure-activity landscapes. The TL-MTL-GNN model coupled with the optimal AD was employed to predict bioaccumulation parameters for around 60,000 chemicals, with more than 13,000 compounds identified as bioaccumulative chemicals. The high predictive accuracy and robustness of the TL-MTL-GNN model demonstrate the feasibility of integrating the TL and MTL strategies in modeling small-sized data sets. The strategy holds significant potential for addressing small-data challenges in modeling environmental chemicals.
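Because the training compounds carry at least one, but not necessarily all three, bioaccumulation parameters, a multitask model has to cope with missing labels. The abstract does not state how this was handled. The sketch below shows one common option, a masked mean squared error that skips absent labels, and should be read as an assumption rather than the authors' implementation. The function name `masked_multitask_mse` and all numeric values are hypothetical.

```python
import math


def masked_multitask_mse(predictions, targets):
    """Mean squared error computed only over labels that are present.

    predictions, targets: lists of per-compound rows, one value per task.
    A missing label in `targets` is encoded as float('nan') and skipped.
    (Assumed convention for this sketch; not taken from the paper.)
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if math.isnan(target):
                continue  # no measurement for this task, so no loss term
            total += (pred - target) ** 2
            count += 1
    # Return 0.0 when no label is present at all, to avoid dividing by zero.
    return total / count if count else 0.0


nan = float("nan")
# Three hypothetical compounds, three tasks; every value here is made up.
preds = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [2.0, 2.0, 2.0]]
labels = [[1.0, nan, 2.0], [nan, 1.5, nan], [3.0, nan, nan]]
loss = masked_multitask_mse(preds, labels)
```

In a TL setting of the kind described, a loss like this would be applied during fine-tuning, after the shared GNN layers have been pretrained on the source-domain data (here, n-octanol/water partition coefficients); that pretraining step is not shown.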