Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom.
J Colloid Interface Sci. 2022 Nov;625:328-339. doi: 10.1016/j.jcis.2022.06.034. Epub 2022 Jun 9.
Predicting the surface tension (SFT)-log(c) profiles of hydrocarbon surfactants in aqueous solution is computationally non-trivial, and empirically challenging due to the diverse and complex architecture and interactions of surfactant molecules. Machine learning (ML), combining a data-based and knowledge-based approach, can provide a powerful means to relate molecular descriptors to SFT profiles.
A dataset of SFT for 154 model hydrocarbon surfactants at 20-30 °C was fitted to the Szyszkowski equation to extract three characteristic parameters (Γ, K and the critical micelle concentration (CMC)), which were then correlated with a series of 2D and 3D molecular descriptors. Key (∼10) descriptors were selected by removing co-correlated descriptors and by employing a gradient-boosted regressor model to rank feature importance and carry out recursive feature elimination (RFE). The hyperparameters of each target-variable model were fine-tuned using a randomised cross-validated grid search, to improve predictive ability and reduce overfitting.
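The parameter-extraction step can be illustrated with a minimal sketch. It assumes the common form of the Szyszkowski equation, γ(c) = γ0 − RTΓ ln(1 + Kc) below the CMC, with SI units (c in mol/m³, Γ in mol/m², K in m³/mol, γ in N/m). The function names, the grid-scan fitting strategy and the synthetic data are illustrative assumptions, not the authors' open-source code.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1


def szyszkowski(c, gamma0, gamma_max, k, T=298.15):
    """Surface tension (N/m) at concentration c (mol/m^3), valid below the CMC."""
    return gamma0 - R * T * gamma_max * math.log(1.0 + k * c)


def fit_szyszkowski(conc, sft, gamma0, T=298.15):
    """Fit (gamma_max, k) to pre-CMC data (hypothetical helper).

    k is scanned on a log grid; for each k the least-squares gamma_max
    has a closed form because the model is linear in gamma_max.
    """
    best = None
    y = [gamma0 - s for s in sft]  # surface pressure at each point
    for i in range(801):
        k = 10 ** (-4 + 8 * i / 800)  # 1e-4 to 1e4 m^3/mol (assumed range)
        x = [R * T * math.log(1.0 + k * c) for c in conc]
        gamma_max = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        sse = sum((gamma_max * a - b) ** 2 for a, b in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, gamma_max, k)
    return best[1], best[2]


def cmc_from_plateau(gamma_plateau, gamma0, gamma_max, k, T=298.15):
    """Concentration at which the fitted curve reaches the plateau SFT."""
    return (math.exp((gamma0 - gamma_plateau) / (R * T * gamma_max)) - 1.0) / k


# Synthetic, noise-free example data (not taken from the paper).
gamma0 = 0.072  # N/m, roughly pure water near 25 degrees C
conc = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
sft = [szyszkowski(c, gamma0, 3e-6, 10.0) for c in conc]

g_fit, k_fit = fit_szyszkowski(conc, sft, gamma0)
cmc_fit = cmc_from_plateau(szyszkowski(8.0, gamma0, 3e-6, 10.0), gamma0, g_fit, k_fit)
print(g_fit, k_fit, cmc_fit)
```

For real, noisy data a non-linear least-squares routine would normally replace the grid scan. The scan is used here only to keep the sketch dependency-free.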
The ML models correlate favourably with experimental test data, with R² = 0.69-0.87, and the merits and limitations of the approach are discussed based on 'unseen' hydrocarbon surfactants. Incorporating a knowledge-based framework provides an appropriate smoothing of the experimental data, which simplifies the data-driven approach and enhances its generality. Open-source codes and a brief tutorial are provided.
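A test-set score of this kind can be computed as in the minimal sketch below, which assumes the metric is the coefficient of determination. The function name and the sample values are hypothetical and are not results from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(measured, predicted))
```

A score of 1 means the predictions reproduce the test data exactly, while 0 means they do no better than predicting the mean.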