Jung Son Gyo, Jung Guwon, Cole Jacqueline M
Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
J Chem Inf Model. 2025 Jan 13;65(1):133-152. doi: 10.1021/acs.jcim.4c01862. Epub 2024 Dec 23.
Machine learning (ML) methods provide a pathway to accurately predict molecular properties, leveraging patterns derived from structure-property relationships within materials databases. This approach holds significant importance in drug discovery and materials design, where the rapid, efficient screening of molecules can accelerate the development of new pharmaceuticals and chemical materials for highly specialized target applications. Unsupervised and self-supervised learning methods applied to graph-based or geometric models have garnered considerable traction. More recently, transformer-based language models have emerged as powerful tools. Nevertheless, their application entails considerable computational resources, owing to the need for an extensive pretraining process on a vast corpus of unlabeled chemical data. To this end, we present a semisupervised strategy that harnesses substructure vector embeddings in conjunction with an ML-based feature selection workflow to predict various molecular and drug properties. We evaluate the efficacy of our modeling methodology across a diverse range of data sets, encompassing both regression and classification tasks. Our findings demonstrate superior performance compared to most existing state-of-the-art algorithms, while offering advantages in terms of balancing model accuracy with computational requirements. Moreover, our approach provides deeper insights into feature interactions that are essential for model interpretability. A case study is conducted to predict the lipophilicity of chemical molecules, exemplifying the robustness of our strategy. The result underscores the importance of meticulous feature analysis and selection over a mere reliance on predictive modeling with a high degree of algorithmic complexity.
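The idea of selecting informative features from molecular embedding vectors before fitting a predictive model can be illustrated with a minimal sketch. This is not the authors' workflow: the abstract does not specify the selection criterion, so the example assumes a simple filter that ranks features by absolute Pearson correlation with the target. The toy matrix `X` (standing in for substructure embedding features), the target `y` (standing in for a property such as lipophilicity), and the helper names `pearson` and `select_top_k` are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_k(X, y, k):
    """Return the indices of the k features most correlated with y, plus all scores."""
    n_features = len(X[0])
    scores = [
        abs(pearson([row[j] for row in X], y)) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical toy data: 5 molecules x 4 embedding features (made-up numbers).
X = [
    [0.0, 1.0, 5.0, 0.3],
    [1.0, 1.0, 3.0, 0.9],
    [2.0, 1.0, 4.0, 0.1],
    [3.0, 1.0, 1.0, 0.7],
    [4.0, 1.0, 2.0, 0.2],
]
# Hypothetical target values, one per molecule.
y = [0.1, 1.1, 1.9, 3.2, 3.9]

selected, scores = select_top_k(X, y, k=2)
print("selected feature indices:", selected)
```

A univariate filter of this kind is cheap and easy to inspect, but it ignores interactions between features; a workflow aimed at interpretability of feature interactions, as the abstract describes, would presumably need a model-based selection step instead.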