Department of Pharmaceutical Science, Peking University, Beijing 100191, China.
J Chem Inf Model. 2024 Apr 22;64(8):2955-2970. doi: 10.1021/acs.jcim.4c00004. Epub 2024 Mar 15.
Chemical reactions are the foundational building blocks of organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to design novel reactions, optimize existing ones for higher yields, and discover new synthetic routes to chemical structures. To address these challenges effectively with machine learning models, it is imperative to derive robust and informative representations, or to engineer features, from extensive reaction data sets. This work provides a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of SMILES, molecular fingerprints, molecular graphs, and physics-based properties are discussed in detail. Solutions that bridge the gap between different representations are also critically evaluated. Additionally, we introduce chemical reaction pretraining as a new frontier, a promising yet largely unexplored avenue.
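As a small illustration of the simplest representation named in the abstract, the sketch below splits a reaction SMILES string of the conventional form `reactants>agents>products` into per-role molecule lists. It is not taken from the review: the function name `split_reaction_smiles` and the example reaction are assumptions made for illustration, and the code does plain string handling only, with no chemical validation. A real pipeline would typically pass each molecule to a cheminformatics toolkit before computing fingerprints or graphs.

```python
# Hypothetical helper, not from the review: splits a reaction SMILES
# string into reactants, agents, and products using string operations only.
# No chemistry is checked; invalid SMILES pass through unchanged.

def split_reaction_smiles(rxn: str) -> dict:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators in reaction SMILES")
    roles = ("reactants", "agents", "products")
    # Molecules within one role are separated by '.'; empty roles give [].
    return {role: [m for m in part.split(".") if m]
            for role, part in zip(roles, parts)}


# Assumed example: acid-catalysed esterification of acetic acid with ethanol.
example = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
roles = split_reaction_smiles(example)
for role, molecules in roles.items():
    print(role, molecules)
```

The per-role lists are the usual starting point for the other featurizations the abstract mentions, since fingerprints and graphs are generally computed per molecule and then combined across reactants and products.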