Haas Brittany C, Kalyani Dipannita, Sigman Matthew S
Department of Chemistry, University of Utah, Salt Lake City, UT 84112, USA.
Discovery Chemistry, Merck & Co. Inc., Rahway, NJ 07065, USA.
Sci Adv. 2025 Jan 3;11(1):eadt3013. doi: 10.1126/sciadv.adt3013. Epub 2025 Jan 1.
The application of statistical modeling in organic chemistry is emerging as a standard practice for probing structure-activity relationships and as a predictive tool for many optimization objectives. This review is intended as a tutorial for those entering the field of statistical modeling in chemistry. We provide case studies that highlight the considerations and approaches that can be used to analyze datasets successfully in low-data regimes, a common situation given the experimental demands of organic chemistry. Statistical modeling hinges on the data (what is being modeled), descriptors (how data are represented), and algorithms (how data are modeled). Herein, we focus on how various reaction outputs (e.g., yield, rate, selectivity, solubility, stability, and turnover number) and data structures (e.g., binned, heavily skewed, and distributed) influence the choice of algorithm used for constructing predictive and chemically insightful statistical models.