Di Lascio Elena, Gerebtzoff Grégori, Rodríguez-Pérez Raquel
Novartis Institutes for Biomedical Research, Novartis Campus, Basel CH-4002, Switzerland.
Mol Pharm. 2023 Mar 6;20(3):1758-1767. doi: 10.1021/acs.molpharmaceut.2c00962. Epub 2023 Feb 6.
Machine learning (ML) has become an indispensable tool to predict absorption, distribution, metabolism, and excretion (ADME) properties in pharmaceutical research. ML algorithms are trained on molecular structures and corresponding ADME assay data to develop quantitative structure-property relationship (QSPR) models. Traditional QSPR models were trained on compound sets of limited size. With the advent of more complex ML algorithms and greater data availability, training sets have become larger and more diverse. The most common training approaches consist of training a model either with a small set of similar compounds, namely, compounds designed for the same drug discovery project or chemical series (local approach), or with a larger set of diverse compounds (global approach). Global models are built with all experimental data available for an assay, combining compound data from different projects and disease areas. Despite the ML progress made so far, the appropriate data composition for building ML models remains unclear. Herein, a systematic evaluation of local and global ML models was performed for 10 different experimental assays and 112 drug discovery projects. Results show a consistently superior performance of global models for ADME property predictions. Diagnostic analyses were also carried out to investigate the influence of training set size, structural diversity, and data shift on the relative performance of local and global ML models. Training set size and structural diversity did not have an impact on the relative performance of the methods. Instead, data shift helped to identify the projects with larger performance differences between local and global models. Results presented in this work can be leveraged to improve ML-based ADME property predictions and thus decision-making in drug discovery projects.
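The difference between the local and global approaches comes down to which compounds enter the training set. The following is a minimal sketch of that comparison, not the authors' pipeline: it assumes a single numeric descriptor per compound, a one-variable least-squares fit as a stand-in for any ML regressor, a random hold-out split, and mean absolute error as the metric. The data are synthetic, and all function and project names (`make_toy_data`, `compare_local_global`, `project_0`, and so on) are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (stand-in for any ML regressor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx


def mae(model, xs, ys):
    """Mean absolute error of a fitted (slope, intercept) model."""
    a, b = model
    return mean(abs(a * x + b - y) for x, y in zip(xs, ys))


def make_toy_data(n_projects=5, n_per_project=40, seed=0):
    """Synthetic (project, descriptor, assay value) rows; not real ADME data."""
    rng = random.Random(seed)
    rows = []
    for p in range(n_projects):
        center = rng.uniform(0, 10)  # each project occupies its own region
        for _ in range(n_per_project):
            x = rng.gauss(center, 1.0)
            y = 0.5 * x + rng.gauss(0, 0.3)
            rows.append((f"project_{p}", x, y))
    return rows


def compare_local_global(rows, test_fraction=0.25, seed=0):
    """Return {project: (local_mae, global_mae)} on held-out project compounds."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    results = {}
    for proj in sorted({r[0] for r in rows}):
        own = [r for r in shuffled if r[0] == proj]
        others = [r for r in shuffled if r[0] != proj]
        n_test = max(1, int(len(own) * test_fraction))
        test, local_train = own[:n_test], own[n_test:]
        # Global training set: the project's own training compounds plus
        # every compound from all other projects.
        global_train = local_train + others
        local_model = fit_line([r[1] for r in local_train],
                               [r[2] for r in local_train])
        global_model = fit_line([r[1] for r in global_train],
                                [r[2] for r in global_train])
        xs, ys = [r[1] for r in test], [r[2] for r in test]
        results[proj] = (mae(local_model, xs, ys), mae(global_model, xs, ys))
    return results


if __name__ == "__main__":
    for proj, (local_err, global_err) in compare_local_global(make_toy_data()).items():
        print(f"{proj}: local MAE={local_err:.3f}  global MAE={global_err:.3f}")
```

Both models are scored on the same held-out compounds from a single project, so any difference in error reflects only the training data composition. The toy data say nothing about which approach performs better on real assays.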