Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, California 94550, United States.
GlaxoSmithKline, 5 Crescent Drive, Philadelphia, Pennsylvania 19112, United States.
J Chem Inf Model. 2020 Apr 27;60(4):1955-1968. doi: 10.1021/acs.jcim.9b01053. Epub 2020 Apr 16.
One of the key requirements for incorporating machine learning (ML) into the drug discovery process is complete traceability and reproducibility of the model building and evaluation process. With this in mind, we have developed an end-to-end modular and extensible software pipeline for building and sharing ML models that predict key pharma-relevant parameters. The ATOM Modeling PipeLine, or AMPL, extends the functionality of the open source library DeepChem and supports an array of ML and molecular featurization tools. We have benchmarked AMPL on a large collection of pharmaceutical data sets covering a wide range of parameters. Our key findings indicate that traditional molecular fingerprints underperform other feature representation methods. We also find that data set size correlates directly with prediction performance, which points to the need to expand public data sets. Uncertainty quantification can help predict model error, but correlation with error varies considerably between data sets and model types. Our findings point to the need for an extensible pipeline that can be shared to make model building more widely accessible and reproducible. This software is open source and available at: https://github.com/ATOMconsortium/AMPL.
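The point that uncertainty estimates track model error only loosely can be made concrete with a rank correlation between predicted uncertainty and absolute error. The following is a minimal sketch, not AMPL's own API: the function names (`ranks`, `pearson`, `uncertainty_error_correlation`) and the toy numbers are assumptions introduced here for illustration, and Spearman rank correlation is one reasonable choice of measure rather than the one confirmed by the abstract.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    denom = (var_x * var_y) ** 0.5
    return cov / denom if denom > 0 else float("nan")


def uncertainty_error_correlation(y_true, y_pred, y_std):
    """Spearman correlation between predicted uncertainty and |error|.

    y_std is a hypothetical per-sample uncertainty estimate, e.g. a
    predicted standard deviation from an ensemble or dropout model.
    """
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return pearson(ranks(y_std), ranks(abs_err))


# Toy data (made up for illustration, not results from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.5, 2.0, 4.05]
y_std = [0.2, 0.6, 0.9, 0.1]
rho = uncertainty_error_correlation(y_true, y_pred, y_std)
print(f"Spearman rho between uncertainty and |error|: {rho:.2f}")
```

A value near 1 means samples with higher predicted uncertainty tend to have larger errors; a value near 0 means the uncertainty estimate carries little information about error for that data set and model type.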