Biometrics Research, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States.
Department of Statistics, The Ohio State University, Cockins Hall, 1958 Neil Avenue, Columbus, Ohio 43210, United States.
J Chem Inf Model. 2019 Jun 24;59(6):2642-2655. doi: 10.1021/acs.jcim.9b00094. Epub 2019 May 6.
Quantitative structure-activity relationship (QSAR) modeling is a widely used technique for predicting the biological activity of a molecule from the information contained in its molecular descriptors. The large number of compounds and descriptors, together with the sparseness of the descriptors, poses important challenges to the traditional statistical methods and machine learning (ML) algorithms, such as random forest (RF), used in this field. Recently, Bayesian Additive Regression Trees (BART), a flexible Bayesian nonparametric regression approach, has been demonstrated to be competitive with widely used ML approaches. Rather than focusing only on accurate point estimation, BART is formulated entirely in a hierarchical Bayesian modeling framework, which allows one to quantify uncertainties and hence to provide both point and interval estimates for a variety of quantities of interest. We studied BART as a model builder for QSAR and demonstrated that the approach tends to have predictive performance comparable to RF. More importantly, we investigated BART's natural capability to analyze truncated (or qualified) data, generate interval estimates for molecular activities as well as descriptor importance, and conduct model diagnosis, tasks that are not easily handled by other approaches.
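As a rough illustration of what "both point and interval estimation" means in practice, the sketch below summarizes a set of posterior draws for one compound's predicted activity into a posterior mean and a 95% equal-tailed credible interval. It is not the authors' implementation and it does not fit a BART model: the draws are simulated from a normal distribution as a hypothetical stand-in for MCMC output, and the function name `summarize_draws` and the nearest-rank quantile rule are assumptions made for this example.

```python
import random
import statistics


def summarize_draws(draws, level=0.95):
    """Return (posterior mean, lower bound, upper bound) for a list of draws.

    The interval is an equal-tailed credible interval computed from
    empirical quantiles using a simple nearest-rank rule.
    """
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower_index = int(tail * (n - 1))
    upper_index = int(round((1.0 - tail) * (n - 1)))
    return statistics.mean(ordered), ordered[lower_index], ordered[upper_index]


# Hypothetical stand-in for posterior draws of one compound's activity.
# A fitted Bayesian model would supply these; here they are simulated.
rng = random.Random(0)
draws = [rng.gauss(6.0, 0.5) for _ in range(2000)]

mean, lower, upper = summarize_draws(draws)
print(f"point estimate: {mean:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

The same summary could be applied to any quantity for which posterior draws are available, for example a descriptor importance score, which is the sense in which a fully Bayesian formulation yields intervals for more than just the predictions.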