Xu Yuting, Liaw Andy, Sheridan Robert P, Svetnik Vladimir
Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States.
Modeling and Informatics, Merck & Co., Inc., Rahway, New Jersey 07033, United States.
ACS Omega. 2024 Jun 27;9(27):29478-29490. doi: 10.1021/acsomega.4c02017. eCollection 2024 Jul 9.
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting the biological activities of compounds from their molecular descriptors. Besides accurate activity estimation, obtaining a prediction uncertainty metric such as a prediction interval is highly desirable. Quantifying prediction uncertainty is an active research area in statistical and machine learning (ML), but its implementation for QSAR remains challenging. Most ML algorithms with high predictive performance require add-on companions for estimating the uncertainty of their predictions. Conformal prediction (CP) is a promising approach because its main components are agnostic to the prediction model, and it produces valid prediction intervals under weak assumptions on the data distribution. We propose computationally efficient CP algorithms tailored to the most widely used ML models, including random forests, deep neural networks, and gradient boosting. The algorithms use a novel approach to deriving nonconformity scores from the estimates of prediction uncertainty generated by ensembles of point predictions. The validity and efficiency of the proposed algorithms are demonstrated on a diverse collection of QSAR data sets as well as in simulation studies. The provided software implementing our algorithms can be used stand-alone or easily incorporated into other ML software packages for QSAR modeling.
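The general idea of deriving a nonconformity score from ensemble spread can be illustrated with a minimal sketch. This is a generic split conformal procedure with a normalized score, |y - mean| / spread, and is not the authors' algorithms or software. Everything in it is an assumption made for illustration: the function names (`fit_ensemble`, `calibrate`, `interval`, `run_demo`), the bootstrap ensemble of one-descriptor linear fits standing in for a random forest or neural network ensemble, the synthetic data, and the small constant `eps`.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return my - b * mx, b


def fit_ensemble(xs, ys, n_models=50, rng=None):
    """Bootstrap ensemble of simple fits (hypothetical stand-in for an RF/DNN/GBM ensemble)."""
    rng = rng or random.Random(0)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Ensemble mean and spread (standard deviation of the member predictions)."""
    preds = [a + b * x for a, b in models]
    return statistics.fmean(preds), statistics.pstdev(preds)


def calibrate(models, xs_cal, ys_cal, alpha=0.1, eps=1e-8):
    """Conformal quantile of the normalized scores |y - mean| / (spread + eps)."""
    scores = []
    for x, y in zip(xs_cal, ys_cal):
        mean, spread = predict(models, x)
        scores.append(abs(y - mean) / (spread + eps))
    scores.sort()
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank with finite-sample correction
    return scores[k - 1] if k <= n else math.inf


def interval(models, x, q, eps=1e-8):
    """Prediction interval: ensemble mean +/- q * (spread + eps)."""
    mean, spread = predict(models, x)
    half = q * (spread + eps)
    return mean - half, mean + half


def run_demo(seed=0, alpha=0.1):
    """Synthetic check: returns (empirical coverage, mean interval width) on held-out data."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(600)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    train, cal, test = slice(0, 200), slice(200, 400), slice(400, 600)
    models = fit_ensemble(xs[train], ys[train], rng=rng)
    q = calibrate(models, xs[cal], ys[cal], alpha)
    hits, widths = 0, []
    for x, y in zip(xs[test], ys[test]):
        lo, hi = interval(models, x, q)
        hits += lo <= y <= hi
        widths.append(hi - lo)
    return hits / len(widths), statistics.fmean(widths)


if __name__ == "__main__":
    coverage, width = run_demo()
    print(f"coverage={coverage:.3f}, mean width={width:.3f}")
```

With `alpha=0.1` the intervals target roughly 90% marginal coverage, provided the calibration and test compounds are exchangeable. Dividing the residual by the ensemble spread makes the interval width vary per compound, which is where efficiency gains over a constant-width interval would come from when the spread tracks the actual error.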