Cheminformatics Department, Merck Research Laboratories, RY800-D133, Rahway, New Jersey 07065, United States.
J Chem Inf Model. 2013 Nov 25;53(11):2837-50. doi: 10.1021/ci400482e. Epub 2013 Nov 5.
In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield for QSAR is domain applicability. The aim is to estimate the reliability of prediction of a specific molecule on a specific activity model. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We can call this an "error model". A previous publication from our laboratory (Sheridan J. Chem. Inf. Model., 2012, 52, 814-823.) suggested the simultaneous use of three metrics would be more discriminating than any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model using multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and for the seven metrics we examine here, it appears that it is possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). These do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, which can be rate-limiting.