Dutta Debojyoti, Guha Rajarshi, Wild David, Chen Ting
School of Informatics, Indiana University, Bloomington, Indiana 47406, USA.
J Chem Inf Model. 2007 May-Jun;47(3):989-97. doi: 10.1021/ci600563w. Epub 2007 Apr 4.
Selecting a small subset of descriptors from a large pool to build a predictive quantitative structure-activity relationship (QSAR) model is an important step in the QSAR modeling process. In general, the subset selection problem is very hard to solve, even approximately, with guaranteed performance bounds. Traditional approaches employ deterministic or stochastic methods to obtain a descriptor subset that leads to an optimal model of a single type (such as linear regression or a neural network). With the development of ensemble modeling approaches, multiple models of differing types are developed individually, resulting in a different descriptor subset for each model type. However, from the point of view of developing interpretable QSAR models, it is advantageous to have a single set of descriptors that can be used for different model types. In this paper, we describe an approach to the selection of a single, optimal subset of descriptors for multiple model types. We apply this approach to three data sets, covering both regression and classification, and show that the constraint of forcing different model types to use the same set of descriptors does not lead to a significant loss in predictive ability for the individual models considered. In addition, interpretations of the individual models developed using this approach indicate that they encode similar structure-activity trends.
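The idea of choosing one descriptor subset that several model types must share can be illustrated with a small sketch. The abstract does not specify the search procedure or the objective function, so everything below is an assumption made for illustration: the exhaustive search over fixed-size subsets, the summed leave-one-out RMSE used as the joint score, the two stand-in model types (ordinary least squares and 1-nearest-neighbor), the synthetic data, and all function names. It is not the authors' method.

```python
import random
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols_predict(X_train, y_train, x_new):
    """Model type 1 (assumed): linear regression with intercept via normal equations."""
    Z = [[1.0] + row for row in X_train]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    for i in range(p):
        A[i][i] += 1e-8  # tiny ridge term, only to keep the system well conditioned
    b = [sum(z[i] * t for z, t in zip(Z, y_train)) for i in range(p)]
    w = solve(A, b)
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x_new))

def knn_predict(X_train, y_train, x_new):
    """Model type 2 (assumed): 1-nearest-neighbor regression."""
    d = [sum((a - c) ** 2 for a, c in zip(row, x_new)) for row in X_train]
    return y_train[d.index(min(d))]

def loo_rmse(predict, X, y):
    """Leave-one-out root-mean-square error of one model type."""
    err = 0.0
    for i in range(len(X)):
        pred = predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        err += (pred - y[i]) ** 2
    return (err / len(X)) ** 0.5

def select_shared_subset(X, y, size, models):
    """Return the descriptor subset minimizing the summed error of all model types."""
    best = None
    for cols in combinations(range(len(X[0])), size):
        Xs = [[row[c] for c in cols] for row in X]
        score = sum(loo_rmse(m, Xs, y) for m in models)
        if best is None or score < best[0]:
            best = (score, cols)
    return best[1], best[0]

# Synthetic data: the activity depends on descriptors 0 and 2 only;
# descriptors 1 and 3 are irrelevant.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
y = [2 * r[0] - 3 * r[2] + rng.gauss(0, 0.05) for r in X]

cols, score = select_shared_subset(X, y, 2, [ols_predict, knn_predict])
print("shared descriptor subset:", cols, "joint LOO RMSE:", round(score, 3))
```

Summing the errors is only one way to couple the model types; a weighted sum or a worst-case (maximum) error would express a different trade-off. Exhaustive enumeration is feasible here only because the pool holds four descriptors. For realistic pool sizes a stochastic or greedy search would be needed, in line with the abstract's remark that subset selection is hard in general.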