Terfloth Lothar, Bienfait Bruno, Gasteiger Johann
Computer-Chemie-Centrum and Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany.
J Chem Inf Model. 2007 Jul-Aug;47(4):1688-701. doi: 10.1021/ci700010t. Epub 2007 Jul 3.
A data set of 379 drugs and drug analogs that are metabolized by the human cytochrome P450 (CYP) isoforms 3A4, 2D6, or 2C9 was studied. A series of descriptor sets directly calculable from the constitution of these drugs was systematically investigated for their power to classify a compound according to the CYP isoform that metabolizes it. In a four-step build-up process, a total of 303 different descriptor components were investigated for the 146 compounds of a training set using various model-building methods, such as multinomial logistic regression, decision trees, and support vector machines (SVM). Automatic variable selection algorithms were used to reduce the number of descriptors. A comprehensive scheme of cross-validation (CV) experiments was applied to assess the robustness and reliability of the four models developed. In addition, the predictive power of the four models presented in this paper was examined by predicting an external validation data set of 233 compounds. The best model has a leave-one-out (LOO) cross-validated predictivity of 89% and gives 83% correct predictions for the external validation data set. For our favored model, we showed that the way a data set is split into training and test sets strongly influences the predictivity.
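The two validation ideas mentioned in the abstract, leave-one-out cross-validation and the sensitivity of predictivity to the training/test split, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic two-dimensional points (not the paper's compounds or descriptors), the classifier is a simple nearest-centroid rule standing in for the logistic regression, decision tree, and SVM models of the study, and all function names are hypothetical.

```python
import random

ISOFORMS = ("3A4", "2D6", "2C9")


def make_toy_data(n_per_class=20, seed=0):
    """Synthetic 2-descriptor points, one Gaussian cluster per isoform label.

    Purely illustrative; these are not real drug descriptors.
    """
    rng = random.Random(seed)
    centers = {"3A4": (0.0, 0.0), "2D6": (3.0, 0.0), "2C9": (0.0, 3.0)}
    data = []
    for label in ISOFORMS:
        cx, cy = centers[label]
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return data


def fit_centroids(train):
    """Return the mean point of each class in the training data."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    x, y = point
    return min(
        centroids,
        key=lambda lab: (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2,
    )


def accuracy(train, test):
    """Fraction of test points classified correctly by a model fit on train."""
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, point) == label for point, label in test)
    return hits / len(test)


def loo_accuracy(data):
    """Leave-one-out CV: hold out each point once, train on all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(fit_centroids(train), point) == label
    return hits / len(data)


def split_accuracies(data, n_splits=20, train_fraction=0.4, seed=1):
    """Test-set accuracy over repeated random training/test splits."""
    rng = random.Random(seed)
    n_train = int(len(data) * train_fraction)
    results = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        results.append(accuracy(shuffled[:n_train], shuffled[n_train:]))
    return results


if __name__ == "__main__":
    data = make_toy_data()
    accs = split_accuracies(data)
    print(f"LOO accuracy: {loo_accuracy(data):.2f}")
    print(f"Random-split accuracy range: {min(accs):.2f} to {max(accs):.2f}")
```

The default `train_fraction` of 0.4 is chosen to roughly mirror the 146-of-379 training proportion reported above. The spread between the minimum and maximum split accuracies gives a rough feel for how much a single reported test-set figure can depend on the particular split.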