Structural Chemogenomics, Laboratory of Therapeutical Innovation, UMR 7200 CNRS, University of Strasbourg, F-67400 Illkirch, France.
J Chem Inf Model. 2011 Jul 25;51(7):1593-603. doi: 10.1021/ci200166t. Epub 2011 Jun 21.
Computational chemogenomic (or proteochemometric) methods predict target-ligand interactions by training machine learning algorithms on known experimental data to distinguish the attributes of true target-ligand pairs from those of false ones. Many ligand and target descriptors can be used for training and for predicting binary associations or even binding affinities. Several chemogenomic studies have reported no real benefit from using 3-D structural target descriptors over simpler sequence-based or property-based information. To assess whether this observation results from inaccurate target description or from the fact that 3-D information is simply not required in chemogenomic modeling, we used a target kernel measuring the distance between target-ligand binding sites of known X-ray structures. When used in combination with a standard ligand kernel in a support vector machine (SVM) classifier, the 3-D target kernel significantly outperforms a sequence-based target kernel in discriminating 2882 target-ligand PDB complexes from 9128 false pairs, whatever the modeling procedure (local or global). The best SVM models could be successfully applied to predict, with very high recall (70%), precision (99%), and specificity (99%), target-ligand associations for an external set of 14,117 ligands and 531 targets. In most cases, pooling all data in a global model gave better statistics than just discretizing specific target-ligand subspaces in local models. The current study clearly demonstrates that chemogenomic models taking both ligand and target information outperform simpler ligand-based models.
It also suggests good modeling practices for predicting target-ligand pairing across a large array of targets: (i) ligand-based models are precise enough when sufficient ligand information (>40-50 diverse ligands) is available; (ii) otherwise, structure-based chemogenomic models (combining a ligand kernel with a structure-based target kernel) are recommended for proteins with known holo structures; (iii) sequence-based chemogenomic models (combining a ligand kernel with a sequence-based target kernel) can still be used with very good accuracy for the remaining targets.
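
As an illustration only, the sketch below shows one common way to combine a ligand kernel with a target kernel for scoring target-ligand pairs, and encodes recommendations (i) to (iii) as a simple decision rule. The abstract does not specify how the two kernels are combined or how binding-site distance becomes a similarity, so the product form, the Tanimoto ligand kernel, the exponential `site_kernel`, the `gamma` parameter, the cutoff of 50 ligands, and all function names are assumptions made for this example, not the authors' implementation.

```python
from math import exp


def tanimoto(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def site_kernel(distance: float, gamma: float = 1.0) -> float:
    """Hypothetical target kernel: maps a binding-site distance to a similarity.

    The exponential form and gamma are assumptions for illustration only.
    """
    return exp(-gamma * distance)


def pair_kernel(lig_a: frozenset, lig_b: frozenset,
                site_distance: float, gamma: float = 1.0) -> float:
    """Similarity of two target-ligand pairs as the product of a ligand
    kernel and a target kernel (assumed combination rule)."""
    return tanimoto(lig_a, lig_b) * site_kernel(site_distance, gamma)


def recommend_model(n_diverse_ligands: int, has_holo_structure: bool,
                    ligand_threshold: int = 50) -> str:
    """Encode recommendations (i)-(iii); the source gives '>40-50' diverse
    ligands, so the default cutoff of 50 is an assumption."""
    if n_diverse_ligands > ligand_threshold:
        return "ligand-based"
    if has_holo_structure:
        return "structure-based chemogenomic"
    return "sequence-based chemogenomic"


if __name__ == "__main__":
    lig_1 = frozenset({1, 2, 3})
    lig_2 = frozenset({2, 3, 4})
    # Identical binding sites (distance 0) leave only the ligand similarity.
    print(pair_kernel(lig_1, lig_2, site_distance=0.0))
    print(recommend_model(n_diverse_ligands=10, has_holo_structure=True))
```

A Gram matrix built from `pair_kernel` over all training pairs could in principle be passed to an SVM that accepts precomputed kernels, provided the chosen target similarity yields a valid (positive semidefinite) kernel.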