Springer Clayton, Adalsteinsson Helgi, Young Malin M, Kegelmeyer Philip W, Roe Diana C
Sandia National Labs, P.O. Box 969, MS 9951, Livermore, CA 94551, USA.
J Med Chem. 2005 Nov 3;48(22):6821-31. doi: 10.1021/jm0493360.
In this work we introduce PostDOCK, a postprocessing filter that distinguishes true ligand-protein binding complexes from docking artifacts produced by DOCK 4.0.1. PostDOCK is a pattern recognition system that relies on (1) a database of complexes, (2) biochemical descriptors of those complexes, and (3) machine learning tools. We use the Protein Data Bank (PDB) as the structural database of complexes and create diverse training and validation sets from it based on the "families of structurally similar proteins" (FSSP) hierarchy. For the biochemical descriptors, we consider terms from the DOCK score, empirical scoring, and buried solvent-accessible surface area. For the machine learners, we use a random forest classifier and logistic regression. Our results were obtained on a test set of 44 structurally diverse protein targets. Our highest-performing descriptor combinations obtained approximately 19-fold enrichment (39 of 44 binding complexes were correctly identified, while only 2 of 44 decoy complexes were accepted), and our best overall accuracy was 92%.
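The reported figures can be checked with a short calculation. The sketch below is an illustration only: the abstract does not define "enrichment", so the code assumes it is the ratio of the true-positive rate to the false-positive rate, and it assumes the 92% accuracy refers to the same 39/44 and 2/44 counts. The function name and parameters are hypothetical.

```python
def enrichment_and_accuracy(true_pos, n_binders, false_pos, n_decoys):
    """Return (enrichment, accuracy) for a binary binder/decoy filter.

    Assumption: enrichment is taken as TPR / FPR; the abstract does not
    state the exact definition used in the paper.
    """
    tpr = true_pos / n_binders      # fraction of true binders accepted
    fpr = false_pos / n_decoys      # fraction of decoys wrongly accepted
    enrichment = tpr / fpr
    true_neg = n_decoys - false_pos  # decoys correctly rejected
    accuracy = (true_pos + true_neg) / (n_binders + n_decoys)
    return enrichment, accuracy


# Counts quoted in the abstract: 39 of 44 binders kept, 2 of 44 decoys kept.
enrichment, accuracy = enrichment_and_accuracy(39, 44, 2, 44)
print(f"enrichment: {enrichment:.1f}-fold")
print(f"accuracy: {accuracy:.1%}")
```

Under these assumptions the counts give a ratio of 19.5 and an accuracy of 81/88, which is consistent with the "approximately 19-fold" and 92% values quoted above.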