Biosciences and Biotechnology Division, Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America.
PLoS One. 2013 May 10;8(5):e62535. doi: 10.1371/journal.pone.0062535. Print 2013.
We present Catalytic Site Identification (CatSId), an algorithm that identifies enzyme function from catalytic residues. The method is optimized for highly accurate template identification across a diverse template library, and it is efficient in both run time and scalability of comparisons. The algorithm matches three-dimensional residue arrangements in a query protein against a library of manually annotated catalytic residues, the Catalytic Site Atlas (CSA). Two main processes are involved. Process 1 is a rapid protein-to-template matching algorithm that scales quadratically with target protein size and linearly with template size. Process 2 incorporates a number of physical descriptors, including binding site predictions, in a logistic scoring procedure that re-scores the matches found in Process 1. The approach performs very well overall, with a receiver operating characteristic (ROC) area under the curve (AUC) of 0.971 on the training set evaluated. The procedure handles cofactors, ions, nonstandard residues, and point substitutions for residues and ions in a robust and integrated fashion. Sites with only two critical (catalytic) residues are challenging cases, giving AUCs of 0.9411 and 0.5413 for the training and test sets, respectively. The remaining sites, whose templates contain more than two critical (catalytic) residues, show excellent performance, with AUCs greater than 0.90 for both the training and test data. The procedure has considerable promise for larger-scale searches.
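To make the re-scoring and evaluation steps concrete, the following is a minimal sketch of logistic scoring followed by a rank-based AUC calculation. It is not the CatSId implementation: the two descriptors (a template RMSD and a binding-site overlap fraction), the weights, the bias, and the labelled matches are all hypothetical values chosen for illustration, and the abstract does not specify which descriptors or coefficients the authors used.

```python
import math


def logistic_score(features, weights, bias):
    """Map a descriptor vector to a score in (0, 1) with a logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def roc_auc(scores, labels):
    """ROC AUC computed as the fraction of (positive, negative) pairs in which
    the positive has the higher score; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical coefficients: a lower RMSD and a higher overlap raise the score.
WEIGHTS = (-2.0, 3.0)
BIAS = 0.0

# Hypothetical candidate matches: ((rmsd_angstrom, site_overlap), is_true_site)
matches = [
    ((0.4, 0.9), 1),
    ((0.8, 0.7), 1),
    ((1.5, 0.2), 0),
    ((0.3, 0.5), 0),
    ((1.2, 0.8), 1),
]

scores = [logistic_score(features, WEIGHTS, BIAS) for features, _ in matches]
labels = [label for _, label in matches]
auc = roc_auc(scores, labels)
print(f"AUC on the toy matches: {auc:.3f}")
```

The pairwise definition of AUC used here is equivalent to the area under the ROC curve and keeps the sketch free of external dependencies. In practice the coefficients would be fitted on a labelled training set, and the AUC would be reported separately for training and test data, as the abstract does.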