Benchmarking protein classification algorithms via supervised cross-validation.

Kertész-Farkas Attila, Dhir Somdutta, Sonego Paolo, Pacurar Mircea, Netoteia Sergiu, Nijveen Harm, Kuzniar Arnold, Leunissen Jack A M, Kocsor András, Pongor Sándor

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary.

J Biochem Biophys Methods. 2008 Apr 24;70(6):1215-23. doi: 10.1016/j.jbbm.2007.05.011. Epub 2007 May 31.

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
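The splitting principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `supervised_splits`, the `(protein_id, group, subgroup)` record layout, and the `neg_test_fraction` parameter are all hypothetical, and the random assignment of whole negative subgroups to the test side is an assumption made here to mimic "a combination of supervised and random sampling". For one target group (for example a superfamily), each of its subgroups (for example a family) is held out in turn as the positive test set, so the classifier is always tested on a subtype it has not seen during training.

```python
import random
from collections import defaultdict


def supervised_splits(records, target_group, seed=0, neg_test_fraction=0.5):
    """Yield one train/test split per subgroup of `target_group`.

    records: iterable of (protein_id, group, subgroup) tuples (hypothetical layout).
    Labels are 1 for members of `target_group` and 0 for everything else.
    """
    rng = random.Random(seed)
    pos_by_subgroup = defaultdict(list)
    neg_by_subgroup = defaultdict(list)
    for pid, group, subgroup in records:
        bucket = pos_by_subgroup if group == target_group else neg_by_subgroup
        bucket[(group, subgroup)].append(pid)

    # Assumption: negatives are divided by whole subgroups, chosen at random,
    # so that no negative subgroup contributes to both train and test.
    neg_keys = sorted(neg_by_subgroup)
    rng.shuffle(neg_keys)
    n_test = int(len(neg_keys) * neg_test_fraction)
    neg_test = [p for k in neg_keys[:n_test] for p in neg_by_subgroup[k]]
    neg_train = [p for k in neg_keys[n_test:] for p in neg_by_subgroup[k]]

    for held_out in sorted(pos_by_subgroup):
        pos_test = pos_by_subgroup[held_out]
        pos_train = [
            p for key, members in pos_by_subgroup.items()
            if key != held_out
            for p in members
        ]
        if not pos_train:
            continue  # a group with a single subgroup cannot be tested this way
        yield {
            "held_out": held_out[1],
            "train": [(p, 1) for p in pos_train] + [(p, 0) for p in neg_train],
            "test": [(p, 1) for p in pos_test] + [(p, 0) for p in neg_test],
        }
```

In a random k-fold split, members of the same subgroup would usually land on both sides of the split, which is why such schemes tend to give more optimistic performance estimates than the subgroup-wise split sketched here.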


Similar articles

Benchmarking protein classification algorithms via supervised cross-validation.

J Biochem Biophys Methods. 2008 Apr 24;70(6):1215-23. doi: 10.1016/j.jbbm.2007.05.011. Epub 2007 May 31.

A Protein Classification Benchmark collection for machine learning.

Nucleic Acids Res. 2007 Jan;35(Database issue):D232-6. doi: 10.1093/nar/gkl812. Epub 2006 Nov 16.

Fast model-based protein homology detection without alignment.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

Variable predictive model based classification algorithm for effective separation of protein structural classes.

Comput Biol Chem. 2008 Aug;32(4):302-6. doi: 10.1016/j.compbiolchem.2008.03.009. Epub 2008 Apr 1.

Supervised machine learning algorithms for protein structure classification.

Comput Biol Chem. 2009 Jun;33(3):216-23. doi: 10.1016/j.compbiolchem.2009.04.004. Epub 2009 May 3.

Application of compression-based distance measures to protein sequence classification: a methodological study.

Bioinformatics. 2006 Feb 15;22(4):407-12. doi: 10.1093/bioinformatics/bti806. Epub 2005 Nov 29.

Classification and knowledge discovery in protein databases.

J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.

LogitBoost classifier for discriminating thermophilic and mesophilic proteins.

J Biotechnol. 2007 Jan 10;127(3):417-24. doi: 10.1016/j.jbiotec.2006.07.020. Epub 2006 Aug 1.

Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure.

Bioinformatics. 2006 Nov 15;22(22):2753-60. doi: 10.1093/bioinformatics/btl475. Epub 2006 Sep 11.

Multiple classifier integration for the prediction of protein structural classes.

J Comput Chem. 2009 Nov 15;30(14):2248-54. doi: 10.1002/jcc.21230.

Cited by

Prediction of inhibitory peptides against E.coli with desired MIC value.

Sci Rep. 2025 Feb 8;15(1):4672. doi: 10.1038/s41598-025-86638-z.

Descriptor: .

IEEE Data Descr. 2024;1:109-112. doi: 10.1109/ieeedata.2024.3482283. Epub 2024 Oct 17.

GaIn: Human Gait Inference for Lower Limbic Prostheses for Patients Suffering from Double Trans-Femoral Amputation.

Sensors (Basel). 2018 Nov 26;18(12):4146. doi: 10.3390/s18124146.

Crop classification by forward neural network with adaptive chaotic particle swarm optimization.

Sensors (Basel). 2011;11(5):4721-43. doi: 10.3390/s110504721. Epub 2011 May 2.
