A Protein Classification Benchmark collection for machine learning.

Authors

Sonego Paolo, Pacurar Mircea, Dhir Somdutta, Kertész-Farkas Attila, Kocsor András, Gáspári Zoltán, Leunissen Jack A M, Pongor Sándor

Affiliation

Protein Structure and Bioinformatics Group, International Centre for Genetic Engineering and Biotechnology, Padriciano 99, 34012 Trieste, Italy.

Publication information

Nucleic Acids Res. 2007 Jan;35(Database issue):D232-6. doi: 10.1093/nar/gkl812. Epub 2006 Nov 16.

DOI:10.1093/nar/gkl812
PMID:17142240
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1669728/

Abstract

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.
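
As a rough illustration of how one such task might be evaluated from a precomputed all-vs-all distance matrix, the sketch below scores test items with a simple nearest-neighbour rule and summarizes the ranking with the area under the ROC curve. Everything in it is an assumption made for illustration: the toy matrix `DIST`, the index and label lists, and the function names are hypothetical, and neither the scoring rule nor the data layout is taken from the benchmark's own files or evaluation protocol.

```python
from typing import List, Sequence

# Hypothetical 6 x 6 symmetric distance matrix (all-vs-all comparison).
# Items 0-2 belong to the positive group, items 3-5 to the negative group.
DIST: List[List[float]] = [
    [0.0, 0.2, 0.3, 0.9, 0.8, 0.7],
    [0.2, 0.0, 0.4, 0.8, 0.9, 0.9],
    [0.3, 0.4, 0.0, 0.7, 0.8, 0.6],
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.3],
    [0.8, 0.9, 0.8, 0.1, 0.0, 0.2],
    [0.7, 0.9, 0.6, 0.3, 0.2, 0.0],
]

# Hypothetical split of one task into training/test and positive (1)/negative (0).
TRAIN_IDX = [0, 1, 3, 4]
TRAIN_LABELS = [1, 1, 0, 0]
TEST_IDX = [2, 5]
TEST_LABELS = [1, 0]


def nearest_neighbor_scores(
    dist: Sequence[Sequence[float]],
    train_idx: Sequence[int],
    train_labels: Sequence[int],
    test_idx: Sequence[int],
) -> List[float]:
    """Score each test item as d(nearest negative) - d(nearest positive).

    A larger score means the item lies closer to the positive training set.
    """
    scores = []
    for t in test_idx:
        d_pos = min(dist[t][j] for j, y in zip(train_idx, train_labels) if y == 1)
        d_neg = min(dist[t][j] for j, y in zip(train_idx, train_labels) if y == 0)
        scores.append(d_neg - d_pos)
    return scores


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    test_scores = nearest_neighbor_scores(DIST, TRAIN_IDX, TRAIN_LABELS, TEST_IDX)
    print("scores:", test_scores)
    print("AUC:", roc_auc(test_scores, TEST_LABELS))
```

Because the classifier only reads entries of the distance matrix, the same two functions could be reused with any sequence or structure comparison method that yields such a matrix, which is one way to compare methods under identical train/test splits.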

Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e708/1781154/919dc9ce44e2/gkl812f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e708/1781154/be1414164ff2/gkl812f2.jpg

Similar articles

1. A Protein Classification Benchmark collection for machine learning.
   Nucleic Acids Res. 2007 Jan;35(Database issue):D232-6. doi: 10.1093/nar/gkl812. Epub 2006 Nov 16.
2. Benchmarking protein classification algorithms via supervised cross-validation.
   J Biochem Biophys Methods. 2008 Apr 24;70(6):1215-23. doi: 10.1016/j.jbbm.2007.05.011. Epub 2007 May 31.
3. Supervised machine learning algorithms for protein structure classification.
   Comput Biol Chem. 2009 Jun;33(3):216-23. doi: 10.1016/j.compbiolchem.2009.04.004. Epub 2009 May 3.
4. Accurate prediction of solvent accessibility using neural networks-based regression.
   Proteins. 2004 Sep 1;56(4):753-67. doi: 10.1002/prot.20176.
5. Global sequence properties for superfamily prediction: a machine learning approach.
   J Integr Bioinform. 2009 Aug 23;6(1):109. doi: 10.2390/biecoll-jib-2009-109.
6. Protein classification with imbalanced data.
   Proteins. 2008 Mar;70(4):1125-32. doi: 10.1002/prot.21870.
7. Fast model-based protein homology detection without alignment.
   Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
8. Variable predictive model based classification algorithm for effective separation of protein structural classes.
   Comput Biol Chem. 2008 Aug;32(4):302-6. doi: 10.1016/j.compbiolchem.2008.03.009. Epub 2008 Apr 1.
9. Classification and knowledge discovery in protein databases.
   J Biomed Inform. 2004 Aug;37(4):224-39. doi: 10.1016/j.jbi.2004.07.008.
10. Inferring boundary information of discontinuous-domain proteins.
    IEEE Trans Nanobioscience. 2008 Sep;7(3):200-5. doi: 10.1109/TNB.2008.2002283.

Cited by

1. Descriptor: .
   IEEE Data Descr. 2024;1:109-112. doi: 10.1109/ieeedata.2024.3482283. Epub 2024 Oct 17.
2. Metrology of convex-shaped nanoparticles soft classification machine learning of TEM images.
   Nanoscale Adv. 2021 Oct 13;3(24):6956-6964. doi: 10.1039/d1na00524c. eCollection 2021 Dec 7.
3. A Comparative Study of Machine Learning Methods for Persistence Diagrams.
   Front Artif Intell. 2021 Jul 28;4:681174. doi: 10.3389/frai.2021.681174. eCollection 2021.
4. ComQXPA quorum sensing systems may not be unique to Bacillus subtilis: a census in prokaryotic genomes.
   PLoS One. 2014 May 2;9(5):e96122. doi: 10.1371/journal.pone.0096122. eCollection 2014.
5. How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis.
   BMC Genomics. 2012 Jun 18;13 Suppl 4(Suppl 4):S2. doi: 10.1186/1471-2164-13-S4-S2.
6. ccPDB: compilation and creation of data sets from Protein Data Bank.
   Nucleic Acids Res. 2012 Jan;40(Database issue):D486-9. doi: 10.1093/nar/gkr1150. Epub 2011 Dec 1.
7. Data mining approaches for genome-wide association of mood disorders.
   Psychiatr Genet. 2012 Apr;22(2):55-61. doi: 10.1097/YPG.0b013e32834dc40d.
8. Multi-netclust: an efficient tool for finding connected clusters in multi-parametric networks.
   Bioinformatics. 2010 Oct 1;26(19):2482-3. doi: 10.1093/bioinformatics/btq435. Epub 2010 Aug 2.
9. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment.
   Nucleic Acids Res. 2010 Nov;38(21):7353-63. doi: 10.1093/nar/gkq625. Epub 2010 Jul 17.
10. A biosegmentation benchmark for evaluation of bioimage analysis methods.
    BMC Bioinformatics. 2009 Nov 1;10:368. doi: 10.1186/1471-2105-10-368.

References

1. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching.
   Comput Chem. 1996 Mar;20(1):25-33. doi: 10.1016/s0097-8485(96)80004-0.
2. Benchmark for evaluating the quality of DNA sequencing: proposal from an international external quality assessment scheme.
   Clin Chem. 2006 Apr;52(4):728-36. doi: 10.1373/clinchem.2005.061887. Epub 2006 Feb 2.
3. BIOREL: the benchmark resource to estimate the relevance of the gene networks.
   FEBS Lett. 2006 Feb 6;580(3):844-8. doi: 10.1016/j.febslet.2005.12.101. Epub 2006 Jan 18.
4. Application of compression-based distance measures to protein sequence classification: a methodological study.
   Bioinformatics. 2006 Feb 15;22(4):407-12. doi: 10.1093/bioinformatics/bti806. Epub 2005 Nov 29.
5. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark.
   Proteins. 2005 Oct 1;61(1):127-36. doi: 10.1002/prot.20527.
6. Protein-Protein Docking Benchmark 2.0: an update.
   Proteins. 2005 Aug 1;60(2):214-6. doi: 10.1002/prot.20560.
7. Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm.
   Bioinformatics. 2005 Aug 1;21(15):3322-3. doi: 10.1093/bioinformatics/bti513. Epub 2005 May 24.
8. Taxonomic utility of a phylogenetic analysis of phosphoglycerate kinase proteins of Archaea, Bacteria, and Eukaryota: insights by Bayesian analyses.
   Mol Phylogenet Evol. 2005 May;35(2):420-30. doi: 10.1016/j.ympev.2005.02.002.
9. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis.
   Nucleic Acids Res. 2005 Jan 1;33(Database issue):D247-51. doi: 10.1093/nar/gki024.
10. SABmark--a benchmark for sequence alignment that covers the entire known fold space.
    Bioinformatics. 2005 Apr 1;21(7):1267-8. doi: 10.1093/bioinformatics/bth493. Epub 2004 Aug 27.