MultiDomainBenchmark: a multi-domain query and subject database suite.

Affiliations

TSYS School of Computer Science, Columbus State University, 4225 University Avenue, Columbus, GA 31907, USA.

National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

BMC Bioinformatics. 2019 Feb 14;20(1):77. doi: 10.1186/s12859-019-2660-5.

DOI: 10.1186/s12859-019-2660-5

PMID: 30764761

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6376684/

Abstract

BACKGROUND

Genetic sequence database retrieval benchmarks play an essential role in evaluating the performance of sequence searching tools. To date, all phylogenetically diverse benchmarks known to the authors include only query sequences with single protein domains. Domains are the primary building blocks of protein structure and function. Independently, each domain can fulfill a single function, but most proteins (>80% in Metazoa) exist as multi-domain proteins. Multiple domain units combine in various arrangements or architectures to create different functions and are often under evolutionary pressures to yield new ones. Thus, it is crucial to create gold standards reflecting the multi-domain complexity of real proteins to more accurately evaluate sequence searching tools.

DESCRIPTION

This work introduces MultiDomainBenchmark (MDB), a database suite of 412 curated multi-domain queries and 227,512 target sequences, representing at least 5108 species and 1123 phylogenetically divergent protein families, their relevancy annotation, and domain location. Here, we use the benchmark to evaluate the performance of two commonly used sequence searching tools, BLAST/PSI-BLAST and HMMER. Additionally, we introduce a novel classification technique for multi-domain proteins to evaluate how well an algorithm recovers a domain architecture.
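To make the evaluation idea concrete, the following is a minimal sketch of how a ranked hit list from a search tool might be scored against relevancy annotation, together with a simple check of whether a hit preserves the query's domain order. It is not the authors' method: the abstract does not describe their classification technique, and the function names, identifiers, and sample data below are all hypothetical.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(ranked_hits: Sequence[str],
                          relevant: Set[str],
                          k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k hits of one query.

    ranked_hits: target IDs in the order the tool returned them.
    relevant:    target IDs annotated as relevant for this query.
    """
    top_k = ranked_hits[:k]
    true_positives = sum(1 for hit in top_k if hit in relevant)
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def recovers_architecture(query_domains: Sequence[str],
                          hit_domains: Sequence[str]) -> bool:
    """True if every query domain appears in the hit, in the same order.

    Assumption: an architecture is treated as an ordered list of domain
    names, and extra domains in the hit are allowed.
    """
    remaining = iter(hit_domains)
    # `domain in remaining` advances the iterator, which enforces order.
    return all(domain in remaining for domain in query_domains)


# Hypothetical data for a single query.
ranked_hits = ["t1", "t7", "t3", "t9"]
relevant = {"t1", "t3", "t5"}
precision, recall = precision_recall_at_k(ranked_hits, relevant, k=4)

same_order = recovers_architecture(["SH3", "SH2"], ["SH3", "Kinase", "SH2"])
swapped = recovers_architecture(["SH2", "SH3"], ["SH3", "Kinase", "SH2"])
```

In this toy case two of the four returned hits are relevant and one relevant target is missed. The order-preserving check accepts the first hit and rejects the second, which is one possible reading of "recovering a domain architecture"; a stricter definition could also require exact domain counts or boundaries.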

CONCLUSION

MDB is publicly available at http://csc.columbusstate.edu/carroll/MDB/.

