Bursteinas Borisas, Britto Ramona, Bely Benoit, Auchincloss Andrea, Rivoire Catherine, Redaschi Nicole, O'Donovan Claire, Martin Maria Jesus
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, Geneva 4 1211, Switzerland.
Database (Oxford). 2016 Dec 26;2016. doi: 10.1093/database/baw139. Print 2016.
Advances in high-throughput sequencing have led to unprecedented growth in the number of genome sequences submitted to biological databases. In particular, the sequencing of large numbers of nearly identical bacterial genomes during infection outbreaks and for other large-scale studies has resulted in a high level of redundancy in nucleotide databases and consequently in the UniProt Knowledgebase (UniProtKB). Redundancy negatively affects database searches by slowing them down, increasing statistical bias and making result analysis cumbersome. Combined with the large data volume, redundancy increases the computational costs for most reuses of UniProtKB data. All of this poses challenges for effective discovery in this wealth of data. With the continuing development of sequencing technologies, it is clear that finding ways to minimize redundancy is crucial to maintaining UniProt's essential contribution to data interpretation by our users. We have developed a methodology to identify and remove highly redundant proteomes from UniProtKB. The procedure identifies redundant proteomes by performing pairwise alignments of sets of sequences for pairs of proteomes and subsequently applies graph theory to find dominating sets that provide a set of non-redundant proteomes with a minimal loss of information. This method was implemented for bacteria in mid-2015, resulting in the removal of 50 million proteins from UniProtKB. With every new release, this procedure is used to filter new incoming proteomes, resulting in a more scalable and scientifically valuable growth of UniProtKB. Database URL: http://www.uniprot.org/proteomes/.
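The dominating-set step described in the abstract can be illustrated with a small sketch. This is not the UniProt implementation; the abstract does not give the algorithm's details. The sketch assumes a greedy approximation of a minimum dominating set, substitutes exact overlap of protein identifiers for the pairwise sequence alignments, and uses a hypothetical similarity threshold and made-up proteome names.

```python
from itertools import combinations


def redundancy_graph(proteomes, threshold):
    """Link two proteomes when the shared fraction of the smaller one
    reaches `threshold`. Set overlap is an assumed stand-in for the
    pairwise sequence alignments mentioned in the abstract."""
    graph = {name: set() for name in proteomes}
    for a, b in combinations(proteomes, 2):
        shared = len(proteomes[a] & proteomes[b])
        smaller = min(len(proteomes[a]), len(proteomes[b]))
        if smaller and shared / smaller >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_dominating_set(graph):
    """Greedy approximation: repeatedly keep the proteome that covers
    the most still-uncovered proteomes (itself plus its neighbours)."""
    uncovered = set(graph)
    kept = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(graph),
                   key=lambda n: len(({n} | graph[n]) & uncovered))
        kept.append(best)
        uncovered -= {best} | graph[best]
    return kept


# Hypothetical toy data: three near-identical strains and one distinct proteome.
toy_proteomes = {
    "strain_A": {"p1", "p2", "p3", "p4"},
    "strain_B": {"p1", "p2", "p3", "p4"},
    "strain_C": {"p1", "p2", "p3", "p5"},
    "strain_D": {"q1", "q2", "q3"},
}
toy_graph = redundancy_graph(toy_proteomes, threshold=0.75)
representatives = greedy_dominating_set(toy_graph)
```

In this toy case the three near-identical strains collapse to a single representative, while the unrelated proteome is retained because nothing else covers it. Every removed proteome is adjacent to a kept one, which is the property that limits information loss.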