Filtering redundancies for sequence similarity search programs.

Author information

Cantalloube Hubert, Chomilier Jacques, Chiusa Sylvain, Lonquety Mathieu, Spadoni Jean-Louis, Zagury Jean-François

Affiliation

Groupe Bioinformatique, Génomique et Traitement des Pathologies du Système Immunitaire, INSERM EMI0355, 15 rue de l'Ecole de Médecine, 75006 Paris, France.

Publication information

J Biomol Struct Dyn. 2005 Feb;22(4):487-92. doi: 10.1080/07391102.2005.10507020.

DOI:10.1080/07391102.2005.10507020

PMID:15588112

Abstract

Database scanning programs such as BLAST and FASTA are now used by most biologists for the post-genomic processing of DNA or protein sequence information, in particular to retrieve the structure or function of uncharacterized proteins. Unfortunately, their results can be polluted by identical alignments (called redundancies) that arise when the same protein or DNA sequence is present in different entries of the database. This makes it difficult to use the listed alignments efficiently. Pretreatment of databases has been proposed to suppress strictly identical entries. However, many identical alignments still remain, because redundancies may occur locally, either for entries corresponding to various fragments of the same sequence or for entries corresponding to highly similar sequences that differ by only a few residues, such as orthologous proteins. In the present work, we show that redundant alignments can indeed be numerous even when working with a pretreated non-redundant database, reaching as much as 60% of the output results depending on the query and the database. The accuracy and efficiency of post-genomic work would therefore be greatly increased if these redundancies were removed. To solve this previously unaddressed problem, we have developed an algorithm that allows the efficient and safe suppression of all redundancies with no loss of information. This algorithm is based on various filtering steps, which we describe here in the context of the Automat similarity search program; such an algorithm should also be added to other similarity search programs (BLAST, FASTA, etc.).
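
The abstract does not spell out the filtering steps, so the following is only a minimal sketch of the general idea, not the authors' algorithm or any part of Automat. It assumes a hypothetical `Hit` record and treats two hits as redundant only when they cover the same query region with exactly the same aligned subject residues. One representative is kept per group, and every entry identifier is retained so that no information is lost.

```python
from collections import OrderedDict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical alignment record (field names are assumptions)."""
    entry_id: str      # database entry the alignment came from
    query_start: int   # start of the aligned region on the query
    query_end: int     # inclusive end of the aligned region on the query
    subject_seq: str   # aligned subject residues, gaps included
    score: float


def collapse_identical_alignments(hits):
    """Group hits whose query region and aligned residues are identical.

    Returns a list of (representative_hit, entry_ids) pairs, in order of
    first appearance. entry_ids lists every entry that produced the same
    alignment, so no identifier is discarded.
    """
    groups = OrderedDict()
    for hit in hits:
        key = (hit.query_start, hit.query_end, hit.subject_seq.upper())
        groups.setdefault(key, []).append(hit)

    collapsed = []
    for members in groups.values():
        # Keep the best-scoring member; ties keep the first one seen.
        representative = max(members, key=lambda h: h.score)
        collapsed.append((representative, [h.entry_id for h in members]))
    return collapsed


if __name__ == "__main__":
    example = [
        Hit("entryA", 10, 15, "MKVLAT", 31.0),
        Hit("entryB", 10, 15, "MKVLAT", 31.0),  # same alignment, other entry
        Hit("entryC", 10, 15, "MKVLST", 27.5),  # differs by one residue
    ]
    for rep, ids in collapse_identical_alignments(example):
        print(rep.entry_id, rep.subject_seq, ids)
```

Exact matching is the conservative choice here: the alignment from the hypothetical `entryC` differs by one residue and is therefore kept as a separate result rather than merged.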