Database similarity searches.

Author information

Plewniak Frédéric

Affiliation

Plate-forme Bio-informatique de Strasbourg, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, UMR 7104 - CNRS - Inserm - ULP, France.

Publication information

Methods Mol Biol. 2008;484:361-78. doi: 10.1007/978-1-59745-398-1_24.

DOI:10.1007/978-1-59745-398-1_24

PMID:18592192

Abstract

As genome sequencing projects produce huge amounts of sequence data, database sequence similarity searching has become a central bioinformatics tool for identifying potentially homologous sequences. It is widely used as an initial step in sequence characterization and annotation, phylogeny, genomics, transcriptomics, and proteomics studies. Database similarity searching relies on the same sequence alignment methods used in pairwise sequence comparison. An alignment can be global (covering the whole sequences) or local (covering only parts of them), and algorithms exist that find the optimal alignment under given comparison criteria. However, because a database search must compare the query sequence with every sequence in the database, heuristic algorithms have been designed to reduce the time needed to build an alignment that has a reasonable chance of being the best one. These algorithms have been implemented as fast and efficient programs (Blast, FastA), which come in several variants that address different kinds of problems. After searching the appropriate database, similarity search programs produce a list of similar sequences and local alignments. These results should be examined carefully before drawing any conclusion, as many traps await the similarity seeker: paralogues, multidomain proteins, pseudogenes, etc. This chapter presents points that should always be kept in mind when performing database similarity searches for various goals. It ends with a practical example of sequence characterization from a single protein database search using Blast.
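The distinction between global and local optimal alignment can be illustrated with a minimal dynamic-programming sketch. It is not taken from the chapter: the function name `alignment_score`, the scoring values (match +1, mismatch -1, linear gap penalty -2), and the example sequences are all assumptions chosen for illustration. The global mode follows the Needleman-Wunsch recurrence and the local mode the Smith-Waterman recurrence, and only the optimal score is computed, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score of sequences a and b.

    Illustrative scoring scheme (an assumption, not from the chapter):
    fixed match/mismatch scores and a linear gap penalty.
    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    """
    cols = len(b) + 1
    # First DP row: a global alignment pays for leading gaps,
    # a local alignment may start anywhere at no cost.
    prev = [0] * cols if local else [j * gap for j in range(cols)]
    best = 0  # best local score seen so far
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment can restart whenever the score drops below 0.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    # Global: score of aligning both sequences end to end.
    # Local: best score of any pair of subsequences.
    return best if local else prev[-1]


if __name__ == "__main__":
    query, subject = "ACGT", "TTACGTTT"
    print(alignment_score(query, subject))              # -4 (4 matches, 4 gaps)
    print(alignment_score(query, subject, local=True))  # 4 (exact match of the query)
```

The example shows why local alignment suits database searches: the short query is found intact inside the longer sequence, whereas the global score is dominated by the gaps needed to span the whole subject. The cost of filling the table grows with the product of the two sequence lengths, which is the reason heuristic programs avoid doing this exhaustively for every database entry.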