Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases.

Author information

Wallqvist A, Fukunishi Y, Murphy L R, Fadel A, Levy R M

Affiliation

Department of Chemistry, Rutgers University, Wright-Rieman Laboratories, 610 Taylor Rd, Piscataway, NJ 08854-8087, USA.

Publication information

Bioinformatics. 2000 Nov;16(11):988-1002. doi: 10.1093/bioinformatics/16.11.988.

PMID:11159310

Abstract

MOTIVATION

Sequence alignment techniques have been developed into extremely powerful tools for identifying the folding families and function of proteins in newly sequenced genomes. For a sufficiently low sequence identity it is necessary to incorporate additional structural information to positively detect homologous proteins. We have carried out an extensive analysis of the effectiveness of incorporating secondary structure information directly into the alignments for fold recognition and identification of distant protein homologs. A secondary structure similarity matrix based on a database of three-dimensionally aligned proteins was first constructed. Dynamic programming was then applied iteratively, using linear combinations of amino acid and secondary structure sequence similarity scores. Initially, only primary sequence information is used. Subsequently, contributions from secondary structure are phased in, and new homologous proteins are positively identified if their scores are consistent with the predetermined error rate.
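The combined scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the secondary structure matrix values, the identity-based amino acid score (a stand-in for a real substitution matrix), the linear gap penalty, the mixing weight `lam`, its schedule, and the per-weight score thresholds are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical 3-state secondary structure similarity matrix
# (H = helix, E = strand, C = coil). Placeholder values only;
# this is NOT the matrix derived in the paper.
SS_SCORE: Dict[Tuple[str, str], float] = {
    ("H", "H"): 2.0, ("E", "E"): 3.0, ("C", "C"): 1.0,
    ("H", "E"): -4.0, ("H", "C"): -1.0, ("E", "C"): -1.0,
}

def ss_score(a: str, b: str) -> float:
    """Symmetric lookup in the toy secondary structure matrix."""
    return SS_SCORE.get((a, b), SS_SCORE.get((b, a)))

def aa_score(a: str, b: str) -> float:
    """Toy amino acid score; a real search would use a substitution matrix."""
    return 4.0 if a == b else -1.0

def combined_local_score(aa1: str, ss1: str, aa2: str, ss2: str,
                         lam: float, gap: float = 4.0) -> float:
    """Best Smith-Waterman local score where each aligned position scores
    (1 - lam) * amino acid similarity + lam * secondary structure similarity.
    Uses a linear gap penalty (an assumption made for brevity)."""
    assert len(aa1) == len(ss1) and len(aa2) == len(ss2)
    prev = [0.0] * (len(aa2) + 1)
    best = 0.0
    for i in range(1, len(aa1) + 1):
        curr = [0.0] * (len(aa2) + 1)
        for j in range(1, len(aa2) + 1):
            s = ((1 - lam) * aa_score(aa1[i - 1], aa2[j - 1])
                 + lam * ss_score(ss1[i - 1], ss2[j - 1]))
            curr[j] = max(0.0, prev[j - 1] + s, prev[j] - gap, curr[j - 1] - gap)
            best = max(best, curr[j])
        prev = curr
    return best

def iterative_search(query: Tuple[str, str],
                     database: Dict[str, Tuple[str, str]],
                     schedule: List[float],
                     threshold: Dict[float, float]) -> Dict[str, float]:
    """Start with sequence only (lam = 0), then phase in secondary structure.
    A target is accepted at the first weight where its score reaches the
    threshold assumed to correspond to the chosen error rate."""
    hits: Dict[str, float] = {}
    for lam in schedule:
        for name, (aa, ss) in database.items():
            if name in hits:
                continue
            score = combined_local_score(query[0], query[1], aa, ss, lam)
            if score >= threshold[lam]:
                hits[name] = lam
    return hits

# Toy usage: "distant" shares no residues with the query but has the same
# secondary structure, so it is only found once lam > 0.
db = {"close": ("AAAA", "HHHH"), "distant": ("GGGG", "HHHH"),
      "unrelated": ("GGGG", "EEEE")}
hits = iterative_search(("AAAA", "HHHH"), db, [0.0, 0.5], {0.0: 8.0, 0.5: 1.5})
print(hits)  # {'close': 0.0, 'distant': 0.5}
```

In this sketch the thresholds are fixed by hand; in practice each weight would need its own cutoff calibrated against a labelled set so that the accepted hits stay within the chosen error rate.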

RESULTS

We used the SCOP40 database, where only PDB sequences that have 40% sequence identity or less are included, to calibrate homology detection by the combined amino acid and secondary structure sequence alignments. Combining predicted secondary structure with sequence information results in an 8-15% increase in homology detection within SCOP40 relative to the pairwise alignments using only amino acid sequence data at an error rate of 0.01 errors per query; a 35% increase is observed when the actual secondary structure sequences are used. Incorporating predicted secondary structure information in the analysis of six small genomes yields an improvement in the homology detection of approximately 20% over SSEARCH pairwise alignments, but no improvement in the total number of homologs detected over PSI-BLAST, at an error rate of 0.01 errors per query. However, because the pairwise alignments based on combinations of amino acid and secondary structure similarity are different from those produced by PSI-BLAST and the error rates can be calibrated, it is possible to combine the results of both searches. An additional 25% relative improvement in the number of genes identified at an error rate of 0.01 is observed when the data are pooled in this way. Similarly, for the SCOP40 dataset, PSI-BLAST detected 15% of all possible homologs, whereas the pooled results increased the total number of homologs detected to 19%. These results are compared with recent reports of homology detection using sequence profiling methods.
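The two ideas in this paragraph, calibrating a score cutoff to a target number of errors per query and pooling hits from two calibrated searches, can be sketched as follows. This is an illustrative sketch only: the function names, the assumption that calibration scores are distinct, and the toy hit sets are hypothetical, and the abstract does not state how the error rate of the pooled list itself was controlled.

```python
from typing import List, Optional, Set, Tuple

Hit = Tuple[str, str]  # (query id, target id)

def calibrate_cutoff(scored: List[Tuple[float, bool]],
                     n_queries: int,
                     max_epq: float = 0.01) -> Optional[float]:
    """Find the lowest score cutoff that keeps false positives per query
    at or below max_epq on a labelled calibration set.

    scored: (score, is_true_homolog) pairs pooled over all calibration
    queries. Scores are assumed distinct (ties are not handled here).
    Returns None if even the top-scoring hit exceeds the error budget."""
    false_positives = 0
    cutoff: Optional[float] = None
    for score, is_true in sorted(scored, key=lambda pair: -pair[0]):
        if not is_true:
            false_positives += 1
            if false_positives / n_queries > max_epq:
                break  # accepting this hit would exceed the error budget
        cutoff = score
    return cutoff

def pool_hits(hits_a: Set[Hit], hits_b: Set[Hit]) -> Set[Hit]:
    """Union of hits accepted by two independently calibrated searches."""
    return hits_a | hits_b

def relative_gain(pooled: Set[Hit], baseline: Set[Hit]) -> float:
    """Fractional increase in the number of hits relative to a baseline search."""
    return (len(pooled) - len(baseline)) / len(baseline)

# Toy calibration data: with 100 queries and 0.01 errors per query,
# one false positive is tolerated before the cutoff is fixed.
calibration = [(10.0, True), (9.0, True), (8.0, False),
               (7.0, True), (6.0, False), (5.0, True)]
print(calibrate_cutoff(calibration, n_queries=100))  # 7.0

# Toy hit sets (made up for illustration, not results from the paper).
ss_hits = {("q1", "t1"), ("q1", "t2")}
profile_hits = {("q1", "t2"), ("q2", "t3"), ("q2", "t4"), ("q3", "t5")}
pooled = pool_hits(ss_hits, profile_hits)
print(len(pooled), relative_gain(pooled, profile_hits))  # 5 0.25
```

Pooling helps only when the two methods find different homologs; a plain union of two lists, each calibrated to 0.01 errors per query, can carry up to the sum of the two error rates, so a stricter per-method cutoff may be needed to keep the pooled list at the same target.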

AVAILABILITY

Secondary structure alignment homepage at http://lutece.rutgers.edu/ssas

CONTACT

anders@rutchem.rutgers.edu; ronlevy@lutece.rutgers.edu

SUPPLEMENTARY INFORMATION

Genome sequence/structure alignment results at http://lutece.rutgers.edu/ss_fold_predictions.
