Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements.

Authors

Schäffer A A, Aravind L, Madden T L, Shavirin S, Spouge J L, Wolf Y I, Koonin E V, Altschul S F

Affiliation

National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

Nucleic Acids Res. 2001 Jul 15;29(14):2994-3005. doi: 10.1093/nar/29.14.2994.

DOI:10.1093/nar/29.14.2994

PMID:11452024

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC55814/

Abstract

PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
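The abstract describes the ROC measure only as a retrieval score that can be normalized to lie between 0 and 1. As a rough illustration, the sketch below computes one common variant, a ROC score truncated at the first n false positives. The truncated form, the function name `roc_n`, its parameters, and the handling of short hit lists are all assumptions made here for illustration; the abstract does not say which variant or cutoff the authors used.

```python
def roc_n(ranked_labels, n, total_true):
    """Truncated ROC score in [0, 1] (illustrative sketch, not the paper's code).

    ranked_labels: booleans for the hit list in descending score order,
                   True for a true positive, False for a false positive.
    n:             number of false positives at which the curve is truncated.
    total_true:    number of true positives that exist in the database.
    """
    if n <= 0 or total_true <= 0:
        raise ValueError("n and total_true must be positive")

    true_seen = 0   # true positives ranked above the current position
    false_seen = 0  # false positives encountered so far
    area = 0        # sum over false positives of true positives ranked above

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives are seen, the
    # missing false positives are treated as ranking below every listed hit.
    area += (n - false_seen) * true_seen

    # Dividing by n * total_true maps the score onto [0, 1].
    return area / (n * total_true)


if __name__ == "__main__":
    # Two true positives, interleaved with two false positives.
    print(roc_n([True, False, True, False], n=2, total_true=2))
```

A score of 1 means every true positive outranks the first n false positives; a score of 0 means none does.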

Similar articles

IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices.

Bioinformatics. 1999 Dec;15(12):1000-11. doi: 10.1093/bioinformatics/15.12.1000.

Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches.

Bioinformatics. 2008 Jun 1;24(11):1339-43. doi: 10.1093/bioinformatics/btn130. Epub 2008 Apr 10.

Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST.

BMC Biol. 2006 Dec 7;4:41. doi: 10.1186/1741-7007-4-41.

Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation.

J Mol Recognit. 2005 Mar-Apr;18(2):139-49. doi: 10.1002/jmr.721.

Large-scale comparison of protein sequence alignment algorithms with structure alignments.

Proteins. 2000 Jul 1;40(1):6-22. doi: 10.1002/(sici)1097-0134(20000701)40:1<6::aid-prot30>3.0.co;2-7.

Domain enhanced lookup time accelerated BLAST.

Biol Direct. 2012 Apr 17;7:12. doi: 10.1186/1745-6150-7-12.

SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters.

BMC Bioinformatics. 2004 Oct 28;5:171. doi: 10.1186/1471-2105-5-171.

Code optimization of the subroutine to remove near identical matches in the sequence database homology search tool PSI-BLAST.

J Comput Biol. 2010 Jun;17(6):819-23. doi: 10.1089/cmb.2008.0053.

SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches.

Nucleic Acids Res. 2009 Jul;37(Web Server issue):W53-6. doi: 10.1093/nar/gkp301. Epub 2009 May 8.

Cited by

Gabija restricts phages that antagonize a conserved host DNA repair complex.

bioRxiv. 2025 Aug 30:2025.08.30.673261. doi: 10.1101/2025.08.30.673261.

An activator regulates the DNA damage response and anti-phage defense networks in Moraxellaceae.

Nucleic Acids Res. 2025 Aug 27;53(16). doi: 10.1093/nar/gkaf828.

DeepAIPs-SFLA: Deep Convolutional Model for Prediction of Anti-Inflammatory Peptides Using Binary Pattern Decomposition of Novel Multiview Descriptors with an SFLA Approach.

ACS Omega. 2025 Aug 5;10(32):35747-35762. doi: 10.1021/acsomega.5c02422. eCollection 2025 Aug 19.

Two distinct SWI/SNF complexes direct chromatin-linked transcriptional programs in .

bioRxiv. 2025 Jul 16:2025.07.16.665172. doi: 10.1101/2025.07.16.665172.

Complete genome sequences of two dye-decolorizing aeromonads isolated from a stormwater drain in Hong Kong.

Microbiol Resour Announc. 2025 Sep 11;14(9):e0052425. doi: 10.1128/mra.00524-25. Epub 2025 Jul 31.

Functional study of Phaeodactylum tricornutum Seipin highlights specificities of lipid droplets biogenesis in diatoms.

New Phytol. 2025 Sep;247(5):2245-2269. doi: 10.1111/nph.70350. Epub 2025 Jul 7.

Identification and characterization of the de novo methyltransferases for eukaryotic N6-methyladenine (6mA).

Sci Adv. 2025 May 16;11(20):eadq4623. doi: 10.1126/sciadv.adq4623. Epub 2025 May 14.

PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects.

Genome Biol. 2025 May 6;26(1):113. doi: 10.1186/s13059-025-03581-y.

Jumbo phage killer immune system targets early infection of nucleus-forming phages.

Cell. 2025 Apr 17;188(8):2127-2140.e21. doi: 10.1016/j.cell.2025.02.016. Epub 2025 Mar 19.

Comparative structural insights and functional analysis for the distinct unbound states of Human AGO proteins.

Sci Rep. 2025 Mar 19;15(1):9432. doi: 10.1038/s41598-025-91849-5.

References

Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores.

Bull Math Biol. 1992 Jan;54(1):59-75. doi: 10.1007/BF02458620.

Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching.

Comput Chem. 1996 Mar;20(1):25-33. doi: 10.1016/s0097-8485(96)80004-0.

Optimal sequence alignments.

Proc Natl Acad Sci U S A. 1983 Mar;80(5):1382-6. doi: 10.1073/pnas.80.5.1382.

An evolutionary classification of the metallo-beta-lactamase fold proteins.

In Silico Biol. 1999;1(2):69-91.

Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases.

Bioinformatics. 2000 Nov;16(11):988-1002. doi: 10.1093/bioinformatics/16.11.988.

The estimation of statistical parameters for local alignment score distributions.

Nucleic Acids Res. 2001 Jan 15;29(2):351-61. doi: 10.1093/nar/29.2.351.

Database resources of the National Center for Biotechnology Information.

Nucleic Acids Res. 2001 Jan 1;29(1):11-6. doi: 10.1093/nar/29.1.11.

Accurate formula for P-values of gapped local sequence and profile alignments.

J Mol Biol. 2000 Jul 14;300(3):649-59. doi: 10.1006/jmbi.2000.3875.

Large-scale comparison of protein sequence alignment algorithms with structure alignments.

Proteins. 2000 Jul 1;40(1):6-22. doi: 10.1002/(sici)1097-0134(20000701)40:1<6::aid-prot30>3.0.co;2-7.

Rapid assessment of extremal statistics for gapped local alignment.

Proc Int Conf Intell Syst Mol Biol. 1999:211-22.
