Subfamily specific conservation profiles for proteins based on n-gram patterns.

Author information

Vries John K, Liu Xiong

Affiliation

Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.

Publication information

BMC Bioinformatics. 2008 Jan 30;9:72. doi: 10.1186/1471-2105-9-72.

DOI:10.1186/1471-2105-9-72

PMID:18234090

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2267698/

Abstract

BACKGROUND

A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank ordered by similarity with respect to the query.
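As a rough illustration of the pattern definition above, the following Python sketch enumerates NP{n,m} patterns from a sequence and tallies them over a set of target sequences. It is not the authors' implementation. The function names `ngram_patterns` and `pattern_counts`, the `.` wildcard symbol, and the choice to allow wildcards at any position in the window (including the ends) are all assumptions made for this example; the abstract does not specify them. The singular value decomposition step is not shown.

```python
from collections import Counter
from itertools import combinations


def ngram_patterns(seq, n, m, wildcard="."):
    """Yield NP{n,m}-style patterns from seq (hypothetical helper).

    Every window of length n + m produces one pattern for each way of
    choosing which m positions are replaced by the wildcard symbol.
    Allowing wildcards at any position is an assumption of this sketch.
    """
    width = n + m
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for masked in combinations(range(width), m):
            yield "".join(
                wildcard if i in masked else residue
                for i, residue in enumerate(window)
            )


def pattern_counts(sequences, n, m):
    """Count patterns over a collection of sequences (hypothetical helper)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngram_patterns(seq, n, m))
    return counts
```

Under this reading, the "noise" would correspond to `pattern_counts` over all target sequences and the "signal" to the same count restricted to targets similar to the query; how the two are actually separated is left to the SVD step described in the abstract.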

RESULTS

The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach.

CONCLUSION

Subfamily specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/63a6/2267698/61f52f682155/1471-2105-9-72-1.jpg
