Automatic generation of primary sequence patterns from sets of related protein sequences.

Authors

Smith R F, Smith T F

Affiliation

Department of Biostatistics, Dana-Farber Cancer Institute, Boston, MA 02115.

Publication details

Proc Natl Acad Sci U S A. 1990 Jan;87(1):118-22. doi: 10.1073/pnas.87.1.118.

DOI: 10.1073/pnas.87.1.118

PMID: 2296575

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC53211/

Abstract

We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
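Step (ii) of the abstract, together with the gap-character rule, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class hierarchy, the class symbols, the "." pattern gap character, and the function names are all hypothetical, and the input is assumed to be an already aligned pair over the 20 standard residues. The regular-expression matcher at the end is a stand-in used only to show the "0 or 1 amino acid of any type" behavior; the paper handles gap characters inside its dynamic programming alignment.

```python
import re

# Hypothetical nested amino acid classes, for illustration only;
# this is NOT the hierarchy used in the paper.
CLASSES = {
    "a": set("ILVM"),                   # aliphatic
    "r": set("FWY"),                    # aromatic
    "h": set("ILVMFWY"),                # hydrophobic (contains a and r)
    "p": set("KRH"),                    # positively charged
    "n": set("DE"),                     # negatively charged
    "c": set("KRHDE"),                  # charged (contains p and n)
    "x": set("ACDEFGHIKLMNPQRSTVWY"),   # any residue (root class)
}
ALIGN_GAP = "-"    # gap symbol in the input alignment
PATTERN_GAP = "."  # assumed pattern gap character: 0 or 1 residue of any type


def members(symbol):
    """Residues covered by a pattern symbol (a class name or a single residue)."""
    return CLASSES.get(symbol, {symbol})


def cover(a, b):
    """Return the smallest symbol (residue or class) that covers both a and b."""
    needed = members(a) | members(b)
    if len(needed) == 1:
        return next(iter(needed))
    candidates = [name for name, residues in CLASSES.items() if needed <= residues]
    return min(candidates, key=lambda name: len(CLASSES[name]))


def merge_alignment(top, bottom):
    """Build one common pattern from two aligned sequences or patterns."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    gaps = {ALIGN_GAP, PATTERN_GAP}
    return "".join(
        PATTERN_GAP if s in gaps or t in gaps else cover(s, t)
        for s, t in zip(top, bottom)
    )


def to_regex(pattern):
    """Translate a pattern into a regular expression (illustrative matcher only)."""
    parts = []
    for symbol in pattern:
        if symbol == PATTERN_GAP:
            parts.append("[A-Z]?")  # gap character: zero or one residue
        else:
            parts.append("[" + "".join(sorted(members(symbol))) + "]")
    return "".join(parts)


first = merge_alignment("ILKDF", "VLR-Y")   # two aligned sequences
second = merge_alignment(first, "FLDAW")    # pattern merged with a third sequence
print(first, second)
print(bool(re.fullmatch(to_regex(first), "VLKF")))
```

With these assumed classes, the first merge yields `aLp.r` and the second yields `hLc.r`: each column widens only as far as the smallest class that still covers every residue seen so far, and a gapped column stays a gap character in later merges. Repeating this merge from the most similar pair up to the root of the dendrogram would give a single covering pattern for the family.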


