Dima Ruxandra I, Thirumalai D
Institute for Physical Science and Technology, University of Maryland, College Park, MD 20742, USA.
Bioinformatics. 2004 Oct 12;20(15):2345-54. doi: 10.1093/bioinformatics/bth245. Epub 2004 Apr 8.
The function of a protein, or of a network of interacting proteins, often involves communication between residues that are well separated in sequence. The classic example is the participation of distant residues in allosteric regulation. Bioinformatic and structural analysis methods have been introduced to infer which residues are correlated. Recently, increasing attention has been paid to identifying the sequence properties that determine the tendency of disease-related proteins (Abeta peptides, prion proteins, transthyretin, etc.) to aggregate and form fibrils. Motivated in part by the need to identify sequence characteristics that indicate a tendency to aggregate, we introduce a general method that probes covariations in charged residues along the sequence in a given protein family. The method involves computing the sequence correlation entropy (SCE) from the quenched probability P(sk)(i,j) of finding a residue pair at a given sequence separation, sk, and allows us to classify protein families in terms of their SCE. Our general approach may be a useful way of obtaining evolutionary covariations of amino acid residues on a genome-wide level.
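The abstract does not give the formula for the SCE, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the pair probability is estimated by counting, over all sequences in a family, how often a charged pair of a given type occurs at each sequence separation, and that the SCE is a Shannon-type entropy of that distribution. The function names, the choice of K/R as positive and D/E as negative residues (histidine is ignored), and the pooling over sequences are all illustrative assumptions.

```python
import math
from collections import Counter

# Assumption: K/R count as positive, D/E as negative; histidine is ignored.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge(residue):
    """Return '+' or '-' for a charged residue, or None otherwise."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return None


def pair_separation_probabilities(sequences, pair=("+", "-")):
    """Estimate the probability of each sequence separation for one
    ordered charged-pair type, pooled over all sequences in a family.

    Hypothetical helper: the pooling and normalization are assumptions.
    """
    counts = Counter()
    for seq in sequences:
        # Positions and signs of all charged residues in this sequence.
        charged = [(pos, charge(res)) for pos, res in enumerate(seq)
                   if charge(res) is not None]
        for a in range(len(charged)):
            for b in range(a + 1, len(charged)):
                (pos_i, sign_i), (pos_j, sign_j) = charged[a], charged[b]
                if (sign_i, sign_j) == pair:
                    counts[pos_j - pos_i] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sep: n / total for sep, n in counts.items()}


def sequence_correlation_entropy(probabilities):
    """Shannon-type entropy (natural log) of a separation distribution.

    Assumption: the paper's SCE may be defined or normalized differently.
    """
    return -sum(p * math.log(p) for p in probabilities.values() if p > 0)


if __name__ == "__main__":
    family = ["KAD", "KAAD"]  # toy sequences, not real data
    probs = pair_separation_probabilities(family)
    print(probs, sequence_correlation_entropy(probs))
```

Under these assumptions, a family whose charged pairs recur at only a few preferred separations gives a sharply peaked distribution and therefore a low entropy, while a family with no preferred separations gives a high one.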
We use a combination of SCE and clustering based on principal component analysis to classify the protein families. From an analysis of 839 families, covering approximately 500,000 sequences, we find that proteins with relatively low values of SCE are predominantly associated with various diseases. In several families, residues that give rise to peaks in P(sk)(i,j) are clustered in the three-dimensional structure. For the class of proteins with low SCE values, there are significant numbers of mixed charged-hydrophobic (CH) and charged-polar (CP) runs. Our findings suggest that low values of SCE and the presence of (CH) and/or (CP) runs may be indicative of disease association or a tendency to aggregate. Our results led to the hypothesis that the functions of proteins with similar SCE values may be linked. The hypothesis is validated with a few anecdotal examples. The present results also lead to the prediction that the overall charge correlations in proteins affect the kinetics of amyloid formation, a feature that is common to all proteins implicated in neurodegenerative diseases.