Proteome Sci. 2013 Nov 7;11(Suppl 1):S8. doi: 10.1186/1477-5956-11-S1-S8.
Discovering sequence patterns with variation can unveil functions of a protein family that are important for drug discovery. Exploring protein families with existing methods such as multiple sequence alignment is computationally expensive, so pattern search, known in bioinformatics as motif finding, is used instead. At present, however, combinatorial algorithms produce large sets of solutions, and probabilistic models require a richer representation of the amino acid associations. To overcome these shortcomings, we present a method for ranking and compacting these solutions in a new representation referred to as Aligned Pattern Clusters (APCs). To tackle the problem of a large solution set, our method reveals a reduced set of candidate solutions without losing any information. To address the problem of representation, our method captures the amino acid associations and conservations of the aligned patterns. Our algorithm renders a set of APCs in which a set of patterns is discovered, pruned, aligned, and synthesized from the input sequences of a protein family.
Our algorithm identifies the binding or other functional segments and their embedded residues, which are important drug targets, in the cytochrome c and ubiquitin protein families taken from UniProt. The results are independently confirmed by pFam's multiple sequence alignment. For the cytochrome c protein, the number of resulting patterns with variations is reduced by 76.62% from the number of original patterns without variations. Furthermore, all of the top four candidate APCs correspond to binding segments, each with one of its conserved amino acids as the binding residue. The discovered proximal APCs agree with pFam and PROSITE results. Surprisingly, the distal binding site discovered by our algorithm is found by neither pFam nor PROSITE but is confirmed by the three-dimensional cytochrome c structure. When applied to the ubiquitin protein family, our results agree with pFam and reveal six of the seven lysine binding residues as conserved aligned columns with an entropy redundancy measure of 1.0.
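To make the "entropy redundancy measure of 1.0" concrete, the following is a minimal sketch of how per-column conservation of an aligned pattern cluster might be scored. It assumes redundancy is defined as one minus the Shannon entropy of the column divided by the maximum entropy over a 20-letter amino acid alphabet; the abstract does not state the exact formula, so this normalization is an assumption. The function names and the toy patterns are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


def column_redundancy(column, alphabet_size=20):
    """Return 1 - H/H_max for one aligned column (assumed definition).

    A fully conserved column scores 1.0; a column spread uniformly
    over the whole alphabet scores 0.0.
    """
    counts = Counter(column)
    total = len(column)
    # Shannon entropy of the observed residue distribution (natural log).
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(alphabet_size)


def apc_column_redundancies(patterns, alphabet_size=20):
    """Score every column of a set of equal-length, gap-free aligned patterns."""
    if len({len(p) for p in patterns}) != 1:
        raise ValueError("aligned patterns must all have the same length")
    # zip(*patterns) transposes the rows (patterns) into columns.
    return [column_redundancy(col, alphabet_size) for col in zip(*patterns)]


# Hypothetical toy cluster, loosely shaped like a CXXCH heme-binding motif.
toy_patterns = ["CAACH", "CSACH", "CAGCH"]
redundancies = apc_column_redundancies(toy_patterns)
```

In this toy cluster the first, fourth, and fifth columns contain a single residue each and therefore score exactly 1.0, while the two variable columns score strictly between 0 and 1. Gaps and wildcard positions are ignored here for brevity; a fuller treatment would need a policy for them.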
The discovery, ranking, reduction, and representation of a set of patterns are important for avoiding time-consuming and expensive simulations and experiments during proteomic study and drug discovery.