BMC Bioinformatics. 2014;15 Suppl 12(Suppl 12):S2. doi: 10.1186/1471-2105-15-S12-S2. Epub 2014 Nov 6.
The large influx of biological sequences makes it increasingly important to identify and correlate conserved regions in homologous sequences in order to acquire valuable biological knowledge. These conserved regions contain statistically significant residue associations that appear as sequence patterns. Patterns from two conserved regions that co-occur frequently on the same sequences are therefore inferred to have joint functionality. We propose a method for finding conserved regions in protein families with frequent co-occurrence patterns. The biological significance of the discovered clusters of conserved regions with co-occurrence patterns can be validated by the three-dimensional closeness of their amino acids and by the biological functionality of those regions as supported by published work.
Using existing algorithms, we discovered statistically significant amino acid associations as sequence patterns. We then aligned and clustered them into Aligned Pattern Clusters (APCs), which correspond to conserved regions with amino acid conservation and variation. When one APC frequently co-occurs with another APC on the same sequences, the two APCs are said to have high co-occurrence. We then clustered APCs with high co-occurrence into what we refer to as Co-occurrence APC Clusters (Co-occurrence Clusters).
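The grouping step can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this abstract: each APC is represented by the set of sequence identifiers on which its patterns occur, co-occurrence is scored with a Jaccard index, and APCs are grouped by single-linkage over a fixed threshold. The names `jaccard`, `cooccurrence_clusters`, `threshold`, and the toy APC data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets of sequence IDs (assumed co-occurrence score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrence_clusters(apc_to_seqs, threshold=0.5):
    """Group APCs whose co-occurrence score reaches `threshold`.

    Uses a small union-find structure, so two APCs end up in the same
    cluster if they are linked by a chain of high-scoring pairs
    (single linkage). This is an illustrative choice, not necessarily
    the clustering used in the paper.
    """
    parent = {name: name for name in apc_to_seqs}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(apc_to_seqs, 2):
        if jaccard(apc_to_seqs[a], apc_to_seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in apc_to_seqs:
        clusters.setdefault(find(name), set()).add(name)

    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical toy data: APC name -> IDs of sequences containing its patterns.
apcs = {
    "APC1": {"s1", "s2", "s3", "s4"},
    "APC2": {"s1", "s2", "s3"},
    "APC3": {"s5", "s6"},
    "APC4": {"s5", "s6", "s7"},
    "APC5": {"s8"},
}

for cluster in cooccurrence_clusters(apcs, threshold=0.5):
    print(cluster)
```

In this toy input, APC1 and APC2 share three of four sequences and APC3 and APC4 share two of three, so each pair forms its own cluster, while APC5 stays alone.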
Our results show that, for Co-occurrence Clusters, the three-dimensional distance between their amino acids is smaller than the average distance between amino acids. For the Co-occurrence Clusters of the ubiquitin and cytochrome c families, we observed biological significance among the amino acids residing in the APCs of the same cluster. In ubiquitin, the residues are responsible for ubiquitination as well as for conventional and unconventional ubiquitin binding. In cytochrome c, amino acids in the first Co-occurrence Cluster contribute to the binding of other proteins in the electron transport chain, and amino acids in the second Co-occurrence Cluster contribute to the stability of the axial heme ligand.
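The distance comparison can be sketched as follows, assuming that each residue is reduced to a single three-dimensional coordinate (for example a C-alpha position taken from a structure file) and that "average distance" means the mean over all residue pairs. Both choices are assumptions made for illustration; the function names and coordinates below are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(coords):
    """Mean Euclidean distance over all pairs of 3D points."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("need at least two residues")
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def cluster_vs_background(coords_by_residue, cluster_residues):
    """Return (mean distance within the cluster, mean distance over all residues)."""
    cluster = [coords_by_residue[r] for r in cluster_residues]
    background = list(coords_by_residue.values())
    return mean_pairwise_distance(cluster), mean_pairwise_distance(background)


# Hypothetical coordinates, keyed by residue number.
coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.0, 4.0, 0.0),
    3: (0.0, 0.0, 10.0),
    4: (30.0, 0.0, 0.0),
}

within, overall = cluster_vs_background(coords, [1, 2])
print(f"within cluster: {within:.2f}, all residues: {overall:.2f}")
```

A cluster whose within-cluster mean falls below the all-residue mean would be consistent with the closeness reported above; a real analysis would also need a significance test, which this sketch omits.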
Thus, our co-occurrence clustering algorithm can efficiently find and rank conserved regions containing patterns that frequently co-occur on the same proteins. Co-occurring patterns are biologically significant, as shown by their three-dimensional closeness and by other evidence reported in the literature. These results can play an important role in drug discovery, since biologists can quickly identify drug targets for detailed preclinical studies.