Boari de Lima Elisa, Meira Wagner, Melo-Minardi Raquel Cardoso de
Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
PLoS Comput Biol. 2016 Jun 27;12(6):e1005001. doi: 10.1371/journal.pcbi.1005001. eCollection 2016 Jun.
As more and more genomes are sequenced, the vast majority of proteins can only be annotated computationally, since experimental investigation is extremely costly. This highlights the need for computational methods that determine protein functions quickly and reliably. We believe that dividing a protein family into subtypes that share specific functions uncommon to the whole family reduces the complexity of the function annotation problem. Hence, the purpose of this work is to detect isofunctional subfamilies inside a family of unknown function, while identifying the residues that differentiate them. Similarity between protein pairs according to various properties is interpreted as evidence of functional similarity. The data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Our fully automated technique obtained better clusters than ASMC for two families and equivalent results for two others, including one whose clusters had been manually defined. Clusters produced by our framework corresponded closely to the known subfamilies and were more distinct from one another than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, those residues were among the ones our technique considered most important for differentiating a given group. When the framework was run on the crotonase and enolase superfamilies from SFLD, the results agreed closely with this gold standard. The best results consistently involved multiple data types, confirming our hypothesis that similarities according to different knowledge domains may be used as evidence of functional similarity.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; the use of domain knowledge to detect subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
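The integrate-then-cluster idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the combination function evolved by genetic programming, so a fixed weighted average stands in for it here, and the two-way split uses the sign of the Fiedler vector as the simplest form of spectral clustering. All names (`combine_similarities`, `spectral_bipartition`, `sequence_sim`, `structure_sim`), the weights, and the toy similarity values are hypothetical.

```python
import numpy as np


def combine_similarities(matrices, weights):
    """Weighted average of pairwise similarity matrices.

    Assumption: a fixed linear combination stands in for the
    combination function that genetic programming would evolve.
    """
    w = np.asarray(weights, dtype=float)
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    combined = np.tensordot(w / w.sum(), stacked, axes=1)
    return (combined + combined.T) / 2.0  # enforce symmetry


def spectral_bipartition(similarity):
    """Split items into two clusters with the Fiedler vector.

    Uses the symmetric normalized Laplacian L = I - D^-1/2 S D^-1/2
    and assigns clusters by the sign of its second eigenvector.
    """
    s = np.array(similarity, dtype=float)
    np.fill_diagonal(s, 0.0)  # ignore self-similarity
    d_inv_sqrt = 1.0 / np.sqrt(s.sum(axis=1))
    laplacian = np.eye(len(s)) - d_inv_sqrt[:, None] * s * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues ascending
    fiedler = eigenvectors[:, 1]
    labels = (fiedler > 0).astype(int)
    # Eigenvector sign is arbitrary, so fix item 0 to cluster 0.
    return 1 - labels if labels[0] == 1 else labels


# Toy data: six hypothetical proteins in two subfamilies of three.
groups = np.array([0, 0, 0, 1, 1, 1])
same = groups[:, None] == groups[None, :]
sequence_sim = np.where(same, 0.9, 0.2)
structure_sim = np.where(same, 0.8, 0.1)
# One noisy cross-subfamily entry in the second data type.
structure_sim[0, 3] = structure_sim[3, 0] = 0.5

combined = combine_similarities([sequence_sim, structure_sim], [0.6, 0.4])
labels = spectral_bipartition(combined)
print(labels)
```

In this toy case the noisy entry appears in only one data type, so averaging it with the other dilutes it before clustering. That is the intuition behind integrating several kinds of evidence, though the real framework learns how to combine them rather than using fixed weights.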