Ghouila Amel, Florent Isabelle, Guerfali Fatma Zahra, Terrapon Nicolas, Laouini Dhafer, Yahia Sadok Ben, Gascuel Olivier, Bréhélin Laurent
Institut de Biologie Computationnelle, LIRMM, CNRS, Univ. Montpellier 2, Montpellier, France; Computer Science Department, Faculty of Sciences of Tunis, Tunis, Tunisia.
Centre National de la Recherche Scientifique/Muséum National d'Histoire Naturelle, UMR7245 CNRS-MNHN, Molécules de Communication et Adaptation des Micro-organismes, Adaptation des Protozoaires à leur Environnement, Paris, France.
PLoS One. 2014 Jun 5;9(6):e95275. doi: 10.1371/journal.pone.0095275. eCollection 2014.
Identification of protein domains is a key step in understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs that are widely used to annotate proteins in sequenced organisms. This annotation is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and have proved to be more sensitive than sequence/HMM approaches in certain cases. However, these methods are usually not used for protein domain discovery at the genome scale, and the benefit that could be expected from using them for this problem has not been investigated. Using proteins of Plasmodium falciparum and Leishmania major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement to sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence (the general tendency of a domain to appear preferentially alongside certain favorite domains in proteins) to improve the accuracy of the approach. We show that combining HMM/HMM comparisons with co-occurrence domain detection improves protein annotation. At an estimated False Discovery Rate of 5%, the combined approach revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of some of these predictions shows that they include several domain families that were missing in the two organisms. All new domain occurrences have been integrated into the EuPathDomains database, along with the GO annotations that can be deduced from them.
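The co-occurrence idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `certify_by_cooccurrence`, the input shapes, the placeholder identifiers (`DOM_A`, `protA`, and so on), and the simple "at least one co-occurring partner" rule are all assumptions made for illustration. The sketch only shows how a permissive candidate list (for example from HMM/HMM comparisons) might be filtered using domains already annotated in the same protein; it does not estimate a False Discovery Rate.

```python
def certify_by_cooccurrence(candidates, known_domains, cooccurring_pairs):
    """Keep candidate hits supported by a known domain in the same protein.

    candidates: iterable of (protein_id, domain_id) from a permissive search.
    known_domains: dict mapping protein_id -> set of already annotated domains.
    cooccurring_pairs: set of frozenset({a, b}) pairs assumed to co-occur
        (hypothetical input; how such pairs are derived is not shown here).
    """
    accepted = []
    for protein_id, domain_id in candidates:
        partners = known_domains.get(protein_id, set())
        # Accept the candidate if any already-known domain in this protein
        # forms a co-occurring pair with it.
        supported = any(
            frozenset((domain_id, partner)) in cooccurring_pairs
            for partner in partners
            if partner != domain_id
        )
        if supported:
            accepted.append((protein_id, domain_id))
    return accepted


# Illustrative, made-up data (placeholder identifiers, not real annotations).
known = {"protA": {"DOM_A"}, "protB": {"DOM_C"}}
pairs = {frozenset(("DOM_A", "DOM_B"))}
hits = [("protA", "DOM_B"), ("protB", "DOM_B"), ("protC", "DOM_B")]

certified = certify_by_cooccurrence(hits, known, pairs)
print(certified)
```

In this toy input only the hit in `protA` is kept, because `protA` already carries a domain that is listed as co-occurring with the candidate; the hits in `protB` and `protC` have no such support and are discarded.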