Méthodes et Algorithmes pour la Bioinformatique, LIRMM, Université Montpellier 2, France.
BMC Bioinformatics. 2012 May 1;13:67. doi: 10.1186/1471-2105-13-67.
Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs that are widely used to annotate proteins in newly sequenced organisms. In Pfam, each domain family is represented by a curated multiple sequence alignment from which a profile HMM is built. Despite their high specificity, HMMs may lack sensitivity when searching for domains in divergent organisms. This is particularly the case for species with a biased amino-acid composition, such as Plasmodium falciparum, the main causal agent of human malaria. In this context, fitting HMMs to the specificities of the target proteome can help identify additional domains.
Using P. falciparum as an example, we compare approaches that have been proposed for this problem, and present two alternative methods. Because previous attempts rely strongly on known domain occurrences in the target species or its close relatives, they mainly improve the detection of domains that belong to already identified families. Our methods learn global correction rules that adjust the amino-acid distributions associated with the match states of HMMs. These rules are applied to all match states of the whole HMM library, thus enabling the detection of domains from families previously absent from the target proteome. Additionally, we propose a procedure to estimate the proportion of false positives among the newly discovered domains. Starting with the standard Pfam library, we build several new libraries with the different HMM-fitting approaches. These libraries are first used to detect new domain occurrences with low E-values. Second, by applying the Co-Occurrence Domain Discovery (CODD) procedure we recently proposed, the libraries are further used to identify likely occurrences among potential domains with higher E-values.
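To make the idea of a global correction rule concrete, the following is a minimal sketch of one rule applied uniformly to every match state of every HMM in a library. The rule itself is an assumption made for illustration: each emission probability is rescaled by the ratio of the target proteome's background frequency to a reference background frequency, then renormalized. The abstract does not specify the form of the learned rules, so this simple reweighting should not be read as the authors' method. The names `reweight_match_state` and `fit_library`, and the dictionary-based representation of an HMM, are hypothetical.

```python
def reweight_match_state(emissions, ref_background, target_background):
    """Rescale one match-state emission distribution and renormalize.

    ASSUMPTION: the correction is a plain background-ratio reweighting,
    p'(a) proportional to p(a) * target(a) / ref(a). This is an
    illustrative stand-in, not the rule learned in the paper.
    """
    scaled = {
        aa: p * target_background[aa] / ref_background[aa]
        for aa, p in emissions.items()
    }
    total = sum(scaled.values())
    return {aa: value / total for aa, value in scaled.items()}


def fit_library(library, ref_background, target_background):
    """Apply the same correction to all match states of all HMMs.

    `library` maps a family name to a list of match-state emission
    dictionaries (a hypothetical in-memory layout, not a Pfam format).
    """
    return {
        family: [
            reweight_match_state(state, ref_background, target_background)
            for state in match_states
        ]
        for family, match_states in library.items()
    }


# Toy three-letter alphabet, for illustration only.
ref = {"A": 0.5, "N": 0.25, "K": 0.25}
target = {"A": 0.2, "N": 0.4, "K": 0.4}
toy_library = {"toy_family": [{"A": 0.5, "N": 0.25, "K": 0.25}]}

fitted = fit_library(toy_library, ref, target)
```

Because the same rule is applied to the whole library, a family with no known occurrence in the target species is adjusted in exactly the same way as a well-represented one, which is the property the paragraph above relies on.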
We show that the new approaches allow the identification of several domain families previously absent from the P. falciparum proteome and the Apicomplexa phylum, and identify many domains that are not detected by previous approaches. In terms of the number of newly discovered domains, the new approaches outperform the previous ones when no closely related species are available or when they are used to identify likely occurrences among potential domains with high E-values. All predictions on P. falciparum have been integrated into a dedicated website that pools all known and new annotations of protein domains and functions for this organism. Software implementing the two proposed approaches is available at the same address: http://www.lirmm.fr/~terrapon/HMMfit/