Stanke Mario, Schöffmann Oliver, Morgenstern Burkhard, Waack Stephan
Institut für Mikrobiologie und Genetik, Universität Göttingen, Göttingen, Germany.
BMC Bioinformatics. 2006 Feb 9;7:62. doi: 10.1186/1471-2105-7-62.
To improve gene prediction, extrinsic evidence on gene structure can be collected from various sources, such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and uncertain: it is usually not sufficient to recover the complete structure of every gene, and the evidence that is available is often unreliable. Extrinsic evidence is therefore most valuable when it is balanced with sequence-intrinsic evidence.
We present a fairly general method for integrating external information. Our method is based on the evaluation of hints about potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS into a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protein database, but our approach can be used to include arbitrary user-defined hints. Our method is only moderately affected by the length of a database match. Further, it exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g. if the positions of certain exons are known. With hints from EST and protein databases, our new approach was able to predict 89% of the exons in human chromosome 22 correctly.
Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.
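The difference between compound and per-position treatment of a database match can be illustrated with a small sketch. This is not the AUGUSTUS+ model. The function names, the `bonus`, `malus` and `per_base_bonus` parameters and their values are assumptions chosen for illustration, and support is simplified to plain interval overlap. In the compound variant a candidate exon receives one multiplicative factor if any hint supports it and a penalty if none does, so the absence of a match also carries information. In the per-position variant every covered base contributes, so the score grows with match length. The compound variant here is fully independent of match length, which is a simplification of the "only moderately affected" behavior described above.

```python
import math
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval [start, end] in sequence coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def compound_log_score(exon: Interval, intrinsic_log_prob: float,
                       hints: List[Interval],
                       bonus: float = 10.0, malus: float = 0.5) -> float:
    """Treat a hint as one piece of evidence about the whole exon.

    A supported exon is rewarded once (hypothetical `bonus` > 1); an exon
    without any supporting hint is penalized once (hypothetical `malus` < 1).
    """
    supported = any(overlaps(exon, hint) for hint in hints)
    return intrinsic_log_prob + math.log(bonus if supported else malus)


def per_position_log_score(exon: Interval, intrinsic_log_prob: float,
                           hints: List[Interval],
                           per_base_bonus: float = 1.1) -> float:
    """Treat a hint as independent evidence at every covered base.

    The reward scales with the number of exon positions covered by hints.
    """
    covered = sum(
        1
        for pos in range(exon[0], exon[1] + 1)
        if any(hint[0] <= pos <= hint[1] for hint in hints)
    )
    return intrinsic_log_prob + covered * math.log(per_base_bonus)


if __name__ == "__main__":
    exon = (100, 299)
    short_match = [(100, 119)]   # 20 bases of the exon are matched
    long_match = [(100, 299)]    # the whole exon is matched
    for name, hints in (("short", short_match), ("long", long_match)):
        print(name,
              compound_log_score(exon, -50.0, hints),
              per_position_log_score(exon, -50.0, hints))
```

With the compound score, the short and the long match change the exon score by the same amount, whereas the per-position score rewards the long match far more strongly simply because it is longer.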