Djebali Sarah, Delaplace Franck, Roest Crollius Hugues
Dyogen Lab, CNRS UMR8541, Ecole Normale Supérieure, 46 rue d'Ulm, 75005 Paris, France.
Genome Biol. 2006;7 Suppl 1(Suppl 1):S7.1-10. doi: 10.1186/gb-2006-7-s1-s7. Epub 2006 Aug 7.
Accurate, automatic gene identification in eukaryotic genomic DNA is more important than ever for efficiently exploiting the large volume of assembled genome sequences available to the community. Automatic methods have always been considered less reliable than human expertise. This is illustrated in the EGASP project, where the reference annotations against which all automatic methods are measured are generated by human annotators and experimentally verified. We hypothesized that an automatic method could replicate the accuracy of human annotators if the rules and decisions they use were expressed in a mathematical formalism.
We have developed Exogean, a flexible framework based on directed acyclic colored multigraphs (DACMs) that can represent biological objects (for example, mRNA, ESTs, protein alignments, exons) and relationships between them. Graphs are analyzed to process the information according to rules that replicate those used by human annotators. Simple individual starting objects given as input to Exogean are thus combined and synthesized into complex objects such as protein coding transcripts.
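As an illustration of the data structure only, the following minimal Python sketch shows what a directed acyclic colored multigraph over evidence objects might look like. It is not Exogean's implementation: the names `Feature`, `ColoredMultigraph`, `colors`, and `chain`, the color labels `same_molecule`, `same_strand`, and `overlap`, and the left-to-right ordering used to guarantee acyclicity are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical starting object: one aligned block of evidence on the genome."""
    source: str    # kind of evidence, e.g. "mRNA" or "protein"
    molecule: str  # identifier of the aligned molecule
    start: int
    end: int


class ColoredMultigraph:
    """Directed multigraph whose parallel edges are distinguished by color.

    Assumption of this sketch: an edge may only go from a feature to one that
    starts strictly further along the sequence, which rules out cycles.
    """

    def __init__(self):
        self.edges = defaultdict(set)       # (u, v) -> set of colors
        self.successors = defaultdict(set)  # u -> set of v

    def add_edge(self, u, v, color):
        if u.start >= v.start:
            raise ValueError("edges must point left to right to stay acyclic")
        self.edges[(u, v)].add(color)
        self.successors[u].add(v)

    def colors(self, u, v):
        """Return every relationship (color) recorded between u and v."""
        return set(self.edges.get((u, v), set()))

    def chain(self, first, color):
        """Follow edges of one color from `first`, taking the nearest successor."""
        path = [first]
        node = first
        while True:
            candidates = [
                v for v in self.successors.get(node, set())
                if color in self.edges[(node, v)]
            ]
            if not candidates:
                return path
            node = min(candidates, key=lambda f: f.start)
            path.append(node)


# Two blocks of one mRNA alignment and one overlapping protein alignment block.
a = Feature("mRNA", "m1", 100, 200)
b = Feature("mRNA", "m1", 300, 400)
p = Feature("protein", "p1", 120, 200)

g = ColoredMultigraph()
g.add_edge(a, b, "same_molecule")  # two parallel edges between a and b,
g.add_edge(a, b, "same_strand")    # one per relationship type
g.add_edge(a, p, "overlap")

exon_like_blocks = g.chain(a, "same_molecule")
```

Here a rule such as "blocks of the same molecule belong to one transcript model" becomes a traversal restricted to one edge color, which is one way simple input objects could be combined into more complex ones.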
We show here, in the context of the EGASP project, that Exogean is currently the method that best reproduces protein coding gene annotations from human experts, in terms of identifying at least one exact coding sequence per gene. We discuss current limitations of the method and several avenues for improvement.