Yok Non, Rosen Gail
Drexel University, Electrical and Computer Engineering Department, 3141 Chestnut Street, PA 19104, USA.
Annu Int Conf IEEE Eng Med Biol Soc. 2010;2010:6190-3. doi: 10.1109/IEMBS.2010.5627744.
This manuscript presents the most rigorous benchmarking of gene annotation algorithms for metagenomic datasets to date. We compare three programs: GeneMark, MetaGeneAnnotator (MGA) and Orphelia. The comparison is based on their performance on simulated fragments from one hundred species of diverse lineages. We define four types of fragments: two come from inter- and intra-coding regions, and the other two come from gene edges. Hoff et al. used only 12 species in their comparison; their sample is therefore too small to represent an environmental sample. In addition, no previous study has separately examined fragments that contain gene edges as opposed to intra-coding regions. In general, the performance of all three programs improves as fragment length increases. Moreover, intra-coding fragments show lower annotation error than gene-edge fragments in all of the programs. Overall, combining all the methods yields an upper bound on performance.
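As an illustration only, the four fragment types described above could be assigned by comparing a fragment's coordinates with the annotated gene coordinates. The sketch below is not the authors' procedure: the function name `classify_fragment`, the label strings, the half-open coordinate convention, and the extra `mixed` label (for fragments that span more than one gene boundary) are all assumptions, and strand is ignored for simplicity.

```python
from typing import List, Tuple

# A gene is given as (start, end), 0-based and half-open; strand is ignored here.
Gene = Tuple[int, int]


def classify_fragment(frag_start: int, frag_end: int, genes: List[Gene]) -> str:
    """Label the fragment [frag_start, frag_end) relative to annotated genes.

    Hypothetical labels (not the paper's terminology):
      'inter-coding'    : the fragment overlaps no gene
      'intra-coding'    : the fragment lies entirely inside one gene
      'gene-start-edge' : a gene begins inside the fragment
      'gene-end-edge'   : a gene ends inside the fragment
      'mixed'           : the fragment contains both a gene start and a gene end
    """
    # Genes that share at least one position with the fragment.
    overlapping = [(s, e) for s, e in genes if s < frag_end and frag_start < e]
    if not overlapping:
        return "inter-coding"

    # Fragment fully contained in a single gene.
    if any(s <= frag_start and frag_end <= e for s, e in overlapping):
        return "intra-coding"

    # Otherwise at least one gene boundary falls strictly inside the fragment.
    has_start = any(frag_start < s < frag_end for s, _ in overlapping)
    has_end = any(frag_start < e < frag_end for _, e in overlapping)

    if has_start and not has_end:
        return "gene-start-edge"
    if has_end and not has_start:
        return "gene-end-edge"
    return "mixed"


if __name__ == "__main__":
    # Toy annotation with two genes (made-up coordinates for illustration).
    toy_genes = [(100, 400), (600, 900)]
    for start, end in [(150, 350), (420, 580), (50, 250), (300, 500), (350, 650)]:
        print((start, end), classify_fragment(start, end, toy_genes))
```

With strand information, the start and end labels would need to be swapped for genes on the reverse strand; that refinement is omitted here.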