Liu Mingyi, Grigoriev Andrei
GPC Biotech AG, Fraunhoferstrasse 20, 82152 Martinsried, Germany.
Bioinformatics. 2005 Jul 15;21(14):3189-90. doi: 10.1093/bioinformatics/bti488. Epub 2005 May 6.
NCBI completed the transition of its main genome annotation database from LocusLink to Entrez Gene in spring 2005. However, to date few parsers exist for the Entrez Gene annotation file. Owing to the widespread use of LocusLink and the popularity of the Perl programming language in bioinformatics, a publicly available, high-performance Entrez Gene parser in Perl is urgently needed. We present four such parsers, each developed with a different parsing approach (Parse::RecDescent, Parse::Yapp, Perl-byacc and Perl 5 regular expressions), and provide the first in-depth comparison of these sophisticated Perl tools. Our fastest parser processes the entire human Entrez Gene annotation file in under 12 min on one Intel Xeon 2.4 GHz CPU and can help the bioinformatics community during and after the transition from LocusLink to Entrez Gene.