Lomsadze Alexandre, Bonny Christophe, Strozzi Francesco, Borodovsky Mark
Gene Probe, Inc., 1106 Wrights Mill Ct, Atlanta, GA 30324, USA.
Enterome, 94/96 avenue Ledru-Rollin, 75011 Paris, France.
NAR Genom Bioinform. 2021 May 26;3(2):lqab047. doi: 10.1093/nargab/lqab047. eCollection 2021 Jun.
Computational reconstruction of nearly complete genomes from metagenomic reads may identify thousands of new uncultured candidate bacterial species. We have shown that reconstructed prokaryotic genomes, along with genomes of sequenced microbial isolates, can be used to support more accurate gene prediction in novel sequences. We have proposed an approach that uses three types of gene prediction algorithms and finds, for every contig in a metagenome, a nearly optimal model of protein-coding regions, either selected from libraries of pre-computed models or constructed de novo. Model selection and gene annotation are carried out by the new GeneMark-HM pipeline. We have created a database of species-level pan-genomes for the human microbiome. To create a library of models representing each pan-genome, we used the self-training algorithm GeneMarkS-2. Genes initially predicted in each contig serve as queries for a fast similarity search against the pan-genome database, and the best matches determine which model is selected for gene prediction. Contigs not assigned to any pan-genome are analyzed with crude, but still accurate, models designed for sequences with particular GC compositions. Tests of GeneMark-HM on simulated metagenomes demonstrated improved gene annotation of human metagenomic sequences in comparison with current state-of-the-art gene prediction tools.
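The model selection step described in the abstract can be illustrated with a minimal sketch. This is not the GeneMark-HM implementation: the function names, the score threshold, the 5% GC bin width, and the rule of summing hit scores per pan-genome are all assumptions made for illustration. The sketch covers only two of the paths mentioned above (a pre-computed pan-genome model and a GC-specific fallback) and omits de novo model construction.

```python
from collections import defaultdict
from typing import Optional


def gc_percent(sequence: str) -> float:
    """Return GC content as a percentage of the A/C/G/T bases in the sequence."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def best_pan_genome(hits: list[tuple[str, float]], min_score: float) -> Optional[str]:
    """Return the pan-genome with the highest summed score, or None.

    `hits` holds one (pan_genome_id, score) pair per initially predicted gene,
    as a similarity search might report it. Summing scores is an assumed rule.
    """
    totals: dict[str, float] = defaultdict(float)
    for pan_id, score in hits:
        if score >= min_score:  # ignore weak matches (hypothetical threshold)
            totals[pan_id] += score
    if not totals:
        return None
    return max(totals, key=lambda pan_id: totals[pan_id])


def select_model(contig: str, hits: list[tuple[str, float]],
                 min_score: float = 50.0, gc_bin_width: int = 5) -> str:
    """Choose a model label for one contig.

    Prefer the model of the best-matching pan-genome; if no pan-genome
    qualifies, fall back to a model keyed by the contig's GC bin.
    """
    pan_id = best_pan_genome(hits, min_score)
    if pan_id is not None:
        return f"pan_genome:{pan_id}"
    gc_bin = int(gc_percent(contig) // gc_bin_width) * gc_bin_width
    return f"gc_bin:{gc_bin}"
```

For example, a contig whose predicted genes mostly hit a single pan-genome would receive that pan-genome's model, while a contig with no qualifying hits would be assigned a model by its GC content alone.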