Gabriel Lars, Becker Felix, Hoff Katharina J, Stanke Mario
Institute of Mathematics and Computer Science, University of Greifswald, Greifswald 17489, Germany.
Center for Functional Genomics of Microbes, University of Greifswald, Greifswald 17489, Germany.
Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae685.
For more than 25 years, learning-based eukaryotic gene predictors have been driven by hidden Markov models (HMMs) that take a DNA sequence directly as input. Recently, Holst et al. demonstrated with their program Helixer that the accuracy of ab initio eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.
We present Tiberius, a novel deep learning-based ab initio gene predictor that integrates convolutional and long short-term memory layers with a differentiable HMM layer in a single end-to-end model. Tiberius uses a custom gene prediction loss; it was trained for prediction in mammalian genomes and evaluated on human and two other genomes. It significantly outperforms existing ab initio methods, achieving a gene-level F1 score of 62% for the human genome, compared to 21% for the next best ab initio method. In de novo mode, Tiberius predicts the exon-intron structure of two out of three human genes without error. Remarkably, even Tiberius's ab initio accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius's highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.
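To make the gene-level F1 figures concrete, the following minimal sketch computes precision, recall, and F1 from two sets of gene structures. It is not the evaluation code used for Tiberius. It assumes each gene is represented by a single transcript and counts a prediction as correct only if sequence, strand, and all coding exon boundaries match the annotation exactly. The function name `gene_level_f1` and the toy coordinates are hypothetical and are not results from the paper.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: (sequence name, strand, coding exons as (start, end) pairs).
GeneStructure = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def gene_level_f1(predicted: Set[GeneStructure],
                  annotated: Set[GeneStructure]) -> Dict[str, float]:
    """Return precision, recall, and F1 under exact-match gene-level scoring."""
    # A predicted gene is a true positive only if its full structure is annotated.
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (made up): three annotated genes, three predictions, two exact matches.
annotated_genes = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((1000, 1100),)),
    ("chr2", "+", ((50, 120), (180, 260), (400, 480))),
}
predicted_genes = {
    ("chr1", "+", ((100, 200), (300, 400))),             # exact match
    ("chr1", "-", ((1000, 1100),)),                      # exact match
    ("chr2", "+", ((50, 120), (185, 260), (400, 480))),  # one exon boundary is off
}

scores = gene_level_f1(predicted_genes, annotated_genes)
print(scores)
```

Under this strict criterion a single shifted exon boundary makes the whole gene count as wrong, which is why gene-level scores are typically much lower than exon-level or nucleotide-level scores for the same predictions.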