Fiedler Lisa, Middendorf Martin, Bernt Matthias
Department of Computer Science, Leipzig University, Leipzig, Germany.
Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany.
Front Genet. 2023 Aug 10;14:1250907. doi: 10.3389/fgene.2023.1250907. eCollection 2023.
A wide range of scientific fields, such as forensics, anthropology, medicine, and molecular evolution, benefits from the analysis of mitogenomic data. With the development of new sequencing technologies, the amount of mitochondrial sequence data to be analyzed has increased exponentially over the last few years. The accurate annotation of mitochondrial DNA is a prerequisite for any mitogenomic comparative analysis. To keep pace with the growth of the available mitochondrial sequence data, highly efficient automatic computational methods are needed. Automatic annotation methods are typically based on databases that contain information about already annotated (and often pre-curated) mitogenomes of different species. However, the existing approaches have several shortcomings: 1) they do not scale well with the size of the database; 2) they do not allow for a fast (and easy) update of the database; and 3) they can only be applied to a relatively small taxonomic subset of all species. Here, we present a novel approach that has none of the shortcomings (1), (2), and (3). The reference database of mitogenomes is represented as a richly annotated de Bruijn graph. To generate gene predictions for a new user-supplied mitogenome, the method uses a clustering routine that operates on the mapping information of the provided sequence to this graph. The method is implemented in a software package called DeGeCI (De Bruijn graph Gene Cluster Identification). For a large set of mitogenomes for which expert-curated annotations are available, DeGeCI generates gene predictions of high conformity. In a comparative evaluation with MITOS2, a state-of-the-art annotation tool for mitochondrial genomes, DeGeCI shows better database scalability while still matching MITOS2 in terms of result quality and providing a fully automated means to update the underlying database. Moreover, unlike MITOS2, DeGeCI can be run in parallel on several processors to make use of modern multi-processor systems.
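As an illustration of the general idea only, the following minimal Python sketch builds a small de Bruijn graph whose k-mers carry gene labels from annotated reference sequences, maps a query sequence onto it, and merges runs of consecutive labelled k-mers into predicted gene intervals. It is not DeGeCI's implementation: the function names `build_annotated_graph` and `cluster_hits`, the exact-match mapping, the run-merging rule, and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict


def build_annotated_graph(references, k):
    """Build a toy annotated de Bruijn graph (hypothetical helper).

    references: iterable of (sequence, annotations), where annotations is a
    list of (start, end, gene) with 0-based, end-exclusive coordinates.
    Returns (edges, labels): edges maps each (k-1)-mer to its successor
    (k-1)-mers, labels maps each k-mer to the genes that fully contain it.
    """
    edges = defaultdict(set)
    labels = defaultdict(set)
    for seq, annotations in references:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
            for start, end, gene in annotations:
                if start <= i and i + k <= end:
                    labels[kmer].add(gene)
    return edges, labels


def cluster_hits(query, labels, k):
    """Merge runs of consecutive query k-mers sharing a gene label.

    Simplified stand-in for a clustering step: exact k-mer matches only,
    and a run ends at the first k-mer that lacks the label.
    Returns a list of (gene, start, end) sorted by start position.
    """
    open_runs = {}  # gene -> [start, end] of the run currently being extended
    predictions = []
    for i in range(len(query) - k + 1):
        genes = labels.get(query[i:i + k], set())
        # Close runs whose gene is not supported by the current k-mer.
        for gene in list(open_runs):
            if gene not in genes:
                start, end = open_runs.pop(gene)
                predictions.append((gene, start, end))
        # Extend existing runs or open new ones.
        for gene in genes:
            if gene in open_runs:
                open_runs[gene][1] = i + k
            else:
                open_runs[gene] = [i, i + k]
    for gene, (start, end) in open_runs.items():
        predictions.append((gene, start, end))
    return sorted(predictions, key=lambda p: p[1])


# Toy data, invented for this example.
reference = ("ATGCCGTAAGCTTGA", [(0, 7, "geneA"), (8, 15, "geneB")])
edges, labels = build_annotated_graph([reference], k=4)
print(cluster_hits("ATGCCGTTTAGCTTGA", labels, k=4))
# prints [('geneA', 0, 7), ('geneB', 9, 16)]
```

In this sketch, adding a new reference genome only adds k-mers and labels to the existing dictionaries, which hints at why a graph representation can make database updates cheap. A practical tool would additionally need to handle inexact matches, both strands, and circular genomes.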