Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary.
Group BM, Data Insights Team, _VOIS, Kerepesi str. 35, 1087, Budapest, Hungary.
BMC Genomics. 2024 Mar 14;25(1):278. doi: 10.1186/s12864-024-10201-9.
Mitochondrial sequences are continuously being integrated into the nuclear genome. The importance of these sequences has already been demonstrated in cancer biology, forensics, phylogenetic studies, and the evolution of eukaryotic genetic information. The genomes of humans and numerous model organisms have been described from the point of view of these sequences. Furthermore, recent studies have reported the patterns of these nuclear-localised mitochondrial sequences in different taxa. However, the results of previous studies are difficult to compare because of the lack of standardised methods and/or the small number of genomes used. Therefore, our primary goal in this paper is to establish a uniform mining pipeline to explore these nuclear-localised mitochondrial sequences.

Our results show that the frequency of several repetitive elements is higher than expected in the flanking regions of these sequences. A machine learning model reveals that the repetitive elements of the flanking regions and different structural characteristics are highly influential during the integration process.

In this paper, we introduce a general mining pipeline for all mammalian genomes. The workflow is publicly available, and we believe it can serve as a validated baseline for future research in this field. On what is, to our current knowledge, the largest dataset, we confirm the widespread opinion that structural circumstances and events corresponding to repetitive elements are highly significant. An accurate model has also been trained to predict these sequences and their corresponding flanking regions.
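The abstract's claim that certain repetitive elements occur in the flanking regions more often "than expected" can be illustrated with a simple enrichment calculation. The sketch below is not the authors' published workflow: the function names, the binomial null model, and the example numbers are all assumptions chosen for illustration. It compares the observed fraction of flanking regions that overlap a given repeat class against an assumed genome-wide background rate.

```python
from math import comb


def fold_enrichment(hits, flank_count, background_rate):
    """Observed fraction of flanks overlapping a repeat class,
    divided by the assumed genome-wide background fraction."""
    return (hits / flank_count) / background_rate


def binomial_upper_tail(hits, flank_count, background_rate):
    """P(X >= hits) for X ~ Binomial(flank_count, background_rate).

    Assumption: each flank overlaps the repeat class independently
    with probability background_rate under the null model.
    """
    return sum(
        comb(flank_count, k)
        * background_rate ** k
        * (1 - background_rate) ** (flank_count - k)
        for k in range(hits, flank_count + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the paper:
    # 30 of 100 flanks overlap the repeat class; background rate is 20%.
    hits, flank_count, background_rate = 30, 100, 0.2
    print("fold enrichment:", fold_enrichment(hits, flank_count, background_rate))
    print("upper-tail p-value:", binomial_upper_tail(hits, flank_count, background_rate))
```

A fold enrichment above 1 together with a small upper-tail probability would suggest over-representation under this simplified null model. A real analysis would also need to account for repeat length, region size, and multiple testing across repeat classes.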