Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Colombia.
Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia.
J Integr Bioinform. 2022 Jul 12;19(3). doi: 10.1515/jib-2021-0036. eCollection 2022 Sep 1.
Transposable elements (TEs) are mobile sequences that can move and insert themselves into chromosomes. They are activated by internal or external stimuli and give the organism the ability to adapt to its environment. Annotating transposable elements in genomic data is currently considered crucial for understanding key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all genomic DNA. To annotate these elements, a reference library is usually created and then curated to eliminate TE fragments and false positives; the curated library is then used to annotate the genome with a homology-based method. However, the curation process can take weeks and requires extensive manual work and the execution of multiple time-consuming bioinformatics tools. Here, we propose a machine learning-based approach that performs this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score in only 22.61 s, whereas bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.
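For readers unfamiliar with the two figures quoted above, the following minimal Python sketch shows how an F1-score is derived from classification counts and how the reported runtimes translate into a speed-up factor. The function names `f1_score` and `speedup` and the example counts are hypothetical and chosen for illustration only; they are not the authors' code or data. Only the 22.61 s and approximately 6 h runtimes come from the abstract.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, false positives,
    or false negatives at all, to avoid dividing by zero.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Hypothetical counts for illustration, NOT results from the paper.
    print(f"F1 = {f1_score(tp=90, fp=10, fn=10):.4f}")

    # Runtimes as reported in the abstract: about 6 h versus 22.61 s.
    print(f"speed-up is about {speedup(6 * 3600, 22.61):.0f}x")
```

Under these reported runtimes, the speed-up works out to roughly three orders of magnitude; because the 6 h figure is itself approximate, so is the factor.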