Orozco-Arias Simon, Candamil-Cortés Mariana S, Jaimes Paula A, Piña Johan S, Tabares-Soto Reinel, Guyot Romain, Isaza Gustavo
Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Caldas, Colombia.
Department of Systems and Informatics, Universidad de Caldas, Manizales, Caldas, Colombia.
PeerJ. 2021 May 19;9:e11456. doi: 10.7717/peerj.11456. eCollection 2021.
Every day, more plant genomes become available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Although several bioinformatic tools follow different approaches to detect and classify them, none of these tools can obtain accurate results on its own. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.
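To make the idea of k-mer count features concrete, the following is a minimal Python sketch of how a DNA sequence can be turned into a fixed-length numeric vector without any alignment. It is an illustration only: the function names `kmer_counts` and `kmer_features`, the choice of `max_k`, and the decision to skip k-mers containing ambiguous bases are assumptions, not details given in the abstract, and the abstract does not say which classifier or k values the authors used.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous nucleotides are counted


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, returned in fixed lexicographic order."""
    seq = seq.upper()
    # Slide a window of width k over the sequence and tally each substring.
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k possible k-mers so every sequence yields a vector
    # of the same length; k-mers containing other symbols (e.g. N) are ignored.
    return [observed["".join(p)] for p in product(ALPHABET, repeat=k)]


def kmer_features(seq, max_k=2):
    """Concatenate k-mer count vectors for k = 1 .. max_k (hypothetical choice)."""
    features = []
    for k in range(1, max_k + 1):
        features.extend(kmer_counts(seq, k))
    return features
```

Vectors built this way could be passed, one row per sequence, to any standard supervised classifier. Because each vector has the same length regardless of sequence length, no pairwise comparison between sequences is needed, which is what makes such an approach alignment-free.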