Department of Computer Science, Universidad Autónoma de Manizales, 170002 Manizales, Colombia.
Department of Systems and Informatics, Universidad de Caldas, 170002 Manizales, Colombia.
Genes (Basel). 2021 Jan 28;12(2):190. doi: 10.3390/genes12020190.
Long terminal repeat (LTR) retrotransposons (LTR-RTs) are mobile elements that constitute the major fraction of most plant genomes. Identifying and annotating these elements with bioinformatics approaches is a major challenge in the era of massive plant genome sequencing. Beyond their involvement in genome size variation, LTR retrotransposons are associated with the function and structure of different chromosomal regions and can, among other effects, alter the function of coding regions. Several sequence databases of plant LTR retrotransposons exist, some publicly accessible (such as PGSB and RepetDB) and some with restricted access (such as Repbase). Although these databases are useful for identifying LTR-RTs in new genomes by similarity, their elements are not fully classified to the lineage (also called family) level. Here, we present InpactorDB, a semi-curated dataset of 130,439 elements from 195 plant genomes (belonging to 108 plant species), classified to the lineage level. This dataset has been used to train two deep neural networks (one fully connected and one convolutional) for the rapid classification of these elements. In lineage-level classification, we obtain scores of up to 98% for F1-score, precision, and recall.
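As an illustration of the three reported metrics, the following minimal sketch computes per-lineage precision, recall, and F1-score and then averages them across lineages. It is not the authors' evaluation code: the function names, the toy labels, and the use of unweighted (macro) averaging are all assumptions made for the example, since the abstract does not state how the scores were aggregated.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this lineage
        else:
            fp[pred] += 1       # predicted lineage gets a false positive
            fn[truth] += 1      # true lineage gets a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all labels."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy data only: the lineage labels are placeholders, not dataset results.
y_true = ["ALE", "ALE", "ATHILA", "ATHILA", "TEKAY", "TEKAY"]
y_pred = ["ALE", "ATHILA", "ATHILA", "ATHILA", "TEKAY", "ALE"]

scores = per_class_scores(y_true, y_pred)
macro = macro_average(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print("macro: precision={:.3f} recall={:.3f} f1={:.3f}".format(*macro))
```

Macro averaging weights every lineage equally, so small lineages count as much as large ones. A support-weighted average would be the alternative if lineage sizes differ widely.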