Panta Manisha, Mishra Avdesh, Hoque Md Tamjidul, Atallah Joel
Department of Computer Science, University of New Orleans, New Orleans, LA 70148, USA.
Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, USA.
Bioinformatics. 2021 Sep 9;37(17):2529-2536. doi: 10.1093/bioinformatics/btab146.
Transposable elements (TEs), or jumping genes, are DNA sequences with an intrinsic capability to move from one location to another within a host genome. Studies show that the presence of a TE within or adjacent to a functional gene may alter the gene's expression. TEs can also increase the mutation rate and can mediate duplications and large insertions and deletions in the genome, promoting gross genetic rearrangements. Proper classification of identified jumping genes is therefore important for analyzing their genetic and evolutionary effects, and an effective classifier is needed to explain the role of TEs in germline and somatic evolution more accurately. In this study, we examine the performance of a variety of machine learning (ML) techniques and propose ClassifyTE, a robust stacking-based ML method for the hierarchical classification of TEs with high accuracy.
We propose a stacking-based approach for the hierarchical classification of TEs. When trained on three different benchmark datasets, the proposed system achieved average improvements of 4%, 10.68% and 10.13%, respectively, in the hierarchical F-measure (hF) over several state-of-the-art methods. Based on this approach, we developed ClassifyTE, an end-to-end automated hierarchical classification tool that classifies TEs up to the super-family level. We further evaluated our method on a new TE library generated by a homology-based classification method and found relatively high concordance at higher taxonomic levels. Thus, ClassifyTE paves the way for a more accurate analysis of the role of TEs.
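The abstract reports results with the hF measure but does not define it. The sketch below assumes the commonly used definition of hierarchical precision, recall and F-measure, in which each true and predicted label is expanded with its ancestors in the taxonomy before overlaps are counted. The toy taxonomy, the label names and the function names are hypothetical illustrations and are not taken from ClassifyTE.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "TE": None,
    "ClassI": "TE",
    "ClassII": "TE",
    "LTR": "ClassI",
    "Copia": "LTR",
    "Gypsy": "LTR",
    "TIR": "ClassII",
    "hAT": "TIR",
}


def with_ancestors(label: str) -> Set[str]:
    """Return the label together with all of its ancestors, excluding the root."""
    nodes: Set[str] = set()
    node: Optional[str] = label
    while node is not None and PARENT[node] is not None:
        nodes.add(node)
        node = PARENT[node]
    return nodes


def hierarchical_f(true_labels: List[str],
                   pred_labels: List[str]) -> Tuple[float, float, float]:
    """Compute (hP, hR, hF) over paired true and predicted labels."""
    overlap = pred_total = true_total = 0
    for t, p in zip(true_labels, pred_labels):
        t_set, p_set = with_ancestors(t), with_ancestors(p)
        overlap += len(t_set & p_set)   # nodes shared by both label paths
        pred_total += len(p_set)
        true_total += len(t_set)
    hp = overlap / pred_total if pred_total else 0.0
    hr = overlap / true_total if true_total else 0.0
    hf = 2 * hp * hr / (hp + hr) if (hp + hr) else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    true = ["Copia", "hAT", "Gypsy"]
    pred = ["Gypsy", "hAT", "LTR"]   # wrong super-family, exact hit, stopped one level early
    hp, hr, hf = hierarchical_f(true, pred)
    print(f"hP={hp:.4f} hR={hr:.4f} hF={hf:.4f}")
```

Under this definition a prediction in the wrong super-family but the correct order still earns partial credit, which is why a hierarchical measure suits a classifier that assigns labels level by level.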
The source code and data are available at https://github.com/manisa/ClassifyTE.
Supplementary data are available at Bioinformatics online.