Centre for Ecology and Conservation, University of Exeter, Penryn Campus, Cornwall TR10 9FE, UK.
Laboratory of Evolutionary Genetics, Institute of Biology, University of Neuchâtel, 2000 Neuchâtel, Switzerland.
Mol Biol Evol. 2024 Apr 2;41(4). doi: 10.1093/molbev/msae068.
Transposable elements (TEs) are major components of eukaryotic genomes and are implicated in a range of evolutionary processes. Yet, TE annotation and characterization remain challenging, particularly for nonspecialists, since existing pipelines are typically complicated to install, run, and extract data from. Current methods of automated TE annotation are also subject to issues that reduce overall quality, particularly (i) fragmented and overlapping TE annotations, leading to erroneous estimates of TE count and coverage, and (ii) repeat models represented by short sections of total TE length, with poor capture of 5' and 3' ends. To address these issues, we present Earl Grey, a fully automated TE annotation pipeline designed for user-friendly curation and annotation of TEs in eukaryotic genome assemblies. Using nine simulated genomes and an annotation of Drosophila melanogaster, we show that Earl Grey outperforms current widely used TE annotation methodologies in ameliorating the issues mentioned above while scoring highly in benchmarking for TE annotation and classification and being robust across genomic contexts. Earl Grey is a comprehensive and fully automated TE annotation toolkit that provides researchers with paper-ready summary figures and outputs in standard formats compatible with other bioinformatics tools. Earl Grey has a modular format, with great scope for the inclusion of additional modules focused on further quality control and tailored analyses in future releases.
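To illustrate issue (i), the sketch below shows how overlapping annotations can inflate a coverage estimate when annotation lengths are simply summed. It is a generic interval-merging example, not Earl Grey's own procedure; the function names and the coordinates are hypothetical and chosen only for illustration.

```python
def naive_coverage(intervals):
    """Sum of annotation lengths; bases under overlapping annotations are counted repeatedly."""
    return sum(end - start for start, end in intervals)


def merged_coverage(intervals):
    """Number of distinct bases covered, after merging overlapping or adjacent intervals."""
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # A gap precedes this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Hypothetical TE annotations on one contig, as 0-based half-open (start, end) pairs.
annotations = [(100, 400), (350, 600), (900, 1000)]
print(naive_coverage(annotations))   # 650
print(merged_coverage(annotations))  # 600
```

Here the first two annotations overlap by 50 bases, so the naive sum reports 650 bases of TE coverage while only 600 distinct bases are annotated. Counting the three records as three TE insertions would likewise be an overestimate if the first two are fragments of one element.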