The Bioinformatics Toolsmith Laboratory, Tandy School of Computer Science, University of Tulsa, 800 South Tucker Drive, Tulsa, OK 74104, USA.
BMC Genomics. 2019 Jun 3;20(1):450. doi: 10.1186/s12864-019-5796-9.
Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and defense mechanisms. Large-scale sequencing projects for plant genomes are currently underway, and software tools are needed to annotate long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently across genomes. Some are hard to install or obsolete, and some may struggle to process large plant genomes. None can be executed in parallel out of the box, and very few have features to support visual review of new elements. To overcome these limitations, we developed LtrDetector, which uses techniques inspired by signal processing.
We compared LtrDetector to LTR_Finder and LTRharvest, the two most successful predecessor tools, on six plant genomes. For each organism, we constructed a ground truth data set based on queries from a consensus sequence database. According to this evaluation, LtrDetector was the most sensitive tool, achieving a 16-23% improvement in sensitivity over LTRharvest and a 21% improvement over LTR_Finder. All three tools had low false positive rates; LtrDetector achieved 98.2% precision, which falls between the values of its two competitors. Overall, LtrDetector provides the best compromise between high sensitivity and low false positive rate while requiring moderate time and an amount of memory available on personal computers.
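To make the two reported metrics concrete, the following is a minimal sketch of how sensitivity and precision could be computed from element coordinates. It is not the evaluation procedure used in the study: the half-open interval representation, the reciprocal-overlap matching rule, the `min_cover` threshold, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: an element is a half-open interval (start, end).
Interval = Tuple[int, int]


def covered_fraction(a: Interval, b: Interval) -> float:
    """Return the fraction of interval a that is covered by interval b."""
    length = a[1] - a[0]
    if length <= 0:
        return 0.0
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(shared, 0) / length


def matches(a: Interval, b: Interval, min_cover: float) -> bool:
    """Reciprocal overlap: each interval must cover min_cover of the other."""
    return (covered_fraction(a, b) >= min_cover
            and covered_fraction(b, a) >= min_cover)


def sensitivity(truth: List[Interval], predicted: List[Interval],
                min_cover: float = 0.9) -> float:
    """Fraction of ground-truth elements matched by at least one prediction."""
    if not truth:
        return 0.0
    found = sum(1 for t in truth
                if any(matches(t, p, min_cover) for p in predicted))
    return found / len(truth)


def precision(truth: List[Interval], predicted: List[Interval],
              min_cover: float = 0.9) -> float:
    """Fraction of predictions that match at least one ground-truth element."""
    if not predicted:
        return 0.0
    correct = sum(1 for p in predicted
                  if any(matches(p, t, min_cover) for t in truth))
    return correct / len(predicted)


if __name__ == "__main__":
    # Toy coordinates, invented for this example only.
    truth = [(0, 100), (200, 300), (500, 600)]
    predicted = [(0, 100), (205, 300), (900, 1000)]
    print(sensitivity(truth, predicted), precision(truth, predicted))
```

The quadratic all-pairs comparison keeps the sketch short; for whole-genome annotations, sorting the intervals or using an interval tree would be the more practical choice.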
LtrDetector uses a novel methodology based on k-mer distributions, which allows it to produce high-quality results using relatively lightweight procedures. It is easy to install and use. It is not species-specific, performing well with its default parameters on genomes of varying size and repeat content. It is automatically configured for parallel execution and runs efficiently on an ordinary personal computer. It includes a tool for visualizing k-mer scores to facilitate manual review of the identified elements. These features make LtrDetector an attractive tool for future annotation projects involving long terminal repeat retrotransposons.
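The abstract does not spell out how the k-mer scores are defined, so the following is only an illustrative sketch of one simple per-position k-mer score, not LtrDetector's actual procedure. It assumes a score equal to the distance from each k-mer to the next downstream copy of the same k-mer (0 if there is none); the function name and the toy sequence are hypothetical. Under this assumption, the two repeats flanking an element would appear as runs of positions with similar non-zero scores.

```python
from typing import Dict, List


def kmer_distance_scores(seq: str, k: int) -> List[int]:
    """Score each k-mer start position by the distance to the next
    downstream occurrence of the same k-mer (0 if it never recurs).

    Illustrative assumption only; not taken from the LtrDetector source.
    """
    n = len(seq) - k + 1
    if k <= 0 or n <= 0:
        return []
    scores = [0] * n
    next_seen: Dict[str, int] = {}  # k-mer -> nearest start position downstream
    # Scan right to left so next_seen always holds the closest downstream copy.
    for i in range(n - 1, -1, -1):
        kmer = seq[i:i + k]
        if kmer in next_seen:
            scores[i] = next_seen[kmer] - i
        next_seen[kmer] = i
    return scores


if __name__ == "__main__":
    # "ACGT" occurs at positions 0 and 6, so position 0 scores 6.
    print(kmer_distance_scores("ACGTAAACGT", 4))
```

A single right-to-left pass with a dictionary keeps the computation linear in sequence length, which fits the abstract's description of relatively lightweight procedures.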