Institute of Crop Science, National Agriculture and Food Research Organization, 2-1-2, Kannondai, Tsukuba, Ibaraki, 305-8518, Japan.
BMC Bioinformatics. 2022 Nov 22;23(1):500. doi: 10.1186/s12859-022-05011-3.
Detecting new transposition events of transposable elements (TEs) from next-generation sequencing (NGS) data is difficult because TEs are distributed at many sites across the genome, including older copies. The previously reported Transposon Insertion Finder (TIF) detects TE transpositions relative to the reference genome from NGS short reads, using the end sequences of the target TE. TIF requires the sequence of the target TE and cannot detect transpositions of TEs whose sequence is unknown.
The new algorithm, Transposable Element Finder (TEF), detects TE transpositions even for TEs with an unknown sequence. TEF is a tool for finding transposed TEs, whereas TIF is a tool for detecting the transposed sites of TEs with a known sequence. A transposition event is often accompanied by a target site duplication (TSD). Focusing on the TSD, two algorithms that detect both ends of a TE, its TSD and its target site are reported here. One is based on grouping by TSD and direct comparison of k-mers from NGS reads, without a similarity search. The other is based on junction mapping of candidate TE end sequences. Both methods succeeded in detecting both ends and the TSDs of known active TEs in several tests with rice, Arabidopsis and Drosophila data, and discovered several new TEs in new locations. PCR confirmed the detected TE transpositions in several test cases in rice.
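The first approach (direct k-mer comparison followed by grouping on the TSD) can be illustrated with a minimal Python sketch. This is not the TEF implementation: the function names, the assumption that each junction k-mer is split at its midpoint into a genomic half and a TE half, and the parameters `half`, `tsd_len` and `min_count` are all hypothetical choices made for illustration. Reverse-complement handling and any further filtering are omitted.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(reads_a, reads_b, k, min_count=1):
    """Return k-mers seen at least min_count times in sample A and never in sample B.

    This is a plain set comparison; no similarity search is involved.
    """
    counts_a = count_kmers(reads_a, k)
    counts_b = count_kmers(reads_b, k)
    return {kmer for kmer, n in counts_a.items()
            if n >= min_count and kmer not in counts_b}


def pair_by_tsd(kmers, half, tsd_len):
    """Pair head- and tail-junction candidates that share the same TSD.

    Assumption (illustrative only): each k-mer has length 2 * half.
    A head junction reads [genomic flank ending in TSD][TE head];
    a tail junction reads [TE tail][TSD followed by genomic flank].
    """
    heads_by_tsd = {}
    for kmer in kmers:
        flank = kmer[:half]
        heads_by_tsd.setdefault(flank[-tsd_len:], []).append(kmer)

    pairs = []
    for tail in kmers:
        tsd = tail[half:][:tsd_len]
        for head in heads_by_tsd.get(tsd, []):
            if head != tail:
                pairs.append((head, tail, tsd))
    return sorted(pairs)


# Toy data: a TE "TGTTCCCCAACA" inserted with the 3-bp TSD "GCA".
wild_type = "CCTAGGCATTGGA"
mutant = "CCTAGGCATGTTCCCCAACAGCATTGGA"

specific = sample_specific_kmers([mutant], [wild_type], k=8)
junction_pairs = pair_by_tsd({"GGCATGTT", "AACAGCAT"}, half=4, tsd_len=3)
```

On real data many spurious pairs would survive this step, so the sketch only shows why a shared TSD is a useful grouping key, not how candidates are ultimately validated.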
By directly comparing NGS data between two samples, TEF detects transposed TEs together with the TSDs produced by transposition, the sequences of both TE ends, and the insertion positions. Genotypes of the transpositions are verified by counting head junctions, tail junctions and non-insertion sequences in the NGS reads. TEF is easy to run and independent of any TE library, which makes it useful for detecting insertions of unknown TEs that are missed by common TE annotation pipelines.
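The genotype check by junction counting can be sketched as follows. This is a simplified illustration under stated assumptions, not TEF's actual procedure: the function names, the `min_reads` threshold and the genotype labels are hypothetical, and reads are matched by exact substring search on one strand only.

```python
def count_reads_with(reads, seq):
    """Number of reads that contain seq as an exact substring."""
    return sum(1 for read in reads if seq in read)


def call_genotype(reads, head_junction, tail_junction, empty_site, min_reads=1):
    """Classify one insertion site from junction and non-insertion read counts.

    head_junction / tail_junction: sequences spanning the flank-TE boundaries.
    empty_site: sequence spanning the target site without an insertion.
    The threshold and the labels are illustrative assumptions.
    """
    head = count_reads_with(reads, head_junction)
    tail = count_reads_with(reads, tail_junction)
    empty = count_reads_with(reads, empty_site)

    inserted = head >= min_reads and tail >= min_reads
    if inserted and empty >= min_reads:
        return "heterozygous"
    if inserted:
        return "homozygous insertion"
    if empty >= min_reads:
        return "no insertion"
    return "undetermined"


# Toy reads covering the head junction, the tail junction and the empty site.
head_read = "CCTAGGCATGTTCC"
tail_read = "CCAACAGCATTGGA"
empty_read = "CCTAGGCATTGGA"

het = call_genotype([head_read, tail_read, empty_read],
                    "GGCATGTT", "AACAGCAT", "GGCATTGG")
hom = call_genotype([head_read, tail_read],
                    "GGCATGTT", "AACAGCAT", "GGCATTGG")
```

Requiring both the head and the tail junction before calling an insertion is a conservative choice in this sketch; a real pipeline would also need to account for sequencing depth and errors.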