Chen Jinfeng, Wrightsman Travis R, Wessler Susan R, Stajich Jason E
Department of Plant Pathology & Microbiology, University of California, Riverside, CA, United States; Institute for Integrative Genome Biology, University of California, Riverside, CA, United States; Department of Botany and Plant Sciences, University of California, Riverside, CA, United States.
Department of Botany and Plant Sciences, University of California, Riverside, CA, United States.
PeerJ. 2017 Jan 26;5:e2942. doi: 10.7717/peerj.2942. eCollection 2017.
Transposable element (TE) polymorphisms are important components of population genetic variation. The functional impacts of TEs on gene regulation and on the generation of genetic diversity have been observed in multiple species, but the frequency and magnitude of TE variation are underappreciated. Inexpensive, deep sequencing has made it affordable to apply population genetic methods to whole genomes using tools that identify single nucleotide and insertion/deletion polymorphisms. However, identifying TE polymorphisms, particularly transposition events or non-reference insertion sites, can be challenging because the repetitive nature of these sequences hampers both the sensitivity and specificity of analysis tools.
We have developed the tool RelocaTE2 to identify TE insertion sites with high sensitivity and specificity. RelocaTE2 searches for known TE sequences in whole genome sequencing reads from second generation sequencing platforms such as Illumina. Reads that contain TE sequence are used as seeds to pinpoint the chromosome locations where TEs have transposed. RelocaTE2 detects the target site duplications (TSDs) of TE insertions, allowing it to report TE polymorphism loci with single base pair precision.
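The idea of using TE-containing reads as seeds and reading the TSD off the two junctions can be illustrated with a toy sketch. This is not RelocaTE2's implementation: exact string matching stands in for real read alignment, and the function name `locate_insertion`, its parameters, and the reference, TE, and read sequences are all hypothetical and invented for illustration.

```python
def locate_insertion(reads, te, ref, seed=8, min_flank=10, max_tsd=20):
    """Toy junction-read caller (illustrative only, not RelocaTE2 itself).

    Reads containing the first or last `seed` bases of the TE are split into
    a TE part and a genomic flank. Flanks are placed on the reference by
    exact match, a stand-in for a real aligner. Returns a sorted list of
    (start, end, tsd) tuples in 0-based, half-open reference coordinates.
    """
    up_ends, down_starts = set(), set()
    for read in reads:
        # Upstream junction: genomic flank followed by the TE start.
        i = read.find(te[:seed])
        if i >= min_flank:
            pos = ref.find(read[:i])
            if pos != -1:
                up_ends.add(pos + i)  # reference coordinate just past the flank
        # Downstream junction: TE end followed by genomic flank.
        j = read.find(te[-seed:])
        if j != -1 and len(read) - (j + seed) >= min_flank:
            pos = ref.find(read[j + seed:])
            if pos != -1:
                down_starts.add(pos)  # reference coordinate where the flank starts
    calls = []
    for end in up_ends:
        for start in down_starts:
            # The two flanks overlap on the reference by exactly the TSD.
            if 0 < end - start <= max_tsd:
                calls.append((start, end, ref[start:end]))
    return sorted(calls)


# Hypothetical 50 bp reference and 32 bp TE, made up for this example.
ref = "ATGCCGTAAGCTTGACCTAGGATCCGTTAACGCATTGCAGTCCAGTTACG"
te = "GGGGTCTCTCTGAAAACCCCCAGAGAGACCCC"

# Simulate an insertion that duplicates the 3 bp target site ref[20:23].
sample = ref[:23] + te + ref[20:]

# Error-free 30 bp reads taken every 3 bp across the sample genome.
reads = [sample[s:s + 30] for s in range(0, len(sample) - 30 + 1, 3)]

calls = locate_insertion(reads, te, ref)
print(calls)  # [(20, 23, 'GAT')]
```

Because the upstream flank ends after the duplicated bases and the downstream flank starts before them, the overlap of the two flanks on the reference gives both the insertion position and the TSD sequence, which is what makes single base pair resolution possible in this toy setting.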
The performance of RelocaTE2 was evaluated using both simulated and real sequence data. RelocaTE2 demonstrates a high level of sensitivity and specificity, particularly when sequence coverage is not shallow. In comparison to the other tools tested, RelocaTE2 achieves the best balance between sensitivity and specificity. In particular, RelocaTE2 performs best in predicting the TSDs of TE insertions. Even in highly repetitive regions, such as those tested on rice chromosome 4, RelocaTE2 is able to report up to 95% of simulated TE insertions with a false positive rate below 0.1% using 10-fold genome coverage resequencing data. RelocaTE2 provides a robust solution for identifying TE insertion sites and can be incorporated into analysis workflows to help describe the complete genotype from light coverage genome sequencing.
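As a rough illustration of how calls can be scored against simulated insertions, the sketch below computes sensitivity and the fraction of calls that match no simulated site. The abstract does not state exactly how the false positive rate was defined, so the second quantity is only one plausible reading. The function name `evaluate_calls`, the `tolerance` parameter, and the coordinates are hypothetical.

```python
def evaluate_calls(simulated, called, tolerance=0):
    """Score called insertion sites against simulated ones.

    `simulated` and `called` are lists of (chromosome, position) tuples.
    A call is a true positive if it lies within `tolerance` bp of a
    simulated site that has not already been matched (greedy matching).
    Returns (sensitivity, false_call_fraction).
    """
    matched = set()
    true_pos = 0
    for chrom, pos in called:
        hit = next(
            (s for s in simulated
             if s not in matched and s[0] == chrom and abs(s[1] - pos) <= tolerance),
            None,
        )
        if hit is not None:
            matched.add(hit)
            true_pos += 1
    false_pos = len(called) - true_pos
    sensitivity = true_pos / len(simulated) if simulated else 0.0
    # Assumption: "false" calls are expressed as a share of all calls made.
    false_call_fraction = false_pos / len(called) if called else 0.0
    return sensitivity, false_call_fraction


# Made-up coordinates: four simulated sites, three calls, 2 bp tolerance.
simulated = [("Chr4", 100), ("Chr4", 250), ("Chr4", 900), ("Chr4", 1500)]
called = [("Chr4", 100), ("Chr4", 252), ("Chr4", 4000)]
sens, false_frac = evaluate_calls(simulated, called, tolerance=2)
print(f"sensitivity={sens:.2f}, false call fraction={false_frac:.2f}")
```

Setting `tolerance=0` requires calls to match simulated sites exactly, which is the stricter test for a tool that reports insertions at single base pair precision.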