Nelson Michael G, Linheiro Raquel S, Bergman Casey M
Faculty of Life Sciences, University of Manchester, M13 9PL, United Kingdom.
G3 (Bethesda). 2017 Aug 7;7(8):2763-2778. doi: 10.1534/g3.117.043893.
Transposable element (TE) insertions are among the most challenging types of variants to detect in genomic data because of their repetitive nature and complex mechanisms of replication. Nevertheless, the recent availability of large resequencing data sets has spurred the development of many new methods to detect TE insertions in whole-genome shotgun sequences. Here we report an integrated bioinformatics pipeline for the detection of TE insertions in whole-genome shotgun data, called McClintock (https://github.com/bergmanlab/mcclintock), which automatically runs and standardizes output for multiple TE detection methods. We demonstrate the utility of McClintock by evaluating six TE detection methods using simulated and real genome data from the model microbial eukaryote, Saccharomyces cerevisiae. We find substantial variation among McClintock component methods in their ability to detect nonreference TEs in the yeast genome, but show that nonreference TEs at nearly all biologically realistic locations can be detected in simulated data by combining multiple methods that use split-read and read-pair evidence. In general, our results reveal that split-read methods detect fewer nonreference TE insertions than read-pair methods, but generally have much higher positional accuracy. Analysis of a large sample of real yeast genomes reveals that most McClintock component methods can recover known aspects of TE biology in yeast such as the transpositional activity status of families, target preferences, and target site duplication structure, albeit with varying levels of accuracy. Our work provides a general framework for integrating and analyzing results from multiple TE detection methods, as well as useful guidance for researchers studying TEs in yeast resequencing data.
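To make the idea of combining calls from several detection methods concrete, the following is a minimal Python sketch. It is not McClintock's code and does not reflect its actual output format or merging logic. The `(chromosome, position, family)` tuple layout, the `merge_calls` function, the method names, the 100 bp window, and the example coordinates are all assumptions made for illustration. The sketch clusters calls of the same TE family that fall close together on the same chromosome and records which methods support each cluster.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple

# Hypothetical call format: (chromosome, position, TE family).
Call = Tuple[str, int, str]


class MergedSite(NamedTuple):
    chrom: str
    start: int
    end: int
    family: str
    methods: Set[str]


def merge_calls(calls_by_method: Dict[str, List[Call]],
                window: int = 100) -> List[MergedSite]:
    """Cluster same-family calls on one chromosome when each call lies
    within `window` bp of the previous call in the cluster."""
    grouped = defaultdict(list)
    for method, calls in calls_by_method.items():
        for chrom, pos, family in calls:
            grouped[(chrom, family)].append((pos, method))

    merged = []
    for (chrom, family), items in sorted(grouped.items()):
        items.sort()  # order calls by position
        start = end = items[0][0]
        methods = {items[0][1]}
        for pos, method in items[1:]:
            if pos - end <= window:
                # Close enough: extend the current cluster.
                end = pos
                methods.add(method)
            else:
                # Too far: close the cluster and start a new one.
                merged.append(MergedSite(chrom, start, end, family, methods))
                start = end = pos
                methods = {method}
        merged.append(MergedSite(chrom, start, end, family, methods))
    return merged


# Illustrative input only; method names and coordinates are made up.
calls = {
    "split_read_method": [("chrI", 1000, "Ty1")],
    "read_pair_method": [("chrI", 1040, "Ty1"), ("chrII", 5000, "Ty2")],
}
sites = merge_calls(calls)
for site in sites:
    print(site.chrom, site.start, site.end, site.family, sorted(site.methods))
```

In this sketch, a site supported by both a split-read and a read-pair method could take its coordinates from the split-read call, in line with the reported higher positional accuracy of split-read methods. That choice is left to the reader and is not implemented here.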