Jin Ying, Hammell Molly
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
Methods Mol Biol. 2018;1751:153-167. doi: 10.1007/978-1-4939-7710-9_11.
Transposable elements (TEs) are mobile genetic elements that can readily change their genomic position. When not properly silenced, TEs can contribute a substantial portion of the cell's transcriptome, yet they are typically ignored in RNA-seq data analyses. One reason for leaving TE-derived reads out of RNA-seq analyses is the complexity of properly aligning short sequencing reads to these highly repetitive regions. Here we describe a method for including TE-derived reads in RNA-seq differential expression analysis using an open source software package called TEtranscripts. TEtranscripts is designed to assign both uniquely and ambiguously mapped reads to all possible gene and TE-derived transcripts in order to statistically infer the correct gene/TE abundances. We provide a detailed tutorial of TEtranscripts using a published, qPCR-validated dataset.
Barbara McClintock laid the foundation for TE research with her discoveries in maize of mobile genetic elements capable of inserting into novel locations in the genome and altering the expression of nearby genes [1]. Since then, our appreciation of the contribution of repetitive TE-derived sequences to eukaryotic genomes has vastly increased. With the publication of the first human genome draft by the Human Genome Project, it was determined that nearly half of the human genome is derived from TE sequences [2, 3], with varying levels of repetitive DNA present in most plant and animal species. More recent studies that include distantly related TE-like sequences have estimated that up to two-thirds of the human genome might be repeat-derived [4]. The vast majority of these sequences are attributed to retrotransposons, which require transcription as part of the mobilization process, as discussed below.
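The abstract's idea of assigning ambiguously mapped reads among all candidate genes and TEs in order to infer abundances can be illustrated with a generic expectation-maximization (EM) loop. The following is a minimal sketch under stated assumptions, not the TEtranscripts implementation: the function name `em_assign`, the input layout (one list of candidate feature names per read), and the feature names `GeneA` and `TE1` are all hypothetical, and transcript length and other corrections are deliberately ignored.

```python
def em_assign(reads, n_iter=100, tol=1e-9):
    """Fractionally assign reads to features with a simple EM loop.

    reads: list where each item lists the candidate features (genes or TEs)
           that one read aligns to. A uniquely mapped read has one candidate.
    Returns a dict mapping each feature to its estimated (fractional) count.
    Illustrative only: no length normalization, hypothetical input format.
    """
    # Drop reads with no candidates and duplicate candidates within a read.
    reads = [tuple(dict.fromkeys(cands)) for cands in reads if cands]
    features = sorted({f for cands in reads for f in cands})
    if not features:
        return {}

    # Start from equal relative abundance for every feature.
    theta = {f: 1.0 / len(features) for f in features}
    counts = dict.fromkeys(features, 0.0)

    for _ in range(n_iter):
        # E-step: split each read among its candidates in proportion
        # to the current abundance estimates.
        counts = dict.fromkeys(features, 0.0)
        for cands in reads:
            total = sum(theta[f] for f in cands)
            for f in cands:
                counts[f] += theta[f] / total

        # M-step: re-estimate relative abundances from fractional counts.
        new_theta = {f: counts[f] / len(reads) for f in features}
        delta = max(abs(new_theta[f] - theta[f]) for f in features)
        theta = new_theta
        if delta < tol:
            break

    return counts


if __name__ == "__main__":
    # Three reads unique to GeneA, one unique to TE1, two ambiguous.
    example = [["GeneA"]] * 3 + [["TE1"]] + [["GeneA", "TE1"]] * 2
    for feature, count in em_assign(example).items():
        print(feature, round(count, 3))
```

In this toy input the unique reads favor `GeneA` three to one, so the two ambiguous reads are split in roughly that ratio rather than evenly, and the fractional counts still sum to the total number of reads.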