Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Bioinformatics. 2011 Oct 1;27(19):2633-40. doi: 10.1093/bioinformatics/btr458. Epub 2011 Aug 8.
In eukaryotic cells, alternative splicing expands the diversity of RNA transcripts, plays an important role in tissue-specific differentiation, and can be misregulated in disease. Understanding these processes requires methods that detect differential transcription between samples. Our focus is on samples observed using short-read RNA sequencing (RNA-seq).
We characterize differential transcription between two samples as the difference in the relative abundance of the transcript isoforms present in the samples. The magnitude of differential transcription of a gene between two samples can be measured by the square root of the Jensen-Shannon divergence (JSD*) between the gene's transcript abundance vectors in each sample. We define a weighted splice-graph representation of RNA-seq data that compactly summarizes the alignment of RNA-seq reads to a reference genome. The flow difference metric (FDM) identifies regions of differential RNA transcript expression between pairs of splice graphs without requiring an underlying gene model or catalog of transcripts. We present a novel non-parametric statistical test between splice graphs to assess the significance of differential transcription, and extend it to group-wise comparison incorporating sample replicates.
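As an illustration of the JSD* measure described above, the following minimal Python sketch computes the square root of the Jensen-Shannon divergence between two transcript abundance vectors for one gene. The function name `jsd_sqrt`, the use of base-2 logarithms (which bounds the value to [0, 1]), and the normalization of raw abundances to relative abundances are assumptions made for this example; the abstract does not specify these details, and this is not the authors' implementation.

```python
import math


def jsd_sqrt(p, q):
    """Square root of the Jensen-Shannon divergence between two
    non-negative abundance vectors of equal length.

    Assumptions (not stated in the abstract): inputs are normalized to
    relative abundances, and logarithms are taken in base 2.
    """
    if len(p) != len(q):
        raise ValueError("abundance vectors must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each vector must have positive total abundance")

    # Convert raw abundances to relative abundances.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution used by the Jensen-Shannon divergence.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical gene with two isoforms, expressed in opposite proportions.
sample_a = [90.0, 10.0]
sample_b = [10.0, 90.0]
print(jsd_sqrt(sample_a, sample_b))
```

Under these assumptions, identical relative abundances give 0 and samples that express entirely different isoforms give 1, regardless of the total expression level of the gene.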
Using simulated RNA-seq data consisting of four technical replicates of two samples with varying transcription between genes, we show that (i) the FDM is highly correlated with JSD* (r = 0.82) when average RNA-seq coverage of the transcripts is sufficiently deep; and (ii) the FDM is able to identify 90% of genes with differential transcription when JSD* > 0.28 and coverage > 7. This represents higher sensitivity than Cufflinks (without annotations) and rDiff (MMD), which identified 69% and 49%, respectively, of the genes in this region as differentially transcribed. Using annotations identifying the transcripts, Cufflinks was able to identify 86% of the genes in this region as differentially transcribed. Using experimental data consisting of four replicates each for two cancer cell lines (MCF7 and SUM102), FDM identified 1425 genes as significantly different in transcription. Subsequent study of the samples using quantitative real-time polymerase chain reaction (qRT-PCR) at several differential transcription sites identified by FDM confirmed significant differences at these sites.
AVAILABILITY: http://csbio-linux001.cs.unc.edu/nextgen/software/FDM
CONTACT: darshan@email.unc.edu
Supplementary data are available at Bioinformatics online.