A model based criterion for gene expression calls using RNA-seq data.

Author information

Wagner Günter P, Kin Koryu, Lynch Vincent J

Affiliation

Yale Systems Biology Institute, 300 Heffernan Drive, West Haven, CT 06516, USA.

Publication information

Theory Biosci. 2013 Sep;132(3):159-64. doi: 10.1007/s12064-013-0178-3. Epub 2013 Apr 25.

DOI:10.1007/s12064-013-0178-3

PMID:23615947

Abstract

The power of deep sequencing technology to reliably detect single RNA reads leads to a paradoxical problem of high sensitivity. In hybridization- or PCR-based methods for RNA quantification, the concern is low sensitivity, i.e., the signal from truly expressed genes might not be distinguishable from noise. In contrast, the problem with RNA-seq is that it is not clear whether very low read counts come from lowly expressed genes or merely from transcriptional noise. The frequency distribution of read counts does not show a clear separation into two classes of genes, which makes the decision whether a gene is to be considered expressed or not seemingly arbitrary. Here we address this problem by suggesting a statistical model that considers the number of transcripts detected in an RNA-seq study as a mixture of two distributions: an exponential distribution for transcripts from inactive genes and a negative binomial distribution for transcripts from actively transcribed genes. We apply this model to a number of RNA-seq data sets and find that the model fits the data very well. The calculated criterion for distinguishing between expressed and non-expressed genes is remarkably consistent among data sets, suggesting that genes with more than two transcripts per million transcripts (TPM) are highly likely to be actively transcribed. This criterion is consistent with the criterion of 1 RPKM proposed by Hebenstreit et al., Mol Sys Biol 7:497 (2011), based on chromatin modification and per-cell RNA expression data. Hence, the regression model correctly identifies the class of genes that are not actively expressed and thus provides an operational criterion for classifying genes into expressed and non-expressed sets, facilitating the interpretation of RNA-seq data.
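
The two-component mixture described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes integer transcript counts, replaces the exponential component with its discretized form (the probability mass the exponential places on each unit interval), and uses made-up parameter values (`weight_inactive`, `rate`, `r`, `p`) chosen only for demonstration. A real analysis would estimate these parameters from the data. All function names are hypothetical.

```python
import math


def inactive_pmf(k, rate):
    """Discretized exponential: mass of Exp(rate) on the interval [k, k + 1).

    Assumption of this sketch: counts are integers, so the continuous
    exponential component is binned to unit intervals.
    """
    return math.exp(-rate * k) * (1.0 - math.exp(-rate))


def active_pmf(k, r, p):
    """Negative binomial pmf with size r > 0 and success probability p.

    P(K = k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k,
    evaluated in log space for numerical stability.
    """
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def posterior_active(k, weight_inactive, rate, r, p):
    """Posterior probability that a count k comes from the active component."""
    inactive = weight_inactive * inactive_pmf(k, rate)
    active = (1.0 - weight_inactive) * active_pmf(k, r, p)
    return active / (inactive + active)


def expression_threshold(weight_inactive, rate, r, p, cutoff=0.5, max_k=1000):
    """Smallest count whose posterior of being 'active' reaches the cutoff.

    Returns None if no count up to max_k reaches the cutoff.
    """
    for k in range(max_k + 1):
        if posterior_active(k, weight_inactive, rate, r, p) >= cutoff:
            return k
    return None


# Illustrative parameters only; these are NOT values reported in the paper.
PARAMS = dict(weight_inactive=0.4, rate=1.0, r=0.5, p=0.01)

if __name__ == "__main__":
    for count in range(6):
        print(count, round(posterior_active(count, **PARAMS), 3))
    print("threshold:", expression_threshold(**PARAMS))
```

The threshold returned here depends entirely on the illustrative parameters and on the chosen posterior cutoff, so it should not be read as a reproduction of the two-TPM criterion reported in the abstract.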
