Tauber Stefanie, von Haeseler Arndt
Center for Integrative Bioinformatics, Max F Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria.
Stat Appl Genet Mol Biol. 2013 Apr 16;12(2):175-88. doi: 10.1515/sagmb-2012-0049.
How deep is deep enough? Although RNA-sequencing is a well-established technology, the sequencing depth required to detect all expressed genes is not known. Stripped of its biological context and meta-information, RNA-sequencing is a classical sampling process. Such sampling processes are well known from population genetics and have been thoroughly investigated. Here we use the Pitman Sampling Formula to model the sampling process of RNA-sequencing. This lets us characterize the sampling by two parameters that capture the combined effect of different sequencing technologies, protocols and their associated biases. We distinguish between two levels of sampling: the number of reads per gene, and the number of reads starting at each position of a specific gene. The second level allows us to evaluate the theoretical expectation of uniform coverage and how well sequencing protocols meet it. Most importantly, given a pilot sequencing experiment, we provide an estimate of the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.