Benidt Sam, Nettleton Dan
Department of Statistics, Iowa State University, Ames, IA 50011-1210, USA.
Bioinformatics. 2015 Jul 1;31(13):2131-40. doi: 10.1093/bioinformatics/btv124. Epub 2015 Feb 26.
RNA sequencing analysis methods are often derived by relying on hypothetical parametric models for read counts that are not likely to be precisely satisfied in practice. Methods are often tested by analyzing data that have been simulated according to the assumed model. This testing strategy can result in an overly optimistic view of the performance of an RNA-seq analysis method.
We develop a data-based simulation algorithm for RNA-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source RNA-seq dataset provided by the user. We conduct two simulation experiments, one based on the negative binomial distribution and one based on our proposed nonparametric simulation algorithm, and compare the performance of a small subset of the statistical methods for RNA-seq analysis available in the literature across the two. Our benchmark is the ability of a method to control the false discovery rate. Not surprisingly, methods based on parametric modeling assumptions seem to control the false discovery rate better when data are simulated from parametric models than when they are simulated with our more realistic nonparametric strategy.
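The general idea of data-based simulation and of benchmarking on false discovery rate control can be illustrated with a minimal sketch. This is not the SimSeq algorithm itself, whose details are not given in this abstract; it assumes only that simulated experimental units are drawn as whole columns from a genes-by-units source count matrix, so that each simulated unit keeps the gene-to-gene dependence of the source data. The function names, the toy counts, and the use of the Benjamini-Hochberg procedure for the benchmark are assumptions made for illustration.

```python
import random


def simulate_from_source(source_counts, n_units, seed=None):
    """Draw n_units whole columns, without replacement, from a
    genes-by-units count matrix (hypothetical helper, not SimSeq)."""
    n_source = len(source_counts[0])
    if n_units > n_source:
        raise ValueError("cannot draw more units than the source provides")
    rng = random.Random(seed)
    chosen = rng.sample(range(n_source), n_units)
    # Keeping each column intact preserves the joint distribution
    # of counts across genes within an experimental unit.
    return [[row[j] for j in chosen] for row in source_counts]


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of indices rejected by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # largest rank whose p-value is under its threshold
    return set(order[:k])


def false_discovery_proportion(rejected, true_nulls):
    """Fraction of rejected genes that are truly null (0.0 if none rejected)."""
    if not rejected:
        return 0.0
    return len(rejected & true_nulls) / len(rejected)


# Toy source data: 3 genes (rows) by 4 experimental units (columns).
source = [
    [10, 12, 9, 11],
    [0, 1, 0, 2],
    [100, 90, 120, 95],
]
simulated = simulate_from_source(source, n_units=2, seed=1)

# Toy p-values from some analysis method; genes 2 and 3 are truly null.
rejected = benjamini_hochberg([0.001, 0.02, 0.03, 0.8], alpha=0.05)
fdp = false_discovery_proportion(rejected, true_nulls={2, 3})
```

Averaging the false discovery proportion over many simulated datasets estimates the false discovery rate that a method actually achieves, which can then be compared with the nominal level `alpha`.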
The nonparametric simulation algorithm developed in this article is implemented in the R package SimSeq, which is freely available under the GNU General Public License (version 2 or later) from the Comprehensive R Archive Network (http://cran.r-project.org/).
Supplementary data are available at Bioinformatics online.