Large scale maximum average power multiple inference on time-course count data with application to RNA-seq analysis.

Affiliations

Department of Statistics, Colorado State University, Fort Collins, Colorado.

Department of Biology, Colorado State University, Fort Collins, Colorado.

Publication information

Biometrics. 2020 Mar;76(1):9-22. doi: 10.1111/biom.13144. Epub 2019 Nov 14.

DOI:10.1111/biom.13144

PMID:31483480

Abstract

Experiments that longitudinally collect RNA sequencing (RNA-seq) data can provide transformative insights in biology research by revealing the dynamic patterns of genes. Such experiments create a great demand for new analytic approaches to identify differentially expressed (DE) genes based on large-scale time-course count data. Existing methods, however, are suboptimal with respect to power and may lack theoretical justification. Furthermore, most existing tests are designed to distinguish among conditions based on overall differential patterns across time, though in practice, a variety of composite hypotheses are of more scientific interest. Finally, some current methods may fail to control the false discovery rate. In this paper, we propose a new model and testing procedure to address the above issues simultaneously. Specifically, conditional on a latent Gaussian mixture with evolving means, we model the data by negative binomial distributions. Motivated by Storey (2007) and Hwang and Liu (2010), we introduce a general testing framework based on the proposed model and show that the proposed test enjoys the optimality property of maximum average power. The test allows not only identification of traditional DE genes but also testing of a variety of composite hypotheses of biological interest. We establish the identifiability of the proposed model, implement the proposed method via efficient algorithms, and demonstrate its good performance via simulation studies. The procedure reveals interesting biological insights, when applied to data from an experiment that examines the effect of varying light environments on the fundamental physiology of the marine diatom Phaeodactylum tricornutum.
