Comparative evaluation of gene set analysis approaches for RNA-Seq data.

Authors

Rahmatallah Yasir, Emmert-Streib Frank, Glazko Galina

Affiliations

Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.

Computational Biology and Machine Learning Laboratory, Center for Cancer Research and Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, 97 Lisburn Road, Belfast, BT9 7BL, UK.

Publication information

BMC Bioinformatics. 2014 Dec 5;15(1):397. doi: 10.1186/s12859-014-0397-8.

Abstract

BACKGROUND

Over the last few years, transcriptome sequencing (RNA-Seq) has almost completely replaced microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes that are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in interpreting the results of RNA-Seq experiments, the limitations of GSA methods developed for microarrays are not well understood in the context of RNA-Seq data.

RESULTS

We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach uses univariate tests designed for RNA-Seq data to obtain gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways without any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the method used for combining P-values, and all of the aforementioned biases are present in the detected pathways.
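To make the gene-level approach concrete, the sketch below combines gene-specific P-values into a single pathway P-value with Fisher's method. Fisher's method is used here only as one well-known classical combining technique; the abstract does not say which techniques were evaluated, and the function name `fisher_combine` is hypothetical. The method assumes the P-values are independent, an assumption that co-expressed genes in a pathway may not meet.

```python
import math


def fisher_combine(p_values):
    """Combine gene-level P-values into one pathway P-value (Fisher's method).

    Illustrative assumption: the P-values are independent.
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")

    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); we work with X / 2.
    half = -sum(math.log(p) for p in p_values)

    # Under the null, X follows a chi-square distribution with 2k degrees
    # of freedom. For even degrees of freedom the upper tail has a closed form:
    #   P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


if __name__ == "__main__":
    # Hypothetical gene-level P-values for a three-gene pathway.
    print(fisher_combine([0.01, 0.20, 0.03]))
```

The closed form avoids a dependency on a statistics library, but it is not numerically robust for very large pathways with extremely small P-values, where a log-space implementation would be safer.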

CONCLUSIONS

Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways in RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates on both simulated and real data.
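As a rough illustration of what a self-contained non-parametric multivariate test looks like, the sketch below permutes sample labels and uses the Euclidean distance between the two group mean vectors as the pathway statistic. This is a generic permutation test chosen for simplicity, not one of the specific tests evaluated in the paper; the function names and the choice of statistic are assumptions, and the input is taken to be already normalized expression values (rows are samples, columns are the genes of one pathway).

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length sample vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_distance(group_a, group_b):
    """Euclidean distance between the mean vectors of two sample groups."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    return sum((x - y) ** 2 for x, y in zip(ma, mb)) ** 0.5


def permutation_pathway_test(group_a, group_b, n_perm=999, seed=0):
    """Permutation P-value for a difference in pathway mean vectors.

    The null hypothesis is self-contained: the pathway's genes are not
    associated with the phenotype, so sample labels are exchangeable.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign samples to the two groups at random
        if mean_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimated P-value strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical two-gene pathway, four samples per condition.
    control = [[0, 0], [1, 0], [0, 1], [1, 1]]
    treated = [[10, 10], [11, 10], [10, 11], [11, 11]]
    print(permutation_pathway_test(control, treated))
```

Because whole sample vectors are permuted, the correlation between genes within the pathway is preserved under the null, which is the property that distinguishes this kind of test from combining independent gene-level P-values.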

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/177a/4265362/028f99a6e0dd/12859_2014_397_Fig1_HTML.jpg
