Validation of differential gene expression algorithms: application comparing fold-change estimation to hypothesis testing.

Affiliation

Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology, and Immunology, University of Ottawa, Ottawa, Ontario, Canada.

Publication information

BMC Bioinformatics. 2010 Jan 28;11:63. doi: 10.1186/1471-2105-11-63.

DOI:10.1186/1471-2105-11-63

PMID:20109217

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3224549/

Abstract

BACKGROUND

Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists has been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but because it requires examining gene lists of given sizes, it may be unstable.

RESULTS

Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. The two comparison methodologies apply to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria. (The latter two approaches have comparable performance.) The posterior predictive assessment corroborates these findings.
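
The cross-validation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three estimators (`mean_estimator`, `t_threshold_estimator`, `shrinkage_estimator`), the simulator `simulate_log_ratios`, and all parameter values are hypothetical stand-ins chosen for illustration. The sketch assumes a genes-by-replicates matrix of log expression ratios, holds out one replicate at a time, estimates each gene's log ratio from the remaining replicates, and scores the estimate by squared error against the held-out values.

```python
import numpy as np


def simulate_log_ratios(n_genes=2000, n_reps=6, frac_de=0.1,
                        effect_sd=1.0, noise_sd=1.0, seed=0):
    """Hypothetical data: most genes have a true log ratio of zero."""
    rng = np.random.default_rng(seed)
    is_de = rng.random(n_genes) < frac_de
    effects = np.where(is_de, rng.normal(0.0, effect_sd, n_genes), 0.0)
    noise = rng.normal(0.0, noise_sd, (n_genes, n_reps))
    return effects[:, None] + noise


def mean_estimator(train):
    """Plain fold-change estimate: the per-gene mean log ratio."""
    return train.mean(axis=1)


def t_threshold_estimator(train, cutoff=3.0):
    """Testing-style rule: keep the mean only if |t| exceeds a cutoff."""
    n = train.shape[1]
    m = train.mean(axis=1)
    se = train.std(axis=1, ddof=1) / np.sqrt(n)
    t = m / np.maximum(se, 1e-12)
    return np.where(np.abs(t) > cutoff, m, 0.0)


def shrinkage_estimator(train):
    """Simple normal-normal shrinkage toward zero (an assumed stand-in
    for hierarchical empirical Bayes methods, not any specific one)."""
    n = train.shape[1]
    m = train.mean(axis=1)
    noise_var = train.var(axis=1, ddof=1).mean() / n  # variance of a gene mean
    signal_var = max(np.mean(m ** 2) - noise_var, 0.0)  # method of moments
    total = signal_var + noise_var
    weight = signal_var / total if total > 0 else 0.0
    return weight * m


def loo_prediction_error(data, estimator):
    """Leave-one-replicate-out mean squared prediction error."""
    n_reps = data.shape[1]
    errors = []
    for j in range(n_reps):
        train = np.delete(data, j, axis=1)  # all replicates except j
        predicted = estimator(train)
        errors.append(np.mean((data[:, j] - predicted) ** 2))
    return float(np.mean(errors))


if __name__ == "__main__":
    data = simulate_log_ratios()
    for name, est in [("mean", mean_estimator),
                      ("t-threshold", t_threshold_estimator),
                      ("shrinkage", shrinkage_estimator)]:
        print(f"{name}: {loo_prediction_error(data, est):.4f}")
```

Because each estimator is scored on replicates it never saw, the comparison does not require knowing which genes are truly differentially expressed. With few replicates the held-out estimate is noisy, which is the limitation the posterior predictive method is meant to address.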

CONCLUSIONS

Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups. According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.
