Donald Margaret R, Wilson Susan R
Stat Appl Genet Mol Biol. 2017 Mar 1;16(1):31-45. doi: 10.1515/sagmb-2016-0036.
The output from the analysis of a high-throughput 'omics' experiment is very often a ranked list. A commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, many hundreds of genes long. In many situations, interest lies in comparing the outputs of two (or more) different experiments, or of different analysis approaches that produce different ranked lists. Rather than requiring exact agreement between the rankings, we follow others and consider two ranked lists to be in agreement if the rankings differ by no more than some fixed distance. Generally, only a relatively small subset of the k top-ranked items will be in agreement, so the aim is to find the point k at which the probability of agreement in rankings changes from being greater than 0.5 to being less than 0.5. We use penalized splines and a Bayesian logit model to give a nonparametric smooth of the sequence of agreements, as well as pointwise credible intervals for the probability of agreement. Our approach produces a point estimate and a credible interval for k. R code is provided. The method is applied to rankings of genes from breast cancer microarray experiments.
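The idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' R code. It swaps the penalized-spline Bayesian logit smoother for a plain centered moving average, so it yields only a rough point estimate of k and no credible interval. The function names, the window width, the distance `delta`, and the toy rankings are all assumptions made for illustration.

```python
def agreement_indicators(rank_a, rank_b, delta):
    """Return a 0/1 sequence, ordered by rank in list A.

    rank_a, rank_b: dicts mapping item -> rank (1 = top-ranked).
    An item counts as "in agreement" when its two ranks differ by at most delta.
    """
    items = sorted(rank_a, key=rank_a.get)
    return [1 if abs(rank_a[i] - rank_b[i]) <= delta else 0 for i in items]


def moving_average(seq, window):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    out = []
    for j in range(len(seq)):
        lo, hi = max(0, j - half), min(len(seq), j + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out


def estimate_k(indicators, window=11, threshold=0.5):
    """Count the leading positions before the smoothed agreement
    proportion first drops below the threshold."""
    smooth = moving_average(indicators, window)
    for j, p in enumerate(smooth):
        if p < threshold:
            return j
    return len(indicators)


# Toy data (hypothetical): 100 items, the top 30 ranked identically in
# both lists, the remaining 70 ranked in reverse order in list B.
rank_a = {f"g{i}": i + 1 for i in range(100)}
rank_b = {f"g{i}": (i + 1 if i < 30 else 130 - i) for i in range(100)}

indicators = agreement_indicators(rank_a, rank_b, delta=5)
k_hat = estimate_k(indicators)
print(k_hat)  # prints 30
```

In the toy data the two lists agree exactly on the first 30 items and disagree afterwards, apart from a short accidental run of agreement near the middle of the reversed block. The crude smoother recovers k = 30. A model-based smoother such as the one described in the abstract would additionally quantify the uncertainty in that estimate.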