Morris Jeffrey S, Baggerly Keith A, Coombes Kevin R
Department of Biostatistics, University of Texas, M. D. Anderson Cancer Center, 1515 Holcombe Blvd., Box 447, Houston, Texas 77030-4009, USA.
Biometrics. 2003 Sep;59(3):476-86. doi: 10.1111/1541-0420.00057.
Serial analysis of gene expression (SAGE) is a technology for quantifying gene expression in biological tissue. It yields count data that can be modeled by a multinomial distribution with two characteristics: skewness in the relative frequencies and small sample size relative to the dimension. As a result of these characteristics, a given SAGE sample may fail to capture a large number of expressed mRNA species present in the tissue. Empirical estimators of mRNA species' relative abundance effectively ignore these missing species, and as a result tend to overestimate the abundance of the scarce observed species, which make up the vast majority of the total. We have developed a new Bayesian estimation procedure that quantifies our prior information about these characteristics, yielding a nonlinear shrinkage estimator with efficiency advantages over the maximum likelihood estimator (MLE). Our prior is a mixture of Dirichlets, whereby species are stochastically partitioned into abundant and scarce classes, each with its own multivariate prior. Simulation studies reveal that our estimator has lower integrated mean squared error (IMSE) than the MLE for the SAGE scenarios simulated, and yields relative abundance profiles closer in Euclidean distance to the truth for all samples simulated. We apply our method to a SAGE library of normal colon tissue, and discuss its implications for assessing differential expression.
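The overestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' mixture-of-Dirichlets estimator; it only draws one multinomial sample from a skewed two-class abundance profile and compares the empirical estimate (the MLE) with the truth. The function names and every numeric setting (50 abundant species, 5000 scarce species, 2000 sampled tags, a 40% abundant share) are hypothetical values chosen for illustration and are not taken from the paper.

```python
import random


def simulate_sage(n_abundant=50, n_scarce=5000, abundant_share=0.4,
                  n_tags=2000, seed=0):
    """Draw one multinomial sample from a skewed two-class profile.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # A few abundant species share `abundant_share` of the total mass;
    # many scarce species split the remainder evenly.
    truth = ([abundant_share / n_abundant] * n_abundant
             + [(1 - abundant_share) / n_scarce] * n_scarce)
    # Sampling n_tags species with replacement is one multinomial draw.
    draws = rng.choices(range(len(truth)), weights=truth, k=n_tags)
    counts = [0] * len(truth)
    for i in draws:
        counts[i] += 1
    return truth, counts


def mle(counts):
    """Empirical relative abundances: observed count over total count."""
    total = sum(counts)
    return [c / total for c in counts]


def summarize(truth, counts, n_abundant):
    """Compare true and estimated mass on the scarce species that were seen."""
    est = mle(counts)
    observed_scarce = [i for i in range(n_abundant, len(truth))
                       if counts[i] > 0]
    return {
        "unseen_species": sum(1 for c in counts if c == 0),
        "true_mass_observed_scarce": sum(truth[i] for i in observed_scarce),
        "mle_mass_observed_scarce": sum(est[i] for i in observed_scarce),
    }


if __name__ == "__main__":
    truth, counts = simulate_sage()
    for name, value in summarize(truth, counts, n_abundant=50).items():
        print(name, value)
```

Because the MLE assigns zero mass to every unseen species, the mass that truly belongs to them is redistributed over the observed ones. In this setup each observed scarce species receives an estimate of at least 1/2000, which is larger than its true abundance of 0.6/5000, so the estimated mass on the observed scarce species exceeds their true mass. A shrinkage estimator of the kind the abstract describes is meant to counteract this bias.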