Attoor Sanju, Dougherty Edward R, Chen Yidong, Bittner Michael L, Trent Jeffrey M
Department of Electrical Engineering, Texas A&M University, College Station, TX 78041, USA.
Bioinformatics. 2004 Nov 1;20(16):2513-20. doi: 10.1093/bioinformatics/bth272. Epub 2004 Sep 28.
There are two general methods for making gene-expression microarrays: one is to hybridize a single test set of labeled targets to the probe, and measure the background-subtracted intensity at each probe site; the other is to hybridize both a test and a reference set of differentially labeled targets to a single detector array, and measure the ratio of the background-subtracted intensities at each probe site. Which method is better depends on the variability in the cell system and the random factors resulting from the microarray technology. It also depends on the purpose for which the microarray is being used. Classification is a fundamental application and it is the one considered here.
This paper describes a model-based simulation paradigm that compares the classification accuracy provided by these two methods over a variety of noise types, and presents the results of a study modeled on noise typical of cDNA microarray data. The model consists of four parts: (1) the measurement equation for genes in the reference state; (2) the measurement equation for genes in the test state; (3) the ratio and normalization procedure for a dual-channel system; and (4) the intensity and normalization procedure for a single-channel system. In the reference state, the mean intensities are modeled as a shifted exponential distribution, and the intensity for a particular gene is modeled via a normal distribution, Normal(I, alphaI), about its mean intensity I, where alpha is the coefficient of variation of the cell system (so the standard deviation is alpha times I). In the test state, some genes have their intensities up-regulated by a random factor. The model includes several random factors affecting intensity measurement: deposition gain d, labeling gain, and post-image-processing residual noise. The key conclusion of the study is that alpha, the coefficient of variation governing the randomness of the intensities, and the deposition gain d are the most important factors in determining whether a single-channel or a dual-channel system provides superior classification, and that the decision region in the alpha-d plane is approximately linear.
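The intensity model described above can be illustrated with a minimal sketch. It is not the authors' simulation: the function names, the uniform distributions used for the up-regulation factor and for the deposition gain, and all numeric defaults are assumptions chosen for illustration, and the labeling gain, residual noise, and normalization steps are omitted. The sketch also assumes that the deposition gain of a spot multiplies both channels equally, so that it cancels in the dual-channel ratio but remains in the single-channel intensity.

```python
import random


def sample_intensities(means, alpha, rng):
    """Draw one intensity per gene from Normal(I, alpha * I), clipped above zero."""
    return [max(rng.gauss(m, alpha * m), 1e-9) for m in means]


def simulate_array(n_genes=200, alpha=0.1, d_spread=0.2, frac_up=0.1, seed=0):
    """Simulate one array and return (up_flags, single_channel, dual_channel_ratio).

    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)

    # Reference-state mean intensities: shifted exponential (shift and scale assumed).
    shift, scale = 100.0, 3000.0
    ref_means = [shift + rng.expovariate(1.0 / scale) for _ in range(n_genes)]

    # Test state: a random subset of genes is up-regulated by a random factor
    # (assumed uniform on [1.5, 3.0]).
    up_flags = [rng.random() < frac_up for _ in range(n_genes)]
    test_means = [m * rng.uniform(1.5, 3.0) if up else m
                  for m, up in zip(ref_means, up_flags)]

    # Biological variability governed by the coefficient of variation alpha.
    ref = sample_intensities(ref_means, alpha, rng)
    test = sample_intensities(test_means, alpha, rng)

    # Per-spot deposition gain (assumed uniform around 1), shared by both channels.
    gain = [rng.uniform(1.0 - d_spread, 1.0 + d_spread) for _ in range(n_genes)]

    single_channel = [g * t for g, t in zip(gain, test)]
    dual_channel_ratio = [(g * t) / (g * r) for g, t, r in zip(gain, test, ref)]
    return up_flags, single_channel, dual_channel_ratio


if __name__ == "__main__":
    flags, single, ratio = simulate_array()
    print(sum(flags), "up-regulated genes out of", len(flags))
```

Under these assumptions, increasing `d_spread` adds noise only to the single-channel measurements, while increasing `alpha` adds noise to both and enters the ratio twice (through the test and reference channels), which is one way to see why these two parameters trade off against each other.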