Parodi Stefano, Pistoia Vito, Muselli Marco
Epidemiology and Biostatistics Section, Scientific Directorate, G. Gaslini Children's Hospital, Genoa, Italy.
BMC Bioinformatics. 2008 Oct 3;9:410. doi: 10.1186/1471-2105-9-410.
Most microarray experiments are carried out to identify genes whose expression varies in relation to specific conditions or in response to environmental stimuli. In such studies, genes showing similar mean expression values between two or more groups are considered not differentially expressed, even though hidden subclasses with different expression values may exist. In this paper we propose a new method for identifying differentially expressed genes, based on the area between the ROC curve and the rising diagonal (ABCR). ABCR is a more general approach than the standard area under the ROC curve (AUC), because it can identify both proper (i.e., concave) and not proper ROC curves (NPRC). In particular, NPRC may correspond to genes that tend to escape standard selection methods.
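The contrast between AUC and ABCR can be illustrated with a minimal sketch. It assumes that ABCR is the integral of |ROC(x) - x| over the false positive rate x, computed on the empirical, piecewise-linear ROC curve; the function names (`roc_points`, `auc`, `abcr`) and the toy expression values are hypothetical and are not taken from the paper.

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC points (FPR, TPR), starting at (0, 0) and ending at (1, 1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    fpr, tpr = [0.0], [0.0]
    # Sweep thresholds from the highest to the lowest observed value.
    for t in np.unique(scores)[::-1]:
        called = scores >= t
        tpr.append(called[labels].mean())
        fpr.append(called[~labels].mean())
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def abcr(fpr, tpr):
    """Area between the ROC curve and the rising diagonal (assumed: integral of |ROC(x) - x|)."""
    d = tpr - fpr  # signed vertical distance from the diagonal at each ROC point
    total = 0.0
    for dx, d0, d1 in zip(np.diff(fpr), d[:-1], d[1:]):
        if d0 * d1 >= 0:
            # The segment stays on one side of the diagonal: plain trapezoid.
            total += dx * (abs(d0) + abs(d1)) / 2.0
        else:
            # The segment crosses the diagonal: sum of two triangles.
            total += dx * (d0 ** 2 + d1 ** 2) / (2.0 * (abs(d0) + abs(d1)))
    return float(total)

# Toy gene (hypothetical values): one class hides two subclasses (low and high),
# while the other class sits in between, so the class means are identical.
class_a = [0, 0, 0, 10, 10, 10]
class_b = [5, 5, 5, 5, 5, 5]
scores = class_a + class_b
labels = [1] * len(class_a) + [0] * len(class_b)

fpr, tpr = roc_points(scores, labels)
print("AUC  =", auc(fpr, tpr))   # 0.5: indistinguishable from chance
print("ABCR =", abcr(fpr, tpr))  # 0.25: half of the maximum attainable value of 0.5
```

In this toy case the ROC curve crosses the diagonal, so the areas above and below it cancel in the AUC but add up in the ABCR. Under the assumed definition, ABCR ranges from 0 (curve lying on the diagonal) to 0.5 (perfect separation).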
We assessed the performance of our method using data from a publicly available database of 4026 genes, including 14 normal B cell samples (NBC) and 20 heterogeneous lymphomas (namely, 9 follicular lymphomas and 11 chronic lymphocytic leukemias). Moreover, NBC also included two subclasses, i.e., 6 heavily stimulated and 8 slightly or not stimulated samples. We identified 1607 differentially expressed genes with an estimated False Discovery Rate of 15%. Among them, 16 corresponded to NPRC, and all of these escaped standard selection procedures based on AUC and t statistics. Moreover, a simple inspection of the shape of such plots allowed the two subclasses in either class to be identified in 13 cases (81%).
NPRC represent a new useful tool for the analysis of microarray data.