Sorace James M, Zhan Min
Department of Pathology and Laboratory Services, Veterans Administration Maryland Health Care System, Baltimore 21201, USA.
BMC Bioinformatics. 2003 Jun 9;4:24. doi: 10.1186/1471-2105-4-24.
The early detection of ovarian cancer has the potential to dramatically reduce mortality. Recently, the use of mass spectrometry to develop profiles of patient serum proteins, combined with advanced data mining algorithms, has been reported as a promising method to achieve this goal. In this report, we analyze the Ovarian Dataset 8-7-02, downloaded from the Clinical Proteomics Program Databank website, using nonparametric statistics and stepwise discriminant analysis to develop rules to diagnose patients, as well as to understand general patterns in the data that may guide future research.
The mass spectrometry serum profiles derived from cancer and control subjects exhibited numerous statistical differences. For example, a Wilcoxon test comparing the intensity at each of the 15,154 mass-to-charge (M/Z) values between cancer and control subjects detected 3,591 M/Z values whose intensities differed with a p-value of 10^-6 or less. The M/Z values of greatest statistical difference between cancer and controls occurred in the region below M/Z 500. For example, the M/Z values of 2.7921478 and 245.53704 could be used to significantly separate the cancer group from the control group. Three other sets of M/Z values, developed using a training set, could distinguish between cancer and control subjects in a test set with 100% sensitivity and specificity.
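The per-M/Z screening described above can be illustrated with a minimal sketch. This is not the authors' code. It assumes a two-sided Wilcoxon rank-sum test with a tie-corrected normal approximation, applied column by column to synthetic intensities. The function names `rank_sum_p` and `screen`, the group sizes, and the simulated shift are all hypothetical choices made for illustration.

```python
import math
import random


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at sorted positions i+1..j share the average rank.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (rank_sum_x - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def screen(cancer, control, alpha=1e-6):
    """Return column indices whose rank-sum p-value is at most alpha."""
    hits = []
    for j in range(len(cancer[0])):
        p = rank_sum_p([row[j] for row in cancer], [row[j] for row in control])
        if p <= alpha:
            hits.append(j)
    return hits


# Synthetic stand-in for the spectra: 40 subjects per group, 20 M/Z columns,
# with one column (index 3) artificially shifted in the "cancer" group.
rng = random.Random(0)
cancer = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(40)]
control = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(40)]
for row in cancer:
    row[3] += 6.0

print(screen(cancer, control))
```

With thousands of M/Z values tested at once, a strict threshold such as 10^-6 limits chance findings, but as the abstract goes on to argue, a very small p-value alone does not show that a difference is biological rather than an experimental artifact.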
The ability to discriminate between cancer and control subjects based on the M/Z values of 2.7921478 and 245.53704 reveals the existence of a significant non-biologic experimental bias between these two groups. This bias may invalidate attempts to use this dataset to find patterns of reproducible diagnostic value. To minimize false discovery, results using mass spectrometry and data mining algorithms should be carefully reviewed and benchmarked with routine statistical methods.