Li Lihua, Tang Hong, Wu Zuobao, Gong Jianli, Gruidl Michael, Zou Jun, Tockman Melvyn, Clark Robert A
Department of Radiology, College of Medicine, H. Lee Moffitt Cancer Center and Research Institute, University of South Florida, Tampa, FL 33612-4799, USA.
Artif Intell Med. 2004 Oct;32(2):71-83. doi: 10.1016/j.artmed.2004.03.006.
Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. It is possible that unique serum proteomic patterns could be used to discriminate cancer samples from non-cancer ones. Due to the complexity of proteomic profiling, a higher-order analysis such as data mining is needed to uncover the differences in complex proteomic patterns. The objectives of this paper are (1) to briefly review the application of data mining techniques in proteomics for cancer detection/diagnosis; (2) to explore a novel analytic method with different feature selection methods; and (3) to compare the results obtained on different data sets with those reported by Petricoin et al. in terms of detection performance and selected proteomic patterns.
Three serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. A support vector machine-based method was applied, with feature selection performed either by statistical testing or by a genetic algorithm-based method. Leave-one-out cross-validation with receiver operating characteristic (ROC) curves was used to evaluate and compare cancer detection performance.
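The evaluation scheme described above (feature selection by statistical testing, a support vector machine classifier, leave-one-out cross-validation, and ROC analysis) can be illustrated with a minimal sketch. Everything specific in it is an assumption rather than the authors' implementation: the abstract does not state which statistical test, SVM kernel, or solver was used, so the sketch uses a Welch t-statistic, a linear SVM trained by sub-gradient descent on the hinge loss, and synthetic data in place of SELDI MS spectra. The function names and parameter values are hypothetical.

```python
import math
import random

def t_scores(X, y):
    """Absolute Welch t-statistic of each feature between the two classes."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == -1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        scores.append(abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-12))
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def standardize(train, test_row):
    """Z-score the columns using training statistics only."""
    cols = list(zip(*train))
    mu = [sum(c) / len(c) for c in cols]
    sd = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
          for c, m in zip(cols, mu)]
    def z(row):
        return [(v - m) / s for v, m, s in zip(row, mu, sd)]
    return [z(r) for r in train], z(test_row)

def fit_linear_svm(X, y, lam=0.01, eta=0.1, epochs=30):
    """Linear SVM via sub-gradient descent on the regularized hinge loss
    (an assumed stand-in; the abstract does not specify the SVM variant)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - eta * lam) for wj in w]  # shrink (L2 penalty)
            if margin < 1:                          # hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def loocv_scores(X, y, k):
    """Decision score for each sample when it is the held-out case.
    Feature selection is repeated inside every fold to avoid leakage."""
    out = []
    for i in range(len(X)):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        keep = top_k(t_scores(Xtr, ytr), k)
        Ztr, z = standardize([[r[j] for j in keep] for r in Xtr],
                             [X[i][j] for j in keep])
        w, b = fit_linear_svm(Ztr, ytr)
        out.append(sum(wj * zj for wj, zj in zip(w, z)) + b)
    return out

def roc_auc(scores, y):
    """Area under the ROC curve via the Mann-Whitney pair count."""
    pos = [s for s, lab in zip(scores, y) if lab == 1]
    neg = [s for s, lab in zip(scores, y) if lab == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: 40 samples, 30 features, features 0 and 1 informative.
rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[rng.gauss(0, 1) + (6.0 if lab == 1 and j < 2 else 0.0) for j in range(30)]
     for lab in y]
scores = loocv_scores(X, y, k=2)
auc = roc_auc(scores, y)
print(f"LOOCV AUC: {auc:.3f}")
```

Repeating the feature ranking inside each fold is a deliberate choice in this sketch: ranking once on the full data would let the held-out sample influence which features are kept and could inflate the estimated performance.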
The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with reasonably high performance; (2) classification using features selected by the genetic algorithm consistently outperformed classification using features selected by statistical testing in terms of accuracy and robustness; and (3) the discriminatory features (proteomic patterns) can be very different from one selection method to another. In other words, the pattern selection and its classification efficiency are highly classifier dependent. Therefore, when using data mining techniques, the discrimination of cancer from normal does not depend solely upon the identity and origin of cancer-related proteins.
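Genetic algorithm-based feature selection of the kind compared above can be sketched as a wrapper search over fixed-size feature subsets. The abstract does not describe the encoding, genetic operators, or fitness function, so all of them are assumptions here: subsets are tuples of feature indices, the fitness is the leave-one-out accuracy of a nearest-centroid classifier (a cheap stand-in for the SVM), and the data are synthetic. The names `loo_centroid_accuracy` and `ga_select` are hypothetical.

```python
import random

def loo_centroid_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`
    (an assumed, inexpensive fitness; not the classifier used in the paper)."""
    correct = 0
    for i in range(len(X)):
        dist = {}
        for lab in (1, -1):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == lab]
            cen = [sum(row[j] for row in rows) / len(rows) for j in feats]
            dist[lab] = sum((X[i][j] - c) ** 2 for j, c in zip(feats, cen))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def ga_select(X, y, k, pop_size=20, generations=10, seed=0):
    """Search for a k-feature subset with high fitness using elitism,
    tournament selection, union crossover, and single-index mutation."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def fitness(ind):
        if ind not in cache:
            cache[ind] = loo_centroid_accuracy(X, y, ind)
        return cache[ind]

    pop = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        nxt = ranked[:2]                                    # elitism
        while len(nxt) < pop_size:
            a = max(rng.sample(ranked, 3), key=fitness)     # tournament
            b = max(rng.sample(ranked, 3), key=fitness)
            child = rng.sample(sorted(set(a) | set(b)), k)  # crossover
            if rng.random() < 0.3:                          # mutation
                unused = [j for j in range(n) if j not in child]
                child[rng.randrange(k)] = rng.choice(unused)
            nxt.append(tuple(sorted(child)))
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)

# Synthetic stand-in data: 30 samples, 20 features, features 0-2 informative.
rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(30)]
X = [[rng.gauss(0, 1) + (6.0 if lab == 1 and j < 3 else 0.0) for j in range(20)]
     for lab in y]
best, best_fitness = ga_select(X, y, k=3)
print("selected features:", best, "fitness:", round(best_fitness, 3))
```

Because the fitness scores whole subsets rather than individual features, a search like this can settle on a subset that differs from a univariate ranking while classifying equally well, which is in line with the observation that the selected patterns vary from one selection method to another.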