Hauskrecht Milos, Pelikan Richard, Malehorn David E, Bigbee William L, Lotze Michael T, Zeh Herbert J, Whitcomb David C, Lyons-Weiler James
Department of Computer Science, University of Pittsburgh, Pittsburgh, Pennsylvania, USA; University of Pittsburgh Cancer Institute, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
Appl Bioinformatics. 2005;4(4):227-46. doi: 10.2165/00822942-200504040-00003.
Proteomic peptide profiling is an emerging technology that is expected to enable early detection, enhance diagnosis and more clearly define the prognosis of many diseases. Although previous research has illustrated the ability of proteomic data to discriminate between cases and controls, significantly less attention has been paid to analysing the feature selection strategies that enable such predictive models to be learned. Feature selection, in addition to classification, plays an important role in the successful identification of proteomic biomarker panels.
We present a new, efficient, multivariate feature selection strategy that extracts useful feature panels directly from the high-throughput spectra. The strategy takes advantage of the characteristics of surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF-MS) profiles and enhances widely used univariate feature selection strategies with a heuristic based on multivariate de-correlation filtering. We analyse and compare two versions of the method: one in which all feature pairs must adhere to a maximum allowed correlation (MAC) threshold, and another in which the feature panel is built greedily by deciding among best univariate features at different MAC levels.
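The first variant described above (every pair of selected features must respect the MAC threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `univariate_scores` and `mac_filter`, the use of an absolute Welch t-statistic as the univariate score, and the use of absolute Pearson correlation are all assumptions made for illustration. The second, greedy multi-level variant is not shown because the abstract does not specify it in enough detail.

```python
import numpy as np

def univariate_scores(X, y):
    """Absolute Welch t-statistic per feature (an assumed scoring criterion)."""
    cases, controls = X[y == 1], X[y == 0]
    diff = cases.mean(axis=0) - controls.mean(axis=0)
    se = np.sqrt(cases.var(axis=0, ddof=1) / len(cases)
                 + controls.var(axis=0, ddof=1) / len(controls))
    return np.abs(diff) / (se + 1e-12)

def mac_filter(X, y, k, mac):
    """Pick up to k features in order of univariate score, skipping any
    feature whose absolute Pearson correlation with an already selected
    feature exceeds the maximum allowed correlation `mac`."""
    order = np.argsort(-univariate_scores(X, y))  # best score first
    # Standardise columns so that Pearson correlation is a scaled dot product.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    n = X.shape[0]
    selected = []
    for j in order:
        if len(selected) == k:
            break
        if selected:
            corr = np.abs(Z[:, selected].T @ Z[:, j]) / n
            if corr.max() > mac:
                continue  # too correlated with a feature already in the panel
        selected.append(int(j))
    return selected
```

Here `X` is assumed to be a samples-by-features intensity matrix and `y` a 0/1 label vector (1 for cases). Lowering `mac` yields a more de-correlated panel at the cost of passing over some high-scoring features, which is the trade-off the two variants in the text explore.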
The feature selection strategies were analysed and compared experimentally on a pancreatic cancer dataset with 57 cancers and 59 controls from the University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, USA. The analysis was conducted in both the whole-profile and peak-only modes. The results clearly show the benefit of the new strategy over univariate feature selection methods in terms of improved classification performance.
Understanding the characteristics of the spectra allows us to better assess the relative importance of potential features in the diagnosis of cancer. Incorporation of these characteristics into feature selection strategies often leads to a more efficient data analysis as well as improved classification performance.