Wang Zhanfeng, Chang Yuan-chin I, Ying Zhiliang, Zhu Liang, Yang Yaning
Department of Statistics and Finance, University of Science and Technology of China, Hefei, 230026, China.
Bioinformatics. 2007 Oct 15;23(20):2788-94. doi: 10.1093/bioinformatics/btm442. Epub 2007 Sep 18.
Protein expression profiling for differences indicative of early cancer holds promise for improving diagnostics. Because proteomic data from mass spectrometers are high-dimensional, their statistical analysis is challenging in many respects, including dimension reduction, feature subset selection, and construction of classification rules. The search for an optimal feature subset, commonly known as the feature subset selection (FSS) problem, is an important step towards disease classification and diagnosis with biomarkers.
We develop a parsimonious threshold-independent feature selection (PTIFS) method based on the area under the receiver operating characteristic (ROC) curve (AUC). To reduce computational complexity to a manageable level, we use a sigmoid approximation to the empirical AUC as the criterion function. Starting from an anchor feature, the PTIFS method selects a feature subset through an iterative updating algorithm. Highly correlated features that have similar discriminating power are precluded from being selected simultaneously. The classification rule is then determined from the resulting feature subset.
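The idea of a sigmoid approximation to the empirical AUC can be illustrated with a minimal sketch. It is not the PTIFS implementation: the function names `empirical_auc` and `sigmoid_auc`, the bandwidth parameter `h`, and the toy scores are all assumptions made for illustration. The empirical AUC counts correctly ordered (case, control) pairs with an indicator function; replacing the indicator with a sigmoid of the score difference gives a smooth criterion that approaches the empirical AUC as `h` shrinks.

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of (case, control) pairs in which the case
    scores higher than the control; ties count as one half."""
    diff = np.asarray(scores_pos, float)[:, None] - np.asarray(scores_neg, float)[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def sigmoid_auc(scores_pos, scores_neg, h=0.1):
    """Smooth surrogate: the indicator 1{diff > 0} is replaced by
    sigmoid(diff / h). The bandwidth h is a hypothetical tuning parameter."""
    diff = np.asarray(scores_pos, float)[:, None] - np.asarray(scores_neg, float)[None, :]
    # sigmoid(x) = 0.5 * (1 + tanh(x / 2)), written this way to avoid overflow
    return float(np.mean(0.5 * (1.0 + np.tanh(diff / (2.0 * h)))))

# Toy scores (illustrative only, not data from the paper)
cases = [2.0, 3.0, 4.0]
controls = [0.0, 1.0, 2.5]
print(empirical_auc(cases, controls))
print(sigmoid_auc(cases, controls, h=0.01))
print(sigmoid_auc(cases, controls, h=1.0))
```

Because the surrogate is differentiable in the scores, it can be optimized with gradient-based updates, which the step-shaped empirical AUC does not allow directly.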
The performance of the proposed approach is investigated through extensive simulation studies and by applying the method to two mass spectrometry data sets, one of prostate cancer and one of liver cancer. We compare the new approach with the threshold gradient descent regularization (TGDR) method. The results show that our method achieves disease classification performance comparable to that of the TGDR method while selecting fewer features.
Supplementary Material and the PTIFS implementations are available at http://staff.ustc.edu.cn/~ynyang/PTIFS.
Supplementary data are available at Bioinformatics online.