Abdel-Aal R E
Physics Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.
J Biomed Inform. 2005 Dec;38(6):456-68. doi: 10.1016/j.jbi.2005.03.003. Epub 2005 Apr 16.
Medical applications are often characterized by a large number of disease markers and a relatively small number of data records. We demonstrate that complete feature ranking followed by selection can lead to appreciable reductions in data dimensionality, with significant improvements in the implementation and performance of classifiers for medical diagnosis. We describe a novel approach for ranking all features according to their predictive quality using properties unique to learning algorithms based on the group method of data handling (GMDH). An abductive network training algorithm is used repeatedly to select groups of optimum predictors from the feature set at gradually increasing levels of model complexity specified by the user. Groups selected earlier are better predictors. The process is then repeated to rank features within individual groups. The resulting full feature ranking can be used to determine the optimum feature subset by starting at the top of the list and progressively including more features until the classification error rate on an out-of-sample evaluation set starts to increase due to overfitting. The approach is demonstrated on two medical diagnosis datasets (breast cancer and heart disease), and comparisons are made with other feature ranking and selection methods. Receiver operating characteristic (ROC) analysis is used to compare classifier performance. At default model complexity, dimensionality reductions of 22% and 54% could be achieved for the breast cancer and heart disease data, respectively, leading to improvements in the overall classification performance. For both datasets, considerable dimensionality reduction introduced no significant reduction in the area under the ROC curve. The feature subsets selected by the GMDH-based method also proved effective when used with neural network classifiers.
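The subset-selection step described in the abstract (walk down the ranked list and keep adding features until the out-of-sample error starts to rise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the ranking is taken as given rather than produced by a GMDH or abductive network, and `select_from_ranking`, `toy_error`, and the error values are hypothetical names and numbers invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def select_from_ranking(
    ranking: Sequence[int],
    error_fn: Callable[[List[int]], float],
) -> Tuple[List[int], float]:
    """Add ranked features one at a time, best first.

    Stops as soon as the evaluation error of the enlarged subset is
    higher than the error of the current subset, and returns the
    current subset together with its error.
    """
    selected: List[int] = []
    best_error = float("inf")
    for feature in ranking:
        candidate = selected + [feature]
        error = error_fn(candidate)
        if error > best_error:
            # Error started to increase: treat this as overfitting and stop.
            break
        selected = candidate
        best_error = error
    return selected, best_error


# Hypothetical ranking of five feature indices, best predictor first.
ranking = [3, 0, 4, 1, 2]

# Toy stand-in for "train a classifier on this subset and measure its
# error rate on the out-of-sample evaluation set". The values are made
# up for illustration and depend only on the subset size.
TOY_ERROR_BY_SIZE = {1: 0.20, 2: 0.12, 3: 0.12, 4: 0.15, 5: 0.10}


def toy_error(features: List[int]) -> float:
    return TOY_ERROR_BY_SIZE[len(features)]


selected, best_error = select_from_ranking(ranking, toy_error)

# Fraction of the original features that were dropped.
reduction = 1 - len(selected) / len(ranking)

print(selected, best_error, reduction)
```

In practice `error_fn` would train the chosen classifier on the training records restricted to the candidate features and return its error rate on the evaluation set. Note that this stopping rule halts at the first increase, so it does not see the lower toy error at five features; that is a property of the rule as sketched here, not a claim about the paper's results.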