Department of Biomedical Engineering, University of Calgary, Calgary, AB, Canada.
Undergraduate Medical Education, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.
Med Biol Eng Comput. 2021 Feb;59(2):471-482. doi: 10.1007/s11517-020-02301-x. Epub 2021 Feb 3.
Optimizing the number and utility of features used in a classification analysis has been the subject of many research studies. Most current models use end-classifications as part of the feature reduction process, which introduces circularity into the methodology. The approach demonstrated in the present research uses item response theory (IRT) to select features independently of the end-classification results, avoiding the biased accuracies that this circularity engenders. Dichotomous and polytomous IRT models were used to analyze 30 histological breast cancer features from 569 patients in the Wisconsin Diagnostic Breast Cancer data set. Based on their characteristics, three features were selected for use in a machine learning classifier. For comparison, two machine learning-based feature selection protocols, recursive feature elimination (RFE) and ridge regression, were also run, and the three features selected by each were used in the subsequent classifier. Classification results demonstrated that all three selection processes performed comparably. Because the IRT protocol is non-biased and reports the specific characteristics that make each feature useful for classification, it helps clarify which attributes of features make them suitable for use in a machine learning context.
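To make the idea of label-independent selection concrete, the following is a minimal sketch of one common dichotomous IRT model, the two-parameter logistic (2PL). The abstract does not state which dichotomous model was fitted, so the choice of 2PL is an assumption. The feature names, the parameter values, and the rule of ranking by item information are hypothetical illustrations, not the authors' data or selection criteria.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a positive (1) response under the 2PL model.

    theta: latent trait level, a: discrimination, b: difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at latent trait level theta."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical (discrimination, difficulty) pairs; NOT values from the study.
items = {
    "feature_A": (2.5, 0.0),
    "feature_B": (0.6, 0.5),
    "feature_C": (1.8, -0.3),
    "feature_D": (0.9, 1.2),
}


def top_k_by_information(items, theta, k):
    """Rank items by information at theta and keep the k most informative."""
    ranked = sorted(
        items,
        key=lambda name: item_information(theta, *items[name]),
        reverse=True,
    )
    return ranked[:k]


# The ranking uses only item parameters; no class labels are consulted.
selected = top_k_by_information(items, theta=0.0, k=3)
print(selected)
```

In this sketch the selection depends only on the estimated item parameters, which is the property the abstract contrasts with selection methods that consult the end-classification.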