Kline Adrienne, Luo Yuan
Department of Surgery, Northwestern University, Chicago, USA.
Center for Artificial Intelligence, Northwestern Medicine, Chicago, USA.
Res Sq. 2024 Jul 2:rs.3.rs-4529519. doi: 10.21203/rs.3.rs-4529519/v1.
Most datasets contain partially or completely missing values, which limits both the models that can be applied to the data and the statistical inferences that can be drawn from them. Several imputation techniques have been designed to replace missing data with stand-in values. The choice of approach has implications for calculating clinical scores, model building, and model testing. The work presented here offers a novel means of categorical imputation based on item response theory (IRT) and compares it against several methods currently used in machine learning, including k-nearest neighbors (kNN), multiple imputation by chained equations (MICE), and Datawig, the Amazon Web Services (AWS) deep learning method. The techniques were compared on three datasets representing ordinal, nominal, and binary categories. The data were modified to vary both the proportion of missing data and the systematization of the missingness. Two assessments of performance were conducted: accuracy in reproducing the missing values, and predictive performance using the imputed data. The new method, Item Response Theory for Categorical Imputation (IRTCI), fared quite well compared with currently used methods, outperforming several of them in many conditions. Given its theoretical basis and its unique generation of probabilistic terms for determining the category membership of missing cells, IRTCI offers a viable alternative to current approaches.
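The first assessment described above (hide known values, impute them, then score how many are recovered) can be illustrated with a minimal sketch. This is not the authors' IRTCI method or their evaluation pipeline: it assumes synthetic data, masking completely at random, and a simple Hamming-distance kNN imputer as a stand-in. All function names are hypothetical.

```python
import random

MISSING = None  # sentinel for a missing cell (an assumption of this sketch)


def mask_completely_at_random(rows, proportion, rng):
    """Copy `rows`, set a given proportion of cells to MISSING, return (copy, masked cells)."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    masked = rng.sample(cells, int(round(proportion * len(cells))))
    out = [list(r) for r in rows]
    for i, j in masked:
        out[i][j] = MISSING
    return out, masked


def hamming(a, b):
    """Fraction of mismatches over columns observed in both rows (1.0 if none are shared)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing cell with the modal category among the k nearest donor rows."""
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are other rows in which column j is observed.
            donors = sorted(
                (hamming(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not MISSING
            )
            votes = [rows[idx][j] for _, idx in donors[:k]]
            if votes:
                # Ties are broken toward the smallest category for determinism.
                out[i][j] = max(sorted(set(votes)), key=votes.count)
    return out


def imputation_accuracy(truth, imputed, masked):
    """Share of masked cells whose imputed category equals the held-out true category."""
    hits = sum(truth[i][j] == imputed[i][j] for i, j in masked)
    return hits / len(masked) if masked else float("nan")


# Synthetic binary data: two groups of respondents with opposite answer patterns.
rng = random.Random(0)
truth = [[0] * 5 for _ in range(20)] + [[1] * 5 for _ in range(20)]
observed, masked = mask_completely_at_random(truth, 0.20, rng)
imputed = knn_impute(observed, k=3)
accuracy = imputation_accuracy(truth, imputed, masked)
print(f"masked cells: {len(masked)}, recovery accuracy: {accuracy:.2f}")
```

Varying the `proportion` argument mirrors the abstract's manipulation of how much data is missing. Systematic (non-random) missingness would need a different masking function, which this sketch does not attempt.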