Johannes Gutenberg University, Mainz, Germany.
Cancer Registry of Rhineland-Palatinate in the Institute for Digital Health Data, Germany.
Stud Health Technol Inform. 2024 Aug 22;316:1800-1804. doi: 10.3233/SHTI240780.
Missing values (NA) often occur in cancer research, for reasons such as data protection, data loss, or missing follow-up data. Such incomplete patient information can affect prediction models and other data analyses. Imputation methods are one tool for dealing with NA. Cancer data are often recorded in ordered categorical form, such as tumour grading and staging, which requires dedicated methods. This work compares mode imputation, k-nearest-neighbour (knn) imputation, and, within Multiple Imputation by Chained Equations (MICE), a proportional odds logistic regression model (mice_polr) and a random forest (mice_rf) on a real-world prostate cancer dataset provided by the Cancer Registry of Rhineland-Palatinate in Germany. Our dataset contains the information relevant for the risk classification of patients and the time between date of diagnosis and date of death. For the imputation comparison, we use Rubin's (1974) Missing Completely At Random (MCAR) mechanism to remove 10%, 20%, 30%, and 50% of the observations. The results are evaluated and ranked by accuracy per patient. Mice_rf performs significantly better than all other methods at every percentage of NA, followed by knn, while mice_polr performs significantly worse than all other methods. Furthermore, our findings indicate that the accuracy of imputation methods increases with a lower number of categories, a relatively even proportion of patients across the categories, or a majority of patients in one particular category.
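The evaluation loop described in the abstract (remove values under MCAR, impute, compare against the known truth) can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses synthetic data, a single ordered variable, and only the mode-imputation baseline, and it scores accuracy over the removed entries rather than per patient. The function names `remove_mcar`, `impute_mode`, and `imputation_accuracy`, as well as the category weights, are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def remove_mcar(values, fraction, rng):
    """Set a uniformly random `fraction` of entries to None (MCAR)."""
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, missing_idx


def impute_mode(values):
    """Replace every None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def imputation_accuracy(truth, imputed, missing_idx):
    """Share of removed entries whose imputed value equals the true value."""
    hits = sum(truth[i] == imputed[i] for i in missing_idx)
    return hits / len(missing_idx)


rng = random.Random(0)
# Synthetic ordered grades 1..4 with one dominant category (assumed weights,
# not taken from the registry data).
grades = rng.choices([1, 2, 3, 4], weights=[10, 60, 20, 10], k=1000)

# Same missingness levels as in the abstract.
for fraction in (0.1, 0.2, 0.3, 0.5):
    masked, missing_idx = remove_mcar(grades, fraction, rng)
    imputed = impute_mode(masked)
    acc = imputation_accuracy(grades, imputed, missing_idx)
    print(f"{int(fraction * 100)}% NA: mode imputation accuracy = {acc:.3f}")
```

With a dominant category, the mode baseline's accuracy should sit close to that category's share of the data, in line with the abstract's observation that a majority category makes imputation easier. The knn, mice_polr, and mice_rf methods would replace `impute_mode` in the same loop.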