Hapfelmeier Alexander, Hothorn Torsten, Riediger Carina, Ulm Kurt
Int J Biostat. 2014;10(2):165-83. doi: 10.1515/ijb-2013-0038.
Abstract: In the last few decades, new developments in liver surgery have expanded its applicability and improved its safety. However, liver surgery is still associated with postoperative morbidity and mortality, especially in extended resections. We analyzed a large liver surgery database to investigate whether laboratory parameters such as haemoglobin, leucocytes, bilirubin, haematocrit and lactate might be relevant preoperative predictors. Missing values are common in such data, as they are in many other data sources and research fields. For analysis, one can use imputation methods or approaches that are able to deal with missing values in the predictor variables. Random Forests are a representative of the latter; they also provide variable importance measures to assess a variable's relevance for prediction. Applied to the liver surgery data, we observed divergent results for the laboratory parameters, depending on the method used to cope with missing values. We therefore performed an extensive simulation study to investigate the properties of each approach. Findings and recommendations: Complete case analysis should not be used, as it distorts the relevance of completely observed variables in an undesirable way. Estimating a variable's importance by a self-contained measure that can deal with missing values appropriately reflects the decreased relevance of variables with missing values. It can therefore be used to obtain insight into Random Forests, which are commonly fit without preprocessing of missing values in the data. By contrast, multiple imputation allows for the assessment of the relevance a variable would potentially show in complete-data situations, provided that imputation performs well. For the laboratory data, lactate and bilirubin seem to be associated with the risk of liver failure and postoperative complications. These relations should be investigated in more detail by future studies. However, when there are missing values in the predictor variables, the method used for analysis must be chosen carefully.
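As a rough illustration of why complete case analysis is costly when several predictors have missing values, the following minimal sketch simulates listwise deletion. It is not the authors' simulation design: the five predictors mirror the five laboratory parameters named above, but the 10% missingness rate, the assumption that values are missing completely at random and independently per predictor, and the function names `simulate_predictors` and `complete_cases` are all hypothetical choices made for this example. The sketch shows only the loss of observations, not the distortion of variable importance reported in the abstract.

```python
import random


def simulate_predictors(n_rows, n_predictors, p_missing, seed=0):
    """Draw standard-normal predictors; each value is independently set to
    None with probability p_missing (assumed MCAR mechanism)."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < p_missing else rng.gauss(0.0, 1.0)
         for _ in range(n_predictors)]
        for _ in range(n_rows)
    ]


def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing predictor."""
    return [row for row in rows if all(value is not None for value in row)]


# Hypothetical settings: 5 predictors, 10% missing values in each.
N_ROWS, N_PREDICTORS, P_MISSING = 10_000, 5, 0.10

rows = simulate_predictors(N_ROWS, N_PREDICTORS, P_MISSING)
kept = complete_cases(rows)

retained = len(kept) / len(rows)
# Under independent missingness, a row survives with probability (1 - p)^k.
expected = (1 - P_MISSING) ** N_PREDICTORS

print(f"retained fraction: {retained:.3f} (expected about {expected:.3f})")
```

Under these assumptions only about 59% of the rows survive deletion, even though each individual predictor is 90% observed. Approaches that keep incomplete rows, such as multiple imputation or tree methods that handle missing predictor values, avoid this loss.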