Zamanian Alireza, von Kleist Henrik, Ciora Octavia-Andreea, Piperno Marta, Lancho Gino, Ahmidi Narges
Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany.
Fraunhofer Institute for Cognitive Systems IKS, 80686 Munich, Germany.
J Pers Med. 2024 May 11;14(5):514. doi: 10.3390/jpm14050514.
Despite the extensive literature on missing data theory and cautionary articles emphasizing the importance of realistic analysis for healthcare data, a critical gap persists in incorporating domain knowledge into missing data methods. In this paper, we argue that the remedy is to identify the key scenarios that lead to data missingness and to investigate their theoretical implications. Based on this proposal, we first introduce an analysis framework in which we investigate how different observation agents, such as physicians, influence data availability, and then scrutinize each scenario with respect to the steps of missing data analysis. We apply this framework to a case study of observational data in healthcare facilities. We identify ten fundamental missingness scenarios and show how they influence the identification step for missing data graphical models, inverse probability weighting estimation, and exponential tilting sensitivity analysis. To emphasize how domain-informed analysis can improve method reliability, we conduct simulation studies under various missingness scenarios. We compare the results of three methods common in medical data analysis: complete-case analysis, MissForest imputation, and inverse probability weighting estimation. The experiments address two objectives: variable mean estimation and classification accuracy. We advocate for our analysis approach as a reference for observational health data analysis. Beyond that, we posit that the proposed analysis framework is applicable to other medical domains.
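To make the comparison between complete-case analysis and inverse probability weighting concrete for the mean-estimation objective, the following is a minimal sketch and not the simulation design used in the paper. The data-generating model, the coefficients, the function names, and the use of the true observation probabilities (which in practice would have to be estimated) are all illustrative assumptions. The variable of interest is observed with a probability that depends only on a fully observed covariate, so the complete-case mean is biased while the weighted mean is not.

```python
import math
import random


def simulate(n=100_000, seed=0):
    """Draw a toy dataset (hypothetical model, not taken from the paper).

    x is a fully observed covariate and y is the variable of interest with
    true mean 2.0. Whether y is observed depends only on x, so larger x
    makes observation more likely.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 2.0 + 1.5 * x + rng.gauss(0.0, 1.0)
        # Assumed observation probability (logistic in x).
        p_obs = 1.0 / (1.0 + math.exp(-(0.5 + 1.5 * x)))
        observed = rng.random() < p_obs
        # Store None when y is missing; keep p_obs for the weighting step.
        rows.append((y if observed else None, p_obs))
    return rows


def complete_case_mean(rows):
    """Average y over the records where it was observed."""
    ys = [y for y, _ in rows if y is not None]
    return sum(ys) / len(ys)


def ipw_mean(rows):
    """Weighted mean with weights 1 / p_obs, normalized by the weight sum.

    Assumption: p_obs is the true observation probability. With real data
    it would be estimated, for example by a logistic regression on x.
    """
    num = sum(y / p for y, p in rows if y is not None)
    den = sum(1.0 / p for y, p in rows if y is not None)
    return num / den


if __name__ == "__main__":
    data = simulate()
    print("true mean:          2.0")
    print("complete-case mean:", round(complete_case_mean(data), 3))
    print("IPW mean:           ", round(ipw_mean(data), 3))
```

In this toy setting the complete-case estimate overshoots the true mean because records with large x, and therefore large y, are over-represented among the observed cases; reweighting each observed record by the inverse of its observation probability corrects for that over-representation.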