Department of Health Sciences and Technology (D-HEST), ETH Zurich, Universitätstrasse 2, 8092, Zürich, Switzerland.
Schulthess Klinik, Lengghalde 2, 8008, Zürich, Switzerland.
BMC Med Res Methodol. 2024 Jan 6;24(1):5. doi: 10.1186/s12874-023-02125-x.
In recent decades, medical research fields studying rare conditions such as spinal cord injury (SCI) have made extensive efforts to collect large-scale data. However, most analysis methods require complete data. This is particularly troublesome for clinical data, which are prone to missingness. Researchers often mitigate this problem by removing patients with missing data from the analyses. Less commonly, they apply imputation methods to infer likely values.
Our objective was to study how the handling of missing data influences reported results, taking SCI registries as an example. We aimed to raise awareness of the effects of missing data and to provide guidelines for future research projects, in SCI research and beyond.
Using the Sygen clinical trial data (n = 797), we analyzed the impact of the type of variable in which data are missing, the pattern according to which data are missing, and the imputation strategy (e.g., mean imputation, last observation carried forward, multiple imputation).
Our simulations show that mean imputation may lead to results that deviate strongly from the underlying expected results. For repeated measures missing at late stages (≥ 6 months after injury in this simulation study), carrying the last observation forward seems the preferable imputation option. This simulation study shows that a one-size-fits-all imputation strategy falls short in SCI data sets.
Data-tailored imputation strategies are required (e.g., characterisation of the missingness pattern, last observation carried forward for repeated measures evolving to a plateau over time). Therefore, systematically reporting the extent and kind of missing data, and the decisions made in handling them, will be essential to improve the interpretation, transparency, and reproducibility of the research presented.
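The contrast between mean imputation and last observation carried forward (LOCF) for a repeated measure that plateaus can be illustrated with a minimal sketch. The scores below are invented for illustration and are not Sygen data; the function names (`mean_impute_column`, `locf_row`, `total_abs_error`) are hypothetical helpers, and mean imputation is assumed here to mean filling a missing value with the mean of the values observed at that visit across patients.

```python
from statistics import mean

# Hypothetical recovery scores for four patients at four visits.
# Each trajectory flattens toward a plateau at the late visits.
cohort_true = [
    [5, 12, 15, 16],
    [20, 38, 45, 46],
    [40, 70, 82, 84],
    [60, 85, 95, 96],
]

# Simulate missingness at the last visit for the first and last patient.
LAST = 3
cohort_observed = [list(row) for row in cohort_true]
cohort_observed[0][LAST] = None
cohort_observed[3][LAST] = None


def mean_impute_column(rows, col):
    """Fill missing values in one visit with the mean of observed values at that visit."""
    fill = mean([r[col] for r in rows if r[col] is not None])
    filled = [list(r) for r in rows]
    for r in filled:
        if r[col] is None:
            r[col] = fill
    return filled


def locf_row(row):
    """Carry each patient's last observed value forward; leading gaps stay None."""
    out, last = [], None
    for value in row:
        if value is not None:
            last = value
        out.append(last)
    return out


def total_abs_error(filled, truth):
    """Sum of absolute differences between imputed and true scores."""
    return sum(abs(f - t) for fr, tr in zip(filled, truth) for f, t in zip(fr, tr))


mean_filled = mean_impute_column(cohort_observed, LAST)
locf_filled = [locf_row(row) for row in cohort_observed]

mean_error = total_abs_error(mean_filled, cohort_true)
locf_error = total_abs_error(locf_filled, cohort_true)
print(f"mean imputation error: {mean_error}, LOCF error: {locf_error}")
```

In this toy cohort, the cross-patient mean pulls the mildest and most severe cases toward the centre, whereas LOCF stays close to each patient's own plateau. The advantage depends on the trajectory having flattened before the missing visit; for measures still changing quickly, LOCF would be expected to do worse.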