Missing data imputation in clinical trials using recurrent neural network facilitated by clustering and oversampling.

Affiliations

Institute for Medical Information Processing, Biometry and Epidemiology (IBE), LMU Munich, Munich, Germany.

Alvotech Germany GmbH, Jülich, Germany.

Publication information

Biom J. 2022 Jun;64(5):863-882. doi: 10.1002/bimj.202000393. Epub 2022 Mar 10.

DOI:10.1002/bimj.202000393

PMID:35266565

Abstract

In clinical practice, the composition of missing data may be complex, for example, a mixture of data that are missing at random (MAR) and data that are missing not at random (MNAR). Many methods are available under the MAR assumption. Under the MNAR assumption, likelihood-based methods require specification of the joint distribution of the data and the missingness mechanism, and have been introduced as sensitivity analyses. These classic models rely heavily on the underlying assumption and, in many realistic scenarios, can produce unreliable estimates. In this paper, we develop a machine learning based missing data prediction framework with the aim of handling more realistic missing data scenarios. We use an imbalanced learning technique (i.e., oversampling of the minority class) to handle the MNAR data. To implement oversampling for a longitudinal continuous variable, we first cluster the trajectories via k-means, and we then use a recurrent neural network (RNN) to model the longitudinal data. Further, we apply bootstrap aggregating to improve the accuracy of prediction and to account for the uncertainty of a single prediction. We evaluate the proposed method using simulated data, assessing the prediction results at both the individual patient level and the overall population level. We demonstrate the strong predictive capability of the RNN for longitudinal data and its flexibility for nonlinear modeling. Overall, the proposed method provides accurate individual predictions for both MAR and MNAR data and reduces the bias due to missing data in treatment effect estimation when compared to standard methods and classic models. Finally, we apply the proposed method to a real dataset from an antidepressant clinical trial. In summary, this paper encourages the integration of machine learning strategies for handling missing data in the analysis of randomized clinical trials.
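The clustering and oversampling steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the function names `kmeans_trajectories` and `oversample_minority`, and the choice to treat every smaller cluster as a minority class and resample it up to the size of the largest cluster are all assumptions made for illustration. The RNN and bootstrap aggregating steps are omitted.

```python
import numpy as np

def kmeans_trajectories(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; rows are patients, columns are visits."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every trajectory to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def oversample_minority(X, labels, seed=0):
    """Resample with replacement until each cluster matches the largest one."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(labels)
    target = counts.max()
    keep = []
    for j, n in enumerate(counts):
        if n == 0:
            continue  # skip empty clusters
        members = np.flatnonzero(labels == j)
        extra = rng.choice(members, size=target - n, replace=True)
        keep.append(np.concatenate([members, extra]))
    idx = np.concatenate(keep)
    return X[idx], labels[idx]

# Hypothetical toy data: 40 declining trajectories and 8 non-declining ones.
rng = np.random.default_rng(1)
visits = np.arange(5)
majority = 20 - 2.0 * visits + rng.normal(0, 1, size=(40, 5))
minority = 20 + 0.5 * visits + rng.normal(0, 1, size=(8, 5))
X = np.vstack([majority, minority])

labels, _ = kmeans_trajectories(X, k=2)
X_bal, y_bal = oversample_minority(X, labels)
```

After this step every non-empty cluster contributes the same number of trajectories, so a downstream sequence model would see the rarer trajectory pattern as often as the common one during training.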
