Department of Future Technologies, University of Turku, Turku, Finland.
PerkinElmer, Turku, Finland.
J Am Med Inform Assoc. 2020 Nov 1;27(11):1667-1674. doi: 10.1093/jamia/ocaa127.
Minority oversampling is a standard approach for adjusting the ratio between classes in imbalanced data. However, established methods often provide only modest improvements in classification performance when applied to mixed-type data with an extremely imbalanced class distribution. This situation is typical of vital statistics data, in which the outcome incidence dictates the number of positive observations. In this article, we developed a novel neural network-based oversampling method called actGAN (activation-specific generative adversarial network) that can generate synthetic observations that improve prediction performance in this context.
Using vital statistics data, we chose to predict early stillbirth from demographics, pregnancy history, and infections. The data contained 363 560 live births and 139 early stillbirths, resulting in a class distribution of 99.96% and 0.04%. The hyperparameters of actGAN and of a baseline method, SMOTE-NC (Synthetic Minority Over-sampling Technique-Nominal Continuous), were tuned with Bayesian optimization, and both methods were compared against an approach that used cost-sensitive learning only.
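The class proportions quoted above follow directly from the two counts. The short sketch below reproduces that arithmetic and, as an illustration only, derives class weights with the common "balanced" heuristic (total count divided by the number of classes times the class count). The abstract does not state which weighting the cost-sensitive baseline used, so the weight formula and all variable names are assumptions.

```python
# Counts reported in the abstract.
n_live_births = 363_560
n_early_stillbirths = 139
n_total = n_live_births + n_early_stillbirths

# Class proportions, in percent.
pct_negative = 100 * n_live_births / n_total
pct_positive = 100 * n_early_stillbirths / n_total

# Assumed weighting scheme (not confirmed by the source):
# weight = n_total / (n_classes * n_class), with n_classes = 2.
weight_negative = n_total / (2 * n_live_births)
weight_positive = n_total / (2 * n_early_stillbirths)

print(f"live births: {pct_negative:.2f}%, early stillbirths: {pct_positive:.2f}%")
print(f"weights: negative={weight_negative:.3f}, positive={weight_positive:.1f}")
```

Under this heuristic a misclassified stillbirth would cost roughly three orders of magnitude more than a misclassified live birth, which gives a sense of how extreme the imbalance is.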
While SMOTE-NC provided mixed results, actGAN consistently improved both the true positive rate at a clinically significant false positive rate and the area under the receiver operating characteristic curve.
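To make the first metric concrete, the following sketch computes the highest true positive rate that a score threshold can reach while keeping the false positive rate at or below a chosen limit. The function name `tpr_at_fpr`, the toy labels and scores, and the 0.15 limit are illustrative assumptions; the abstract does not state the false positive rate the authors used.

```python
def tpr_at_fpr(y_true, scores, max_fpr):
    """Highest TPR over all score thresholds whose FPR is <= max_fpr.

    y_true holds 0/1 labels and must contain at least one of each class.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sweep the threshold from the highest score downwards.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Tied scores fall on the same side of any threshold.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here on
        best = tp / n_pos
    return best


# Toy data, for illustration only.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
print(tpr_at_fpr(labels, scores, max_fpr=0.15))
```

Reporting the true positive rate at a fixed false positive rate complements the area under the curve, because it evaluates the classifier at the single operating point that matters for screening.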
Including an activation-specific output layer in the generator network of actGAN enables the addition of information about the underlying data structure, which outperforms the nominal mechanism of SMOTE-NC.
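The general idea of an activation-specific output layer can be sketched as follows: the generator's raw output vector is split into per-feature blocks, and each block receives an activation matching its data type. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the activations or the feature encoding, so the softmax/sigmoid choice, the feature names, and the layout below are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one categorical block."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


# Hypothetical feature layout: (name, kind, width in the raw output vector).
FEATURE_LAYOUT = [
    ("maternal_age_scaled", "continuous", 1),  # assumed scaled to [0, 1]
    ("prior_stillbirth", "binary", 1),
    ("infection_type", "categorical", 3),      # assumed one-hot with 3 levels
]


def activation_specific_output(raw, layout):
    """Apply a per-feature activation to the generator's raw outputs."""
    if len(raw) != sum(width for _, _, width in layout):
        raise ValueError("raw output length does not match the layout")
    out = {}
    start = 0
    for name, kind, width in layout:
        block = raw[start:start + width]
        if kind == "categorical":
            out[name] = softmax(block)       # probabilities over levels
        else:
            out[name] = [sigmoid(block[0])]  # value in (0, 1)
        start += width
    return out


result = activation_specific_output([0.0, 2.0, 1.0, 1.0, 1.0], FEATURE_LAYOUT)
print(result)
```

Because each categorical block is normalized on its own, a synthetic observation always carries a valid distribution over the levels of that feature, which is one way such a layer could encode the structure of mixed-type data.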
actGAN improves prediction performance for our learning task. The method could be applied to other mixed-type data prediction tasks that are known to be affected by class imbalance and limited data availability.