Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain.
Servicio de Obstetricia, H.U.P. La Fe, 46026 Valencia, Spain.
Sensors (Basel). 2022 Jul 7;22(14):5098. doi: 10.3390/s22145098.
Owing to its high sensitivity, electrohysterography (EHG) has emerged as an alternative technique for predicting preterm labor. The main obstacle in designing preterm labor prediction models is the inherent imbalance between preterm and term cases, which can give rise to relatively low performance. Numerous studies have reported promising preterm labor prediction results using the synthetic minority oversampling technique (SMOTE). However, these studies generally overestimate the models' real generalization capacity because they generate synthetic data before splitting the dataset, which leaks information between the training and testing partitions and thus reduces the complexity of the classification task. In this work, we analyzed the effect of combining feature selection and resampling methods to overcome the class imbalance problem when predicting preterm labor from EHG. We assessed undersampling, oversampling, and hybrid methods applied to the training and validation datasets during feature selection by a genetic algorithm, and analyzed the effect of resampling the training data after obtaining the optimized feature subset. The best strategy consisted of undersampling the majority class of the validation dataset to 1:1 during feature selection, without subsequent resampling of the training data, achieving an AUC of 94.5 ± 4.6%, average precision of 84.5 ± 11.7%, maximum F1-score of 79.6 ± 13.8%, and recall of 89.8 ± 12.1%. Our results outperformed the techniques currently used in clinical practice, suggesting that EHG could be used to predict preterm labor in clinical settings.
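The leakage issue described in the abstract comes down to the order of operations: any resampling should happen after the dataset is split, and only inside the partition it is meant for. The following is a minimal sketch of that ordering using random undersampling to 1:1, not the authors' pipeline. The toy labels, the 20% test fraction, and the helper names `stratified_split` and `undersample_to_balance` are all assumptions made for illustration. The genetic-algorithm feature selection used in the study is not reproduced here.

```python
import random

def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

def undersample_to_balance(indices, labels, rng):
    """Randomly drop majority-class samples until the classes are 1:1."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    kept = []
    for idx in by_class.values():
        kept += rng.sample(idx, n_min)
    return sorted(kept)

rng = random.Random(0)
# Hypothetical toy labels: 1 = preterm (minority), 0 = term (majority).
labels = [1] * 10 + [0] * 90

# Split first, so the held-out test partition is never touched by resampling.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
# Resample only within the non-test partition.
balanced_idx = undersample_to_balance(train_idx, labels, rng)
```

Because the balanced subset is drawn only from `train_idx`, no test sample can influence it. Generating synthetic minority samples before the split would not give this guarantee, since a synthetic point interpolated from a sample that later lands in the test partition could end up in training.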