Rad Milad, Rafiei Alireza, Grunwell Jocelyn, Kamaleswaran Rishikesan
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
Department of Computer Science and Informatics, Emory University, Atlanta, GA, USA.
Int J Med Inform. 2025 Apr;196:105809. doi: 10.1016/j.ijmedinf.2025.105809. Epub 2025 Jan 25.
Regression on small, imbalanced horizontal datasets is an important problem in bioinformatics because rare but vital data points strongly affect model performance. Most clinical studies have imbalanced data distributions, which impair the learning ability of regression and classification models. Combined with a small number of samples, this imbalance further reduces prediction performance. Improving the trainability of models on small imbalanced datasets would therefore substantially strengthen current prediction models that rely on a small set of valuable, expensive samples.
Stability Selection was used to overcome the high-dimensionality problem, which arises when the sample size is small relative to the number of features. The method was applied to improve the performance of the Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise (SMOGN), an algorithm for removing imbalance in regression data. To test the new pipeline, a small imbalanced cohort of pediatric ICU patients admitted for respiratory illness was used to predict the number of Ventilator-Free Days (VFD) a patient may experience over a 28-day admission period.
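The feature-selection step described above can be illustrated with a minimal sketch of Stability Selection: a base selector is run on many random half-samples, and only features chosen in a large fraction of runs are kept. Everything in this sketch is an assumption made for illustration, not the authors' implementation: the base selector is a simple top-k absolute-correlation filter (swapped in for the penalized regression that is more typical in practice), the data are synthetic, and the names `pearson`, `top_k_features`, and `stability_selection` as well as the parameters `k`, `n_rounds`, and `threshold` are hypothetical. The SMOGN resampling and the final regression are not shown.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def top_k_features(X, y, k):
    """Illustrative base selector: indices of the k features most correlated with y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def stability_selection(X, y, k=2, n_rounds=100, threshold=0.8, seed=0):
    """Return (selected feature indices, per-feature selection frequencies)."""
    rng = random.Random(seed)
    n = len(X)
    counts = [0] * len(X[0])
    for _ in range(n_rounds):
        # Draw a random half of the samples without replacement.
        idx = rng.sample(range(n), n // 2)
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        for j in top_k_features(X_sub, y_sub, k):
            counts[j] += 1
    freqs = [c / n_rounds for c in counts]
    # Keep only features that were selected in at least `threshold` of the rounds.
    selected = [j for j, f in enumerate(freqs) if f >= threshold]
    return selected, freqs


# Synthetic demo (assumed data): 100 samples, 10 features, only features 0 and 1 drive y.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(100)]
y = [3 * row[0] + 3 * row[1] + data_rng.gauss(0, 0.1) for row in X]

selected, freqs = stability_selection(X, y)
print("selected features:", selected)
print("selection frequencies:", [round(f, 2) for f in freqs])
```

In a pipeline like the one described, the retained columns would then be passed to the resampling step, so that synthetic minority samples are generated in a lower-dimensional feature space.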
By applying Stability Selection before SMOGN, our model overcame label imbalance and correctly predicted almost all of the non-surviving patients in the test dataset. Our study also highlighted the importance of the Pediatric Risk of Mortality (PRISM) score as a powerful VFD predictor when combined with other clinical features.
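The focus on non-surviving patients is easier to see with the VFD label written out. The abstract does not state how VFD was computed, so the sketch below assumes a commonly used convention: a patient who dies within the 28-day window receives 0 VFD, and a survivor receives 28 minus the days spent on the ventilator, floored at 0. The function name `ventilator_free_days` and its parameters are hypothetical.

```python
def ventilator_free_days(days_on_ventilator, died_within_window, window=28):
    """VFD under the assumed convention: 0 for non-survivors, else window minus ventilator days."""
    if died_within_window:
        return 0
    # Patients ventilated for the whole window (or longer) also score 0.
    return max(0, window - days_on_ventilator)


# A survivor ventilated for 5 days versus a non-survivor ventilated for 5 days.
print(ventilator_free_days(5, died_within_window=False))
print(ventilator_free_days(5, died_within_window=True))
```

Under this convention non-survivors all fall at the low extreme of the label range, which is one way a small cohort can end up with rare but clinically critical target values.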
This paper shows how a hybrid strategy of Stability Selection, SMOGN, and regression can improve results on highly imbalanced datasets and reduce the probability of highly expensive false-negative detections in severe acute respiratory disease syndrome cases. The proposed modeling pipeline not only reduces the overall VFD regression error but can also be extended to other regression targets. We also showed the importance of PRISM as a strong VFD predictor.