Department of Computer Science, University of York, UK.
Data Science Institute, Department of Computing, Imperial College London, UK.
Phys Chem Chem Phys. 2023 Jun 15;25(23):15744-15755. doi: 10.1039/d2cp05975d.
Predicting drop coalescence from process parameters is crucial for experimental design in chemical engineering. However, predictive models can suffer from a lack of training data and, more importantly, from label imbalance. In this study, we propose using deep learning generative models to tackle this bottleneck by training the predictive models on generated synthetic data. A novel generative model, named the double space conditional variational autoencoder (DSCVAE), is developed for labelled tabular data. By introducing label constraints in both the latent and the original space, DSCVAE is capable of generating more consistent and realistic samples than the standard conditional variational autoencoder (CVAE). Two predictive models, namely random forest and gradient boosting classifiers, are enhanced with synthetic data, and their performance is evaluated on real experimental data. Numerical results show that a considerable improvement in prediction accuracy can be achieved by using synthetic data and that the proposed DSCVAE clearly outperforms the standard CVAE. This research provides further insight into handling imbalanced data for classification problems, especially in chemical engineering.
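The augmentation step described in the abstract, adding synthetic samples for under-represented labels before training a classifier, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `jitter_generator` is a hypothetical placeholder standing in for a trained conditional generator (it is not the DSCVAE), `augment_to_balance` is a hypothetical helper, balancing every label up to the majority count is an assumed policy, and the toy data are made up.

```python
import random
from collections import Counter


def jitter_generator(real_rows, n, rng, scale=0.05):
    """Hypothetical stand-in for a trained conditional generator (NOT the DSCVAE).

    Draws n synthetic rows by adding small Gaussian noise to randomly
    chosen real rows that share the requested label.
    """
    synthetic = []
    for _ in range(n):
        base = rng.choice(real_rows)
        synthetic.append([value + rng.gauss(0.0, scale) for value in base])
    return synthetic


def augment_to_balance(X, y, generator=jitter_generator, seed=0):
    """Return (X_aug, y_aug) where every label has as many rows as the majority label.

    The real rows are kept unchanged at the front; synthetic rows are appended.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_aug, y_aug = list(X), list(y)
    for label, count in counts.items():
        deficit = target - count
        if deficit == 0:
            continue  # this label is already at the majority count
        real_rows = [row for row, lab in zip(X, y) if lab == label]
        X_aug.extend(generator(real_rows, deficit, rng))
        y_aug.extend([label] * deficit)
    return X_aug, y_aug


# Made-up toy data: two process parameters per experiment,
# label 1 = coalescence, label 0 = no coalescence (assumed encoding).
X_train = [[0.1, 1.0], [0.2, 1.1], [0.15, 0.9], [0.3, 1.2], [0.9, 0.2]]
y_train = [1, 1, 1, 1, 0]

X_aug, y_aug = augment_to_balance(X_train, y_train)
print("before:", Counter(y_train))
print("after: ", Counter(y_aug))
```

The balanced set would then be passed to a classifier such as a random forest or gradient boosting model, while evaluation stays on held-out real experimental data only, so that synthetic samples never inflate the reported accuracy.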