Wei Sizhou, Chen Zhiyuan, Arumugasamy Senthil Kumar, Chew Irene Mei Leng
School of Computer Science, University of Nottingham, Nottingham, NG8 1BB, United Kingdom.
School of Computer Science, University of Nottingham Malaysia, Semenyih, 43500, Malaysia.
Environ Sci Ecotechnol. 2022 Apr 20;11:100172. doi: 10.1016/j.ese.2022.100172. eCollection 2022 Jul.
Machine learning is increasingly used in biochemistry. However, in organic chemistry and other experiment-based fields, the data collected from real experiments are often insufficient, and the current coronavirus disease (COVID-19) pandemic has made the situation worse. Such limited data can lead to poorly performing models and hinder the development of a sound control strategy. This paper proposes a feasible machine learning solution to the small-sample-size problem in the bio-polymerization process. To avoid overfitting, variational auto-encoder and generative adversarial network algorithms are used for data augmentation, and random forest and artificial neural network algorithms are used for modeling. The results show that data augmentation effectively improves the performance of the regression model. Of the machine learning models compared, the random forest model with data augmented by the generative adversarial network performed best in predicting molecular weight, on both the training set (R of 0.94) and the test set (R of 0.74); the coefficient of determination of this model was 0.74.
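The abstract reports model quality through a correlation-style score and the coefficient of determination. As a reading aid, the following minimal sketch shows how these two quantities are conventionally computed from true and predicted values. The function names and the sample numbers are illustrative assumptions and are not taken from the paper, which does not state how its scores were computed.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (standard definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    # Hypothetical values for illustration only (not data from the paper).
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.1, 1.9, 3.2, 3.8]
    print("R^2:", r_squared(y_true, y_pred))
    print("Pearson R:", pearson_r(y_true, y_pred))
```

The two scores are not interchangeable. A model that always predicts the mean of the targets has a coefficient of determination of 0, and the Pearson correlation is undefined for constant predictions because their variance is zero.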