Department of Biological and Ecological Engineering, Oregon State University, Corvallis, OR 97333, USA.
School of Chemical Engineering and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China.
Water Res. 2021 Jul 1;199:117182. doi: 10.1016/j.watres.2021.117182. Epub 2021 Apr 22.
Modeling of anaerobic digestion (AD) is crucial for understanding process dynamics and improving digester performance. It is an essential yet difficult task because of the complex and unknown interactions within the system. The application of well-developed data mining technologies, such as machine learning (ML) and microbial gene sequencing techniques, is promising for overcoming these challenges. In this study, we investigated the feasibility of 6 ML algorithms for predicting methane yield, using genomic data and the corresponding operational parameters from 8 research groups. For classification models, random forest (RF) achieved accuracies of 0.77 using operational parameters alone and 0.78 using genomic data at the bacterial phylum level alone. Combining operational parameters and genomic data improved the prediction accuracy to 0.82 (p < 0.05). For regression models, a neural network using genomic data at the bacterial phylum level alone achieved a low root mean square error of 0.04 (relative root mean square error = 8.6%). Feature importance analysis by RF suggested that Chloroflexi, Actinobacteria, Proteobacteria, Fibrobacteres, and Spirochaeta were the 5 most important phyla, even though their relative abundances ranged only from 0.1% to 3.1%. The important features identified could provide guidance for early warning and proactive management of microbial communities. This study demonstrated the promising application of ML techniques for predicting and controlling AD performance.
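To make the two regression metrics concrete, the following is a minimal sketch of how a root mean square error (RMSE) and a relative RMSE might be computed. It assumes the relative RMSE is the RMSE divided by the mean observed value; the abstract does not state which normalization was used, so this is an assumption rather than the authors' definition. The function names and the sample values are hypothetical and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse(observed, predicted):
    """RMSE divided by the mean observed value (assumed normalization)."""
    mean_observed = sum(observed) / len(observed)
    return rmse(observed, predicted) / mean_observed


# Hypothetical methane yields (illustrative only, not from the study).
observed = [0.40, 0.50, 0.45, 0.52]
predicted = [0.44, 0.46, 0.49, 0.48]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"relative RMSE = {relative_rmse(observed, predicted):.1%}")
```

Under this assumed definition, an RMSE of 0.04 paired with a relative RMSE of 8.6% would correspond to a mean observed methane yield of roughly 0.04 / 0.086 ≈ 0.47, in whatever units the study used.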