REQUIMTE/DQ, Faculty of Science and Technology, University Nova de Lisboa, Campus de Caparica, 2829-516, Caparica, Portugal.
CEAM, Faculty of Science, Agriculture and Engineering, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK.
Bioprocess Biosyst Eng. 2019 Nov;42(11):1853-1865. doi: 10.1007/s00449-019-02181-y. Epub 2019 Aug 2.
Hybrid semi-parametric modeling, which combines mechanistic and machine-learning methods, has proven to be a powerful approach to process development. This paper proposes bootstrap aggregation to increase the predictive power of hybrid semi-parametric models when the process data are obtained by statistical design of experiments. A fed-batch Escherichia coli optimization problem is addressed, in which three factors (biomass growth setpoint, temperature, and biomass concentration at induction) were varied according to a statistical design to identify the optimal conditions for cell growth and recombinant protein expression. Synthetic data sets were generated by applying three distinct design methods, namely Box-Behnken, central composite, and Doehlert designs. Bootstrap-aggregated hybrid models were developed for the three designs and compared against the respective non-aggregated versions. It is shown that bootstrap aggregation significantly decreases the prediction mean squared error of new batch experiments for all three designs. The number of (best) models to aggregate is a key calibration parameter that needs to be fine-tuned for each problem. The Doehlert design was slightly better than the other designs at identifying the process optimum. Finally, the availability of several predictions allowed error bounds to be computed for the different parts of the model, which provides additional insight into the variation of predictions within the model components.
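The aggregation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: a straight-line least-squares fit stands in for a hybrid semi-parametric model, the out-of-bag points of each resample are assumed to serve as the validation set, and the names `fit_line`, `bagged_fit`, `bagged_predict`, `n_boot`, and `n_best` are hypothetical. The parameter `n_best` plays the role of the number of (best) models to aggregate, and the spread of the individual predictions gives a rough error bound.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (a stand-in for training one model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample (all x identical): fall back to a constant model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bagged_fit(xs, ys, n_boot=50, n_best=10, seed=0):
    """Train one model per bootstrap resample and keep the n_best models.

    Assumption: models are ranked by mean squared error on the points left
    out of their own resample (out-of-bag validation).
    """
    rng = random.Random(seed)
    n = len(xs)
    scored = []
    for _ in range(n_boot):
        # Draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # nothing left to validate on; skip this resample
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        mse = statistics.fmean([(ys[i] - (a + b * xs[i])) ** 2 for i in oob])
        scored.append((mse, a, b))
    scored.sort()  # lowest validation error first
    return [(a, b) for _, a, b in scored[:n_best]]


def bagged_predict(models, x):
    """Return the aggregated prediction and the spread across the models."""
    preds = [a + b * x for a, b in models]
    mean = statistics.fmean(preds)
    spread = statistics.stdev(preds) if len(preds) > 1 else 0.0
    return mean, spread
```

Calling `bagged_fit(xs, ys)` on a small data set and then `bagged_predict(models, x_new)` returns the averaged prediction together with the standard deviation of the individual model predictions. Varying `n_best` shows how the number of aggregated models affects the result, which is the calibration choice the abstract highlights.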