Pembroke College, Cambridge, UK.
Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK.
Bioprocess Biosyst Eng. 2019 Apr;42(4):657-663. doi: 10.1007/s00449-018-02059-5. Epub 2019 Jan 8.
The biologics sector has amassed a wealth of data over the past three decades, in line with bioprocess development and manufacturing guidelines. Precise analysis of these data is expected to reveal behavioural patterns in cell populations that can be used to predict how future culture processes might behave. The historical bioprocessing data likely comprise experiments conducted with different cell lines to produce different products, possibly years apart; this situation causes inter-batch variability, as well as missing data points that arise from human- and instrument-associated technical oversights. These unavoidable complications necessitate a pre-processing step prior to data mining. This study investigated the efficiency of mean imputation and multivariate regression for filling in the missing information in historical bio-manufacturing datasets, and evaluated their performance using symbolic regression models and Bayesian non-parametric models in subsequent data processing. Mean substitution was shown to be a simple and efficient imputation method for relatively smooth, non-dynamical datasets, whereas regression imputation was effective in dynamical datasets with less than 30% missing data, while maintaining the existing standard deviation and shape of the distribution. The nature of the missing information, whether Missing Completely At Random, Missing At Random or Missing Not At Random, emerged as the key feature for selecting the imputation method.
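The two imputation strategies named in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the function names `mean_impute` and `regression_impute`, the toy time-course data, and the simplifying assumption that the predictor columns are fully observed are all hypothetical choices made for illustration. Regression imputation is shown here as an ordinary least-squares fit on the complete rows.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def regression_impute(X, target):
    """Fill NaNs in column `target` using a least-squares fit on the other columns.

    Assumption for this sketch: all other columns are fully observed.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, target])
    predictors = np.delete(X, target, axis=1)
    # Design matrix with an intercept term
    A = np.column_stack([np.ones(len(X)), predictors])
    # Fit only on rows where the target was actually measured
    coef, *_ = np.linalg.lstsq(A[~missing], X[~missing, target], rcond=None)
    X[missing, target] = A[missing] @ coef
    return X

# Hypothetical toy data: column 0 is culture time (h), column 1 is a
# measurement that rises steadily; the final reading is missing.
data = np.array([
    [0.0, 1.0],
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
    [4.0, np.nan],
])

mean_filled = mean_impute(data)
reg_filled = regression_impute(data, target=1)
```

On this toy series the mean-imputed value is the average of the four observed readings (4.0), which ignores the upward trend, while the regression-imputed value continues the trend (approximately 9.0). That contrast is consistent with the abstract's finding that mean substitution suits relatively smooth, non-dynamical data and regression imputation suits dynamical data.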