DataHow AG, Zurich, Switzerland.
Chair for Mathematical Information Science, ETH Zurich.
Biotechnol Bioeng. 2021 Nov;118(11):4389-4401. doi: 10.1002/bit.27907. Epub 2021 Aug 12.
To date, developing a biochemical process requires a large number of experiments, and the generated data is typically used only once, to make decisions for that development. If we could exploit data from already developed processes to make predictions for a novel process, we could significantly reduce the number of experiments needed. Processes for different products differ in behaviour, and typically only a subset behave similarly. Effective learning from process data spanning multiple products therefore requires a sensible representation of the product identity. We propose to represent the product identity (a categorical feature) by embedding vectors that serve as input to a Gaussian process regression model. We demonstrate how the embedding vectors can be learned from process data and show that they capture an interpretable notion of product similarity. The improvement in performance over traditional one-hot encoding is evaluated on a simulated cross-product learning task. Overall, the proposed method could enable significant reductions in wet-lab experiments.
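The idea of feeding a product representation into a Gaussian process can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the squared-exponential kernel, the fixed length scale and noise level, the toy response curves, and the hand-set 2-D embedding values are all assumptions made for illustration (in the paper the embeddings are learned from process data). The sketch shows only why the representation matters: under one-hot encoding every pair of distinct products is equally dissimilar, whereas embeddings let a new product sit close to a similar, well-studied one.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between the rows of A and the rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def build_inputs(process_x, product_ids, product_repr):
    # Concatenate process conditions with the vector representing each row's product.
    return np.hstack([process_x, product_repr[product_ids]])

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-2):
    # Standard GP regression posterior mean with a zero prior mean.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf_kernel(X_test, X_train) @ alpha

# Two candidate representations for three hypothetical products.
one_hot = np.eye(3)
# Hand-set embeddings (assumption): product 2 lies close to product 0, far from product 1.
embeddings = np.array([[0.0, 0.0], [2.0, 2.0], [0.1, 0.0]])

# Similarity between products implied by each representation.
k_onehot = rbf_kernel(one_hot, one_hot)        # all off-diagonal entries are equal
k_embed = rbf_kernel(embeddings, embeddings)   # entries reflect embedding distance

# Toy data (assumption): products 0 and 2 share a response, product 1 has the opposite one.
x0 = np.linspace(0.0, 3.0, 8)
x_train = np.concatenate([x0, x0, [1.5]])[:, None]
ids_train = np.array([0] * 8 + [1] * 8 + [2])      # the new product 2 has a single run
y_train = np.concatenate([np.sin(2 * x0), -np.sin(2 * x0), [np.sin(3.0)]])

x_test = np.linspace(0.0, 3.0, 5)[:, None]
ids_test = np.full(5, 2)                            # predict for the new product

pred_onehot = gp_posterior_mean(
    build_inputs(x_train, ids_train, one_hot), y_train,
    build_inputs(x_test, ids_test, one_hot))
pred_embed = gp_posterior_mean(
    build_inputs(x_train, ids_train, embeddings), y_train,
    build_inputs(x_test, ids_test, embeddings))

print("one-hot product similarity:\n", np.round(k_onehot, 3))
print("embedding product similarity:\n", np.round(k_embed, 3))
print("prediction for product 2 (one-hot):  ", np.round(pred_onehot, 3))
print("prediction for product 2 (embedding):", np.round(pred_embed, 3))
```

In a learned variant, the rows of `embeddings` would be treated as free parameters and optimized together with the kernel hyperparameters, for example by maximizing the GP marginal likelihood; that step is omitted here.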