Chitre Aniket, Querimit Robert C M, Rihm Simon D, Karan Dogancan, Zhu Benchuan, Wang Ke, Wang Long, Hippalgaonkar Kedar, Lapkin Alexei A
Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK.
Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way, CREATE Tower #05-05, Singapore, 138602, Singapore.
Sci Data. 2024 Jul 3;11(1):728. doi: 10.1038/s41597-024-03573-w.
Liquid formulations are ubiquitous, yet their product development cycles are lengthy because complex physical interactions between ingredients make it difficult to tune formulations to customer-defined property targets. Interpolative ML models can accelerate liquid formulation design, but they are typically trained on limited sets of ingredients and without any structural information, which limits their predictive capacity outside the training domain. To address this challenge, we selected eighteen formulation ingredients covering a diverse chemical space to prepare an open experimental dataset for training ML models for rinse-off formulation development. The resulting design space has over 50-fold higher dimensionality than that of our previous work. Here, we present a dataset of 812 formulations, including 294 stable samples, which cover the entire design space, with phase stability, turbidity, and high-fidelity rheology measurements generated on our semi-automated, ML-driven liquid formulations workflow. Uniquely, the dataset includes sample-specific uncertainty measurements for training predictive surrogate models.