Analytical Research & Development, MRL, Merck & Co., Inc., Rahway, NJ, 07065, USA.
Pharm Res. 2024 Feb;41(2):365-374. doi: 10.1007/s11095-023-03646-2. Epub 2024 Feb 8.
Significant resources are spent on developing robust liquid chromatography (LC) methods with optimum conditions for every project in the pipeline. Although data-driven, computer-assisted modelling has been implemented to shorten method development timelines, these modelling approaches require project-specific screening data to model retention time (RT) as a function of method parameters. Sometimes method re-development is required, leading to additional investment and redundant laboratory work. Cheminformatics techniques have been used successfully to predict the RT of metabolites and other component mixtures in similar use cases. Here we show that these techniques can be used to model structurally diverse molecules, and that predictions from models trained on multiple LC conditions can be used for downstream data-driven modelling.
The Molecular Operating Environment (MOE) was used to calculate over 800 descriptors from the structures of the analytes. These descriptors were used to model the RT of the analytes under four chromatographic conditions. Predictions from these structure-based models were then used to build data-driven models in LC-SIM.
A structure-based Random Forest (RF) model outperformed other techniques in cross-validation studies and predicted the RTs of a randomized test set with a median percentage error of less than 4% for all LC conditions. RTs predicted by this structure-based model were used to fit a data-driven model that identifies optimum LC conditions without any additional experimental work.
These results show that small training sets can yield pharmaceutically relevant models when structure-based and data-driven models are used in combination.
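The median percentage error quoted for the RF model can be illustrated with a short sketch. It assumes the metric is the median of the absolute relative errors between predicted and observed RTs, expressed in percent; the abstract does not spell out the exact definition. The function name `median_percentage_error` and the retention times below are hypothetical and are not data from the study.

```python
from statistics import median


def median_percentage_error(observed, predicted):
    """Median of |predicted - observed| / observed, in percent.

    Assumed definition of the metric; the abstract does not state the formula.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one retention time is required")
    if any(o <= 0 for o in observed):
        raise ValueError("observed retention times must be positive")
    # Absolute relative error of each prediction, as a percentage.
    errors = [abs(p - o) / o * 100.0 for o, p in zip(observed, predicted)]
    return median(errors)


# Made-up retention times in minutes, for illustration only.
observed_rt = [2.0, 4.0, 5.0, 8.0, 10.0]
predicted_rt = [2.1, 3.9, 5.0, 8.4, 9.0]

mdpe = median_percentage_error(observed_rt, predicted_rt)
print(f"Median percentage error: {mdpe:.1f}%")
```

With these illustrative values the result is about 5%. Because the median is insensitive to a few badly predicted analytes, it is worth checking alongside the full error distribution when judging a model against a threshold such as 4%.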