Gonzalez Hernandez Ferran, Carter Simon J, Iso-Sipilä Juha, Goldsmith Paul, Almousa Ahmed A, Gastine Silke, Lilaonitkul Watjana, Kloprogge Frank, Standing Joseph F
CoMPLEX, University College London, London, UK.
The Alan Turing Institute, London, UK.
Wellcome Open Res. 2021 Apr 21;6:88. doi: 10.12688/wellcomeopenres.16718.1. eCollection 2021.
Pharmacokinetic (PK) predictions of new chemical entities are aided by prior knowledge from other compounds. The development of robust algorithms that improve the preclinical and clinical phases of drug development remains constrained by the need to search, curate and standardise PK information across the constantly growing scientific literature. The lack of centralised, up-to-date and comprehensive repositories of PK data represents a significant limitation in the drug development pipeline. In this work, we propose a machine learning approach to automatically identify and characterise scientific publications reporting PK parameters from in vivo data, providing a centralised repository of PK literature. A dataset of 4,792 PubMed publications was labelled by field experts according to whether in vivo PK parameters were estimated in the study. Different classification pipelines were compared using a bootstrap approach, and the best-performing architecture was used to develop a comprehensive and automatically updated repository of PK publications. The best-performing architecture encoded documents using unigram features and mean pooling of BioBERT embeddings, obtaining an F1 score of 83.8% on the test set. The pipeline retrieved over 121K PubMed publications in which in vivo PK parameters were estimated, and it was scheduled to perform weekly updates on newly published articles. All the relevant documents were released through a publicly available web interface (https://app.pkpdai.com) and characterised by the drugs, species and conditions mentioned in the abstract, to facilitate the subsequent search for relevant PK data. This automated, open-access repository can be used to accelerate the search and comparison of PK results, curate ADME datasets, and facilitate subsequent text mining tasks in the PK domain.
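The document encoding described in the abstract (unigram features together with mean pooling of BioBERT embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the two feature sets are combined, so concatenation is an assumption here, as are the function names, the three-term vocabulary, and the small hand-written array that stands in for real BioBERT token embeddings.

```python
from collections import Counter

import numpy as np


def unigram_counts(tokens, vocabulary):
    """Bag-of-words vector: how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return np.array([counts[term] for term in vocabulary], dtype=float)


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1.0)


def encode_document(tokens, vocabulary, token_embeddings, attention_mask):
    """Assumed combination: concatenate unigram counts with the pooled embedding."""
    return np.concatenate([
        unigram_counts(tokens, vocabulary),
        mean_pool(token_embeddings, attention_mask),
    ])


# Hypothetical toy inputs; a real pipeline would take the embeddings from BioBERT.
vocabulary = ["clearance", "half-life", "plasma"]
tokens = ["plasma", "clearance", "plasma"]
toy_embeddings = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [9.0, 9.0],  # padding position, excluded by the mask
])
toy_mask = np.array([1, 1, 1, 0])

features = encode_document(tokens, vocabulary, toy_embeddings, toy_mask)
print(features)
```

The resulting vector would then be passed to a binary classifier that decides whether a publication reports in vivo PK parameters; the choice of classifier is not specified in the abstract and is left open here.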