Classification and analysis of a large collection of in vivo bioassay descriptions.

Author information

Zwierzyna Magdalena, Overington John P

Affiliations

BenevolentAI, London, United Kingdom.

Institute of Cardiovascular Science, University College London, London, United Kingdom.

Publication information

PLoS Comput Biol. 2017 Jul 5;13(7):e1005641. doi: 10.1371/journal.pcbi.1005641. eCollection 2017 Jul.

Abstract

Testing potential drug treatments in animal disease models is a decisive step in all preclinical drug discovery programs. Yet, despite the importance of such experiments for translational medicine, there have been relatively few efforts to comprehensively and consistently analyze the data produced by in vivo bioassays. This is partly due to their complexity and the lack of accepted reporting standards: publicly available animal screening data are only accessible in unstructured free-text format, which hinders computational analysis. In this study, we use text mining to extract information from the descriptions of over 100,000 drug screening-related assays in rats and mice. We retrieve our dataset from ChEMBL, an open-source, literature-based database focused on preclinical drug discovery. We show that in vivo assay descriptions can be effectively mined for relevant information, including experimental factors that might influence the outcome and reproducibility of animal research: genetic strains, experimental treatments, and phenotypic readouts used in the experiments. We further systematize the extracted information using an unsupervised language model (Word2Vec), which learns semantic similarities between terms and phrases, allowing identification of related animal models and classification of entire assay descriptions. In addition, we show that random forest models trained on features generated by Word2Vec can predict the class of drugs tested in different in vivo assays with high accuracy. Finally, we combine information mined from text with curated annotations stored in ChEMBL to investigate the patterns of usage of different animal models across a range of experiments, drug classes, and disease areas.
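To make the text-mining step more concrete, the following is a minimal sketch of dictionary-based extraction of strain mentions from free-text assay descriptions. It is not the authors' pipeline: the abstract does not say how terms were matched, so the regular-expression approach, the `STRAIN_TERMS` vocabulary, the function names, and the example descriptions are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical strain vocabulary, for illustration only; the abstract does not
# list the dictionaries used in the study.
STRAIN_TERMS = ["Sprague-Dawley", "Wistar", "C57BL/6", "BALB/c"]


def build_matcher(terms):
    """Compile one case-insensitive pattern that matches any whole term."""
    # Try longer terms first so a short term never shadows a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # The lookarounds reject matches embedded in a longer alphanumeric token,
    # so "C57BL/6" does not fire inside "C57BL/6J".
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


STRAIN_RE = build_matcher(STRAIN_TERMS)
CANONICAL = {term.lower(): term for term in STRAIN_TERMS}


def extract_strains(description):
    """Return canonical strain names in order of appearance."""
    return [CANONICAL[m.group(0).lower()] for m in STRAIN_RE.finditer(description)]


def count_strains(descriptions):
    """Count how many descriptions mention each strain at least once."""
    counts = Counter()
    for description in descriptions:
        counts.update(set(extract_strains(description)))
    return counts


# Invented descriptions written in the style of assay text; not real records.
EXAMPLES = [
    "Antinociceptive activity in male wistar rat assessed as inhibition of writhing",
    "Antitumor activity in BALB/c mouse xenografted with human tumor cells",
    "Hypotensive activity in Sprague-Dawley rat; Wistar rat used as control",
]

if __name__ == "__main__":
    for text in EXAMPLES:
        print(extract_strains(text), "<-", text)
    print(count_strains(EXAMPLES))
```

Mapping every hit back to a canonical spelling is what turns free text into a structured annotation that can be counted or joined against curated fields. A fixed dictionary like this one misses spelling variants and substrains, which is one reason a learned representation such as Word2Vec is useful for grouping related terms.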

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3cb2/5517062/06c386692484/pcbi.1005641.g001.jpg
