Department of Comparative Medicine, Yale School of Medicine, New Haven, Connecticut, United States of America.
Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland, United States of America.
PLoS One. 2019 Dec 17;14(12):e0226176. doi: 10.1371/journal.pone.0226176. eCollection 2019.
Discovery studies in animals constitute a cornerstone of biomedical research, but suffer from a lack of generalizability to human populations. We propose that large-scale interrogation of these data could reveal patterns of animal use that could narrow the translational divide. We describe a text-mining program that extracts translationally useful data from PubMed abstracts. The program comprises six modules: species, model, genes, interventions/disease modifiers, overall outcome and functional outcome measures. Existing National Library of Medicine natural language processing tools (SemRep, GNormPlus and the Chemical annotator) underpin the program and are further augmented by various rules, term lists, and machine learning models. In an evaluation on a 98-abstract test set, the program achieved F1 scores ranging from 0.75 to 0.95 across all modules, exceeding the F1 scores obtained from comparable baseline programs. Next, the program was applied to a larger data set of 14,481 abstracts (2008-2017). Expected and previously identified patterns of species and model use for the field were obtained. As previously noted, the majority of studies reported promising outcomes. Longitudinal patterns of intervention type or gene mentions were demonstrated, and patterns of animal model use characteristic of the Parkinson's disease field were confirmed. The primary function of the program is to overcome the low external validity of animal model systems by aggregating evidence across a diversity of models that capture different aspects of a multifaceted cellular process. Some aspects of the tool are generalizable, whereas others are field-specific. In the initial version presented here, we demonstrate proof of concept within a single disease area, Parkinson's disease. However, the program can be expanded in modular fashion to support a wider range of neurodegenerative diseases.
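For readers unfamiliar with the evaluation metric reported in the abstract, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch only, assuming exact set-based matching of extracted terms against gold-standard annotations; the abstract does not state the matching criteria actually used in the study, and the function name `f1_score` and the example terms are hypothetical.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, F1) for two collections of extracted items.

    Assumption: an item counts as correct only if it matches a gold item exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical species annotations for one abstract (illustration only).
gold = {"mouse", "rat", "zebrafish", "monkey"}
predicted = {"mouse", "rat", "dog"}

precision, recall, f1 = f1_score(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this illustration two of the three predicted terms are correct and two of the four gold terms are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.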