BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.

Affiliation

School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia.

Publication information

Database (Oxford). 2018 Jan 1;2018:bay122. doi: 10.1093/database/bay122.

DOI:10.1093/database/bay122

PMID:30576491

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC6301335/

Abstract

Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), which are important in the precision medicine context for capturing individual genotype variation related to disease.

We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks: the best existing mutation recognition tool achieves only 55% recall on the document triage training set, while relation extraction performance is limited by the low recall of gene entity recognition. We develop our models accordingly. For document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study. For relation extraction, we employ an ensemble of BioNLP tools.

Our best document triage model achieves an F-score of 66.77%, while our best relation extraction model achieves an F-score of 35.09% on the final (updated post-task) test set. Document triage performance is affected by the characteristics of mutations, which are statistically different in the training and test sets.

While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
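
For readers less familiar with the recall and F-score figures quoted in the abstract, the following is a minimal sketch of how such metrics are conventionally computed from prediction counts. It assumes the balanced F1 variant, which the abstract does not state explicitly. The function name `precision_recall_f1` and the counts in the usage example are hypothetical and are not taken from the paper or the BioCreative evaluation scripts.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only: a recogniser that finds
# 55 of 100 gold-standard mentions and also emits 20 spurious ones.
p, r, f = precision_recall_f1(tp=55, fp=20, fn=45)
print(f"precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```

Because F1 is a harmonic mean, a low recall in an upstream entity recogniser caps the score of any downstream component that depends on it, which is consistent with the cascading effect the abstract describes.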
