Burns Gully A P C, Dasigi Pradeep, de Waard Anita, Hovy Eduard H
Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Marina del Rey, CA 90292, USA
Carnegie Mellon University, Language Technologies Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA.
Database (Oxford). 2016 Aug 31;2016. doi: 10.1093/database/baw122. Print 2016.
Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This does not directly reflect the typical discourse structure used by scientists to construct an argument from the experimental data available within an article, and is therefore less likely to correspond to representations typically used in biomedical informatics systems (let alone to the mental models that scientists have). In this study, we develop Natural Language Processing methods to locate, extract, and classify the individual passages of text from articles' Results sections that refer to experimental data. In our domain of interest (molecular biology studies of cancer signal transduction pathways), individual articles may contain as many as 30 small-scale individual experiments describing a variety of findings, upon which authors base their overall research conclusions. Our system automatically classifies discourse segments in these texts into seven categories (fact, hypothesis, problem, goal, method, result, implication) with an F-score of 0.68. These segments describe the essential building blocks of scientific discourse to (i) provide context for each experiment, (ii) report experimental details and (iii) explain the data's meaning in context. We evaluate our system on text passages from articles that were curated in molecular biology databases (the Pathway Logic Datum repository, the Molecular Interaction MINT and INTACT databases) linking individual experiments in articles to the type of assay used (coprecipitation, phosphorylation, translocation etc.). We use supervised machine learning techniques on text passages containing unambiguous references to experiments to obtain baseline F1 scores of 0.59 for MINT, 0.71 for INTACT and 0.63 for Pathway Logic.
Although preliminary, these results support the notion that targeting information extraction methods to experimental results could provide accurate, automated methods for biocuration. We also suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases.
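The F-scores quoted in the abstract summarize how well predicted labels agree with curated ones. As a minimal sketch of how such a score can be computed for the seven discourse categories, the Python below derives per-class F1 and an unweighted (macro) average from parallel lists of gold and predicted labels. The abstract does not say which averaging scheme was used, so macro-averaging is an assumption here, and the function names and toy labels are hypothetical illustrations rather than the authors' code or data.

```python
from collections import Counter

# The seven discourse segment categories named in the abstract.
CATEGORIES = ["fact", "hypothesis", "problem", "goal",
              "method", "result", "implication"]


def per_class_f1(gold, predicted):
    """Return {category: F1} for parallel lists of gold and predicted labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # category p was predicted but is wrong
            fn[g] += 1  # gold category g was missed
    scores = {}
    for c in CATEGORIES:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over categories seen in gold or predicted."""
    scores = per_class_f1(gold, predicted)
    seen = set(gold) | set(predicted)
    present = [scores[c] for c in CATEGORIES if c in seen]
    return sum(present) / len(present) if present else 0.0


# Toy labels invented for illustration only (not data from the study).
gold = ["goal", "method", "result", "result", "implication", "fact"]
pred = ["goal", "method", "result", "implication", "implication", "result"]

if __name__ == "__main__":
    for category, score in per_class_f1(gold, pred).items():
        print(f"{category:12s} {score:.2f}")
    print(f"macro F1     {macro_f1(gold, pred):.2f}")
```

A micro-averaged or support-weighted F1 would give a different number on imbalanced data, which is one reason the averaging choice matters when comparing against the reported 0.68.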