Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network.

Author information

David Rakesh, Menezes Rhys-Joshua D, De Klerk Jan, Castleden Ian R, Hooper Cornelia M, Carneiro Gustavo, Gilliham Matthew

Affiliations

School of Agriculture, Food and Wine, The Waite Research Institute, ARC Centre of Excellence in Plant Energy Biology, Waite Campus, The University of Adelaide, Adelaide, SA, Australia.

School of Computer Science, Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA, Australia.

Publication information

Sci Rep. 2021 Jan 18;11(1):1696. doi: 10.1038/s41598-020-80441-8.

Abstract

The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
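As a reading aid for the scores quoted in the abstract, the following minimal Python sketch shows how precision, recall, accuracy and F1 are conventionally computed from confusion-matrix counts. The function names and the example counts are illustrative assumptions and are not taken from the paper. The last lines compute the harmonic mean of the reported average precision and recall; the reported F1 of 88.4% is presumably averaged over the 30 runs, so it need not match that value exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. The counts used below are hypothetical.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)

# Average precision and recall as reported in the abstract.
f1_from_means = f1_score(0.951, 0.828)
print(f"{f1_from_means:.3f}")  # prints 0.885
```

The small gap between 88.5% here and the reported 88.4% is consistent with F1 being computed per run and then averaged, since the harmonic mean of averages generally differs from the average of harmonic means.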

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/da28/7813825/54e7609a09c2/41598_2020_80441_Fig1_HTML.jpg
