Identifying protein subcellular localisation in scientific literature using bidirectional deep recurrent neural network.

Author information

David Rakesh, Menezes Rhys-Joshua D, De Klerk Jan, Castleden Ian R, Hooper Cornelia M, Carneiro Gustavo, Gilliham Matthew

Affiliations

School of Agriculture, Food and Wine, The Waite Research Institute, ARC Centre of Excellence in Plant Energy Biology, Waite Campus, The University of Adelaide, Adelaide, SA, Australia.

School of Computer Science, Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA, Australia.

Publication information

Sci Rep. 2021 Jan 18;11(1):1696. doi: 10.1038/s41598-020-80441-8.

DOI:10.1038/s41598-020-80441-8

PMID:33462256

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7813825/

Abstract

The increased diversity and scale of published biological data has led to a growing appreciation for the applications of machine learning and statistical methodologies to gain new insights. Key to achieving this aim is solving the Relationship Extraction problem, which specifies the semantic interaction between two or more biological entities in a published study. Here, we employed two deep neural network natural language processing (NLP) methods, namely the continuous bag of words (CBOW) and the bi-directional long short-term memory (bi-LSTM). These methods were employed to predict relations between entities that describe protein subcellular localisation in plants. We applied our system to 1700 published Arabidopsis protein subcellular studies from the SUBA manually curated dataset. The system combines pre-processing of full-text articles in a machine-readable format with relevant sentence extraction for downstream NLP analysis. Using the SUBA corpus, the neural network classifier predicted interactions between protein name, subcellular localisation and experimental methodology with average precision, recall, accuracy and F1 scores of 95.1%, 82.8%, 89.3% and 88.4%, respectively (n = 30). Comparable scoring metrics were obtained using the CropPAL database, which stores protein subcellular localisation in crop species, as an independent testing dataset, demonstrating the wide applicability of the prediction model. We provide a framework for extracting protein functional features from unstructured text in the literature with high accuracy, improving data dissemination and unlocking the potential of big data text analytics for generating new hypotheses.
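The four scores quoted in the abstract follow from the standard confusion-matrix definitions. The sketch below is illustrative only: the `Confusion` class, the `harmonic_mean` helper and the example counts are assumptions made for this note, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary relation classifier (hypothetical container)."""
    tp: int  # relations correctly predicted as present
    fp: int  # relations predicted as present but absent
    fn: int  # relations present but missed
    tn: int  # relations correctly predicted as absent

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

    def f1(self) -> float:
        return harmonic_mean(self.precision(), self.recall())


def harmonic_mean(p: float, r: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    run = Confusion(tp=80, fp=5, fn=20, tn=95)
    print(f"precision={run.precision():.3f} recall={run.recall():.3f}")
    print(f"accuracy={run.accuracy():.3f} f1={run.f1():.3f}")

    # Harmonic mean of the averaged precision and recall from the abstract.
    print(f"{harmonic_mean(0.951, 0.828):.3f}")  # prints 0.885
```

The last line gives about 88.5%, close to the reported average F1 of 88.4%. A small gap is expected if F1 was computed for each of the 30 runs and then averaged, since that is not the same as taking the harmonic mean of the averaged precision and recall.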

Fig. 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/da28/7813825/54e7609a09c2/41598_2020_80441_Fig1_HTML.jpg

Similar articles

Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text.

BMC Med Inform Decis Mak. 2019 Jan 31;19(Suppl 1):22. doi: 10.1186/s12911-019-0736-9.

Clinical Context-Aware Biomedical Text Summarization Using Deep Neural Network: Model Development and Validation.

J Med Internet Res. 2020 Oct 23;22(10):e19810. doi: 10.2196/19810.

Clinical Named Entity Recognition From Chinese Electronic Health Records via Machine Learning Methods.

JMIR Med Inform. 2018 Dec 17;6(4):e50. doi: 10.2196/medinform.9965.

Dependency-based Siamese long short-term memory network for learning sentence representations.

PLoS One. 2018 Mar 7;13(3):e0193919. doi: 10.1371/journal.pone.0193919. eCollection 2018.

Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes.

JAMIA Open. 2019 Apr;2(1):139-149. doi: 10.1093/jamiaopen/ooy061. Epub 2019 Jan 3.

Prediction of Stroke Outcome Using Natural Language Processing-Based Machine Learning of Radiology Report of Brain MRI.

J Pers Med. 2020 Dec 16;10(4):286. doi: 10.3390/jpm10040286.

Extracting comprehensive clinical information for breast cancer using deep learning methods.

Int J Med Inform. 2019 Dec;132:103985. doi: 10.1016/j.ijmedinf.2019.103985. Epub 2019 Oct 2.

GHS-NET a generic hybridized shallow neural network for multi-label biomedical text classification.

J Biomed Inform. 2021 Apr;116:103699. doi: 10.1016/j.jbi.2021.103699. Epub 2021 Feb 15.

Entity recognition from clinical texts via recurrent neural network.

BMC Med Inform Decis Mak. 2017 Jul 5;17(Suppl 2):67. doi: 10.1186/s12911-017-0468-7.

Cited by

Protein subcellular localization prediction tools.

Comput Struct Biotechnol J. 2024 Apr 15;23:1796-1807. doi: 10.1016/j.csbj.2024.04.032. eCollection 2024 Dec.

Prenatal exposures to endocrine disrupting chemicals: The role of multi-omics in understanding toxicity.

Mol Cell Endocrinol. 2023 Dec 1;578:112046. doi: 10.1016/j.mce.2023.112046. Epub 2023 Aug 19.

References

UPCLASS: a deep learning-based classifier for UniProtKB entry publications.

Database (Oxford). 2020 Jan 1;2020. doi: 10.1093/database/baaa026.

Biophysical prediction of protein-peptide interactions and signaling networks using machine learning.

Nat Methods. 2020 Feb;17(2):175-183. doi: 10.1038/s41592-019-0687-1. Epub 2020 Jan 6.

The 27th annual Nucleic Acids Research database issue and molecular biology database collection.

Nucleic Acids Res. 2020 Jan 8;48(D1):D1-D8. doi: 10.1093/nar/gkz1161.

Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.

Database (Oxford). 2019 Jan 1;2019:bay147. doi: 10.1093/database/bay147.

Drug-drug interaction extraction from biomedical texts using long short-term memory network.

J Biomed Inform. 2018 Oct;86:15-24. doi: 10.1016/j.jbi.2018.08.005. Epub 2018 Aug 21.

Recent Advances in the Machine Learning-Based Drug-Target Interaction Prediction.

Curr Drug Metab. 2019;20(3):194-202. doi: 10.2174/1389200219666180821094047.

Using machine learning tools for protein database biocuration assistance.

Sci Rep. 2018 Jul 5;8(1):10148. doi: 10.1038/s41598-018-28330-z.

A gene-phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach.

Bioinformatics. 2018 Jul 1;34(13):i386-i394. doi: 10.1093/bioinformatics/bty263.

MU-LOC: A Machine-Learning Method for Predicting Mitochondrially Localized Proteins in Plants.

Front Plant Sci. 2018 May 23;9:634. doi: 10.3389/fpls.2018.00634. eCollection 2018.

Long short-term memory RNN for biomedical named entity recognition.

BMC Bioinformatics. 2017 Oct 30;18(1):462. doi: 10.1186/s12859-017-1868-5.
