Similar articles

1. A bacterial phyla dataset for protein function prediction.
   Data Brief. 2019 Dec 18;28:105002. doi: 10.1016/j.dib.2019.105002. eCollection 2020 Feb.
2. A deep learning ensemble for function prediction of hypothetical proteins from pathogenic bacterial species.
   Comput Biol Chem. 2019 Dec;83:107147. doi: 10.1016/j.compbiolchem.2019.107147. Epub 2019 Oct 19.
4. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature.
   J Biomed Semantics. 2018 Apr 12;9(1):13. doi: 10.1186/s13326-018-0181-1.
5. Protein function prediction from protein-protein interaction network using gene ontology based neighborhood analysis and physico-chemical features.
   J Bioinform Comput Biol. 2018 Dec;16(6):1850025. doi: 10.1142/S0219720018500257. Epub 2018 Sep 19.
6. Predicting functions of maize proteins using graph convolutional network.
   BMC Bioinformatics. 2020 Dec 16;21(Suppl 16):420. doi: 10.1186/s12859-020-03745-6.
7. Protein function prediction with gene ontology: from traditional to deep learning models.
   PeerJ. 2021 Aug 24;9:e12019. doi: 10.7717/peerj.12019. eCollection 2021.
8. Automatic annotation of protein motif function with Gene Ontology terms.
   BMC Bioinformatics. 2004 Sep 2;5:122. doi: 10.1186/1471-2105-5-122.
9. DeepFunc: A Deep Learning Framework for Accurate Prediction of Protein Functions from Protein Sequences and Interactions.
   Proteomics. 2019 Jun;19(12):e1900019. doi: 10.1002/pmic.201900019. Epub 2019 May 27.
10. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank.
    Bioinformatics. 2018 Jul 15;34(14):2465-2473. doi: 10.1093/bioinformatics/bty130.

Cited by

1. PANDA2: protein function prediction using graph neural networks.
   NAR Genom Bioinform. 2022 Feb 2;4(1):lqac004. doi: 10.1093/nargab/lqac004. eCollection 2022 Mar.
2. Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing.
   Biology (Basel). 2020 Sep 18;9(9):295. doi: 10.3390/biology9090295.

References

1. A deep learning ensemble for function prediction of hypothetical proteins from pathogenic bacterial species.
   Comput Biol Chem. 2019 Dec;83:107147. doi: 10.1016/j.compbiolchem.2019.107147. Epub 2019 Oct 19.
2. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences.
   Bioinformatics. 2018 Jul 15;34(14):2499-2502. doi: 10.1093/bioinformatics/bty140.
3. Biopython: freely available Python tools for computational molecular biology and bioinformatics.
   Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163. Epub 2009 Mar 20.
4. The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function.
   PLoS Comput Biol. 2008 Oct;4(10):e1000160. doi: 10.1371/journal.pcbi.1000160. Epub 2008 Oct 31.
5. Manual curation is not sufficient for annotation of genomic databases.
   Bioinformatics. 2007 Jul 1;23(13):i41-8. doi: 10.1093/bioinformatics/btm229.
6. The PROSITE database.
   Nucleic Acids Res. 2006 Jan 1;34(Database issue):D227-30. doi: 10.1093/nar/gkj063.
7. Beyond annotation transfer by homology: novel protein-function prediction methods to assist drug discovery.
   Drug Discov Today. 2005 Nov 1;10(21):1475-82. doi: 10.1016/S1359-6446(05)03621-4.
8. Human genome. Reaching their goal early, sequencing labs celebrate.
   Science. 2003 Apr 18;300(5618):409. doi: 10.1126/science.300.5618.409.

A bacterial phyla dataset for protein function prediction.

Authors

Mishra Sarthak, Rastogi Yash Pratap, Jabin Suraiya, Kaur Punit, Amir Mohammad, Khatoon Shabanam

Affiliations

Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India.

Department of Biophysics, All India Institute of Medical Sciences (AIIMS), New Delhi, 110029, Delhi, India.

Publication information

Data Brief. 2019 Dec 18;28:105002. doi: 10.1016/j.dib.2019.105002. eCollection 2020 Feb.

DOI: 10.1016/j.dib.2019.105002
PMID: 31921945
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6950771/

Abstract

Protein function prediction is one of the most studied and most challenging problems in computational biology. The vast majority of known proteins have not yet been characterised experimentally, and there is a significant gap between their structures and functions. New unannotated sequences are being added to public protein databases (e.g. UniProtKB) at an enormous pace [1]. Such proteins of unknown function might play key roles in metabolism and in the regulation of growth and development. Thus, if the functions of unknown proteins are left undiscovered, researchers may miss important information. Computational biology tools can provide insights into the function of proteins based on their sequence, structure, evolutionary history, and association with other proteins [2]. For proteins with well-characterised close relatives, inferring function is trivial. Orphan proteins without discernible sequence relatives present a greater challenge [3]. Here, experimental characterisation proceeds blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be completely characterised experimentally [4]. Thus, there is an urgent need to develop fast and accurate computational approaches to meet this requirement. To this end, we prepared a dataset for protein function prediction by extracting the sequences and annotations of reviewed prokaryotic proteins (323,719 in total, as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those annotated with the 1739 most frequent Gene Ontology (Molecular Function) terms, yielding 171,212 proteins for feature generation. The dataset was generated by calculating sequence, sub-sequence, physicochemical, and annotation-based features for each of the 171,212 reviewed proteins using the method in [10].
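
The filtering step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `filter_by_frequent_terms`, the toy protein entries, and their term assignments are hypothetical, and k is set to 2 here where the paper uses the 1739 most frequent Molecular Function terms.

```python
from collections import Counter

def filter_by_frequent_terms(annotations, k):
    """Keep the k most frequent GO terms and the proteins that retain at least one.

    annotations: dict mapping a protein entry to a set of GO term IDs.
    Returns (kept_terms, filtered_annotations).
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    # Rank by descending frequency, then by term ID so ties break deterministically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    kept_terms = set(ranked[:k])
    filtered = {}
    for entry, terms in annotations.items():
        retained = terms & kept_terms
        if retained:  # drop proteins left with no label after filtering
            filtered[entry] = retained
    return kept_terms, filtered

# Hypothetical toy data: entries and their term assignments are made up.
toy = {
    "P1": {"GO:0005524", "GO:0003677"},
    "P2": {"GO:0005524"},
    "P3": {"GO:0046872"},
    "P4": {"GO:0003677", "GO:0005524"},
}
kept, filtered = filter_by_frequent_terms(toy, k=2)
```

In this toy case the least frequent term is discarded, so the protein annotated only with that term drops out of the filtered set.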
These features constitute a total of 9890 attributes for each protein sequence, alongside the 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target labels. The dataset also contains the Entry and Entry name of each sequence as recorded in the UniProtKB database. Because the dataset is large (171,212 samples × 9890 features, with 1739 classes and multiple labels per sample) and contains a sufficient number of positive and negative samples for each of the 1739 classes, it is well suited to testing the efficiency of upcoming deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in the ratio 3:1 to form train/test dataset 1: a train dataset with 128,409 samples and a test dataset with 42,803 samples, to facilitate training of a deep learning model. The train and test datasets are stratified so that each of the 1739 classes is well represented in both. We then prepared dataset 2, consisting of unreviewed pathogenic proteins from the 9 bacterial phyla, each with the same 9890 features as the train/test dataset of reviewed proteins but without target labels, so that their functions can be predicted using the deep learning model proposed in [5].
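
As a rough illustration of the multi-label target layout and the 3:1 split sizes quoted above, the sketch below encodes label sets as 0/1 indicator rows and recomputes the train/test counts. The helper names `to_indicator_matrix` and `split_sizes` and the three-term toy label space are assumptions made for this example. It does not reproduce the stratification itself, which the abstract does not detail.

```python
def to_indicator_matrix(labels_per_protein, classes):
    """Encode multi-label targets as rows of 0/1 flags, one column per class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = []
    for labels in labels_per_protein:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        matrix.append(row)
    return matrix

def split_sizes(n_samples, train_parts=3, test_parts=1):
    """Sizes of a train/test split in the ratio train_parts:test_parts."""
    n_train = n_samples * train_parts // (train_parts + test_parts)
    return n_train, n_samples - n_train

# Toy label space of three terms; the real dataset has 1739 classes.
classes = ["GO:0003677", "GO:0005524", "GO:0046872"]
targets = to_indicator_matrix(
    [{"GO:0005524"}, {"GO:0003677", "GO:0046872"}], classes
)

# Recompute the split sizes for the 171,212 reviewed proteins.
n_train, n_test = split_sizes(171_212)
```

With 171,212 samples the 3:1 ratio gives 128,409 training and 42,803 test samples, matching the counts reported in the abstract.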
