Automatic discovery of cross-family sequence features associated with protein function.

Authors

Brameier Markus, Haan Josien, Krings Andrea, MacCallum Robert M

Affiliation

Stockholm Bioinformatics Center, Stockholm University, 106 91 Stockholm, Sweden.

Publication

BMC Bioinformatics. 2006 Jan 12;7:16. doi: 10.1186/1471-2105-7-16.

DOI:10.1186/1471-2105-7-16
PMID:16409628
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1395344/
Abstract

BACKGROUND

Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterized protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed.

RESULTS

We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location.
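
As an illustration of the pairing described above, the sketch below scores how well one amino acid regular expression agrees with one keyword-based logical expression over a small set of annotated proteins. It is a minimal sketch under stated assumptions, not the authors' implementation: the sequences and keywords are invented toy data, the Matthews correlation coefficient is an assumed stand-in for whatever fitness measure the paper uses, and the genetic programming loop that would evolve both expressions is omitted. The names `PROTEINS`, `transcription_rule`, `nuclear_rule`, and `association` are hypothetical.

```python
import re
from math import sqrt

# Invented toy records (NOT data from the paper): (sequence, annotation keywords).
PROTEINS = [
    ("MSTQQQQQQQQAAKLR", {"transcription", "nuclear"}),
    ("MKRQQQQQQQPLSDE", {"transcription", "nuclear"}),
    ("MAAKRKRSLEDNGT", {"nuclear"}),
    ("MLSRAVCGTSRQLA", {"mitochondrion"}),
    ("MKTLLVAGALSCEG", {"biosynthesis"}),
    ("MGDQQQQQQHHTPA", {"transcription"}),
]


def transcription_rule(keywords):
    """Keyword-based logical expression: transcription AND NOT mitochondrion."""
    return "transcription" in keywords and "mitochondrion" not in keywords


def nuclear_rule(keywords):
    """Keyword-based logical expression: nuclear."""
    return "nuclear" in keywords


def association(pattern, rule, proteins):
    """Matthews correlation between regex matches and keyword-rule matches.

    Assumption: MCC is used here only as a convenient agreement score;
    the source does not state which fitness function was used.
    """
    regex = re.compile(pattern)
    tp = fp = fn = tn = 0
    for sequence, keywords in proteins:
        seq_hit = regex.search(sequence) is not None
        kw_hit = rule(keywords)
        if seq_hit and kw_hit:
            tp += 1
        elif seq_hit:
            fp += 1
        elif kw_hit:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


POLY_Q = r"Q{6,}"  # a run of six or more glutamines
score_transcription = association(POLY_Q, transcription_rule, PROTEINS)
score_nuclear = association(POLY_Q, nuclear_rule, PROTEINS)
print(score_transcription, score_nuclear)
```

On this toy set the polyglutamine pattern agrees more closely with the "transcription" rule than with the "nuclear" rule, which mirrors the kind of comparison the abstract reports. The numbers themselves carry no biological meaning. In a co-evolutionary setup, a score of this kind could be assigned to each candidate (regular expression, keyword expression) pair so that both sides are selected for mutual agreement.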

CONCLUSION

We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/5da7117199c5/1471-2105-7-16-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/59964773cf43/1471-2105-7-16-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/69c7f4bcd706/1471-2105-7-16-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/117388810779/1471-2105-7-16-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/d40d7b19fd98/1471-2105-7-16-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7820/1395344/e872278e0c02/1471-2105-7-16-6.jpg
