BayesMotif: de novo protein sorting motif discovery from impure datasets.

Affiliation

Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA.

Publication information

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S66. doi: 10.1186/1471-2105-11-S1-S66.

DOI: 10.1186/1471-2105-11-S1-S66
PMID: 20122242
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3009540/

Abstract

BACKGROUND

Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside the cell. This process is precisely regulated by protein sorting signals of various forms. A major category of sorting signals is amino acid subsequences, usually located at the N-termini or C-termini of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms.

METHODS

We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier-based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif, in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure was developed to iteratively remove sequences that are unlikely to contain true motifs, so that the algorithm can identify motifs from impure input sequences.
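
The abstract does not give the details of the classifier or of the removal procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the motif region is a fixed-length window immediately downstream of the first anchor occurrence, that the classifier is a naive Bayes log-odds score of a position-specific motif model against a residue-composition background, and that a fixed number of the lowest-scoring sequences is dropped in each round. All function names, parameters, and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: only the 20 standard residues occur


def anchored_window(seq, anchor, flank):
    """Return the `flank` residues right after the first anchor hit, or None."""
    i = seq.find(anchor)
    if i < 0:
        return None
    start = i + len(anchor)
    window = seq[start:start + flank]
    return window if len(window) == flank else None


def background_model(sequences, pseudocount=1.0):
    """Position-independent residue log-probabilities from background sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: math.log((counts[aa] + pseudocount) / total) for aa in AMINO_ACIDS}


def motif_model(windows, pseudocount=1.0):
    """Per-position residue log-probabilities from equal-length windows."""
    total = len(windows) + pseudocount * len(AMINO_ACIDS)
    model = []
    for pos in range(len(windows[0])):
        counts = Counter(w[pos] for w in windows)
        model.append({aa: math.log((counts[aa] + pseudocount) / total)
                      for aa in AMINO_ACIDS})
    return model


def log_odds(motif, background, window):
    """Naive Bayes log-odds (equal priors) of motif versus background."""
    return sum(col[aa] - background[aa] for col, aa in zip(motif, window))


def remove_false_positives(candidates, negatives, anchor, flank,
                           rounds=2, drop_per_round=1):
    """Iteratively retrain on the retained set and drop the lowest scorers."""
    background = background_model(negatives)
    kept = []
    for seq in candidates:
        window = anchored_window(seq, anchor, flank)
        if window is not None:  # sequences without a usable anchor are discarded
            kept.append((seq, window))
    for _ in range(rounds):
        if len(kept) <= drop_per_round:
            break
        motif = motif_model([w for _, w in kept])
        kept.sort(key=lambda pair: log_odds(motif, background, pair[1]))
        kept = kept[drop_per_round:]
    return [seq for seq, _ in kept]


# Made-up toy data: four sequences share a hydrophobic window after "RR",
# two decoys carry the anchor but a charged, unrelated window.
candidates = ["MSRRLLLAGT", "MKRRLLLAQS", "MTRRLLIAGA", "MARRLLLAKK",
              "MGRRDEKDPW", "MHRRKDEDNC"]
negatives = ["ACDEFGHIKLMNPQRSTVWY"]
print(sorted(remove_false_positives(candidates, negatives, anchor="RR", flank=4)))
```

On this toy input the two decoy sequences score lowest in successive rounds and are removed, leaving the four sequences that share the hydrophobic window. A real procedure would need a principled stopping rule rather than a fixed number of rounds.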

RESULTS

Experiments on both implanted-motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also showed that the false positive removal procedure can help identify true motifs even when only 20% of the input sequences contain true motif instances.

CONCLUSION

We proposed BayesMotif, a novel Bayesian classification-based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short, highly conserved anchors. Our algorithm also has the advantage of easily incorporating additional meta-sequence features, such as the hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.
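
As a rough illustration of why meta-sequence features fit naturally into a Bayesian classifier, the sketch below computes two window-level features (mean Kyte-Doolittle hydropathy and net charge) and turns them into extra log-likelihood-ratio terms that could be added to a per-position score. The abstract does not say how BayesMotif models such features; the Gaussian class-conditional densities, the treatment of histidine as neutral, and all parameter values here are assumptions chosen only for illustration.

```python
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charges; histidine is treated as neutral (assumption).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def window_features(window):
    """Return (mean hydropathy, net charge) for a motif window."""
    hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in window) / len(window)
    charge = sum(CHARGE.get(aa, 0) for aa in window)
    return hydropathy, charge


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def feature_log_odds(window, motif_params, background_params):
    """Sum of per-feature Gaussian log-likelihood ratios (naive independence)."""
    total = 0.0
    for x, (m_mean, m_sd), (b_mean, b_sd) in zip(
            window_features(window), motif_params, background_params):
        total += gaussian_log_pdf(x, m_mean, m_sd) - gaussian_log_pdf(x, b_mean, b_sd)
    return total


# Hypothetical (mean, sd) pairs for (hydropathy, charge) in each class.
motif_params = ((3.0, 1.0), (0.0, 1.0))
background_params = ((-0.5, 2.0), (0.0, 1.5))

for window in ("LLLA", "DEKD"):
    print(window, window_features(window),
          feature_log_odds(window, motif_params, background_params))
```

Because a naive Bayes score is a sum of independent log-likelihood terms, feature terms like these can simply be added to a per-position residue score, which a plain PWM cannot express.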
