A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

Authors

Down Thomas, Leong Bernard, Hubbard Tim J P

Affiliation

Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

Publication information

BMC Bioinformatics. 2006 Sep 26;7:419. doi: 10.1186/1471-2105-7-419.

DOI:10.1186/1471-2105-7-419

PMID:17002805

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1592515/

Abstract

BACKGROUND

The splicing of RNA transcripts is thought to be partly promoted and regulated by sequences embedded within exons. Known sequences include binding sites for SR proteins, which are thought to mediate interactions between splicing factors bound to the 5' and 3' splice sites. It would be useful to identify further candidate sequences; however, identifying them computationally is hard, since exon sequences are also constrained by their functional role in coding for proteins.

RESULTS

The machine learning strategy identified a collection of motifs, including several previously reported splice enhancer elements. Although trained only on coding exons, the model discriminates both coding and non-coding exons from intragenic sequence.

CONCLUSION

We have trained a computational model able to detect signals in coding exons that seem to be orthogonal to the sequences' primary function of coding for proteins. We believe that many of the motifs detected here represent binding sites for previously unrecognized proteins that influence RNA splicing, as well as other regulatory elements.
