Splice site identification using probabilistic parameters and SVM classification.

Author information

Baten A K M A, Chang B C H, Halgamuge S K, Li Jason

Affiliation

Dynamic Systems and Control Research Group, DoMME, The University of Melbourne, Victoria 3010, Australia.

Publication information

BMC Bioinformatics. 2006 Dec 18;7 Suppl 5(Suppl 5):S15. doi: 10.1186/1471-2105-7-S5-S15.

DOI:10.1186/1471-2105-7-S5-S15

PMID:17254299

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC1764471/

Abstract

BACKGROUND

Recent advances and automation in DNA sequencing technology have created a vast amount of DNA sequence data. This growth demands better and more efficient analysis methods. Identifying genes in this newly accumulated data is an important issue in bioinformatics, and it requires prediction of the complete gene structure. Accurate identification of splice sites in DNA sequences plays a central role in gene structure prediction in eukaryotes. Effective detection of splice sites requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the splice site. Higher-order Markov models are generally regarded as a useful technique for modeling higher-order dependencies. However, their implementation requires estimating a large number of parameters, which is computationally expensive.
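To make the parameter growth concrete, the short sketch below counts the free transition parameters of a homogeneous k-th order Markov chain over the four nucleotides. It rests on a standard counting argument rather than on anything stated in the paper: each of the 4^k possible contexts carries a distribution over 4 next nucleotides, of which 3 probabilities are free. The function name `free_transition_parameters` is hypothetical, and the count ignores the initial distribution. A position-specific model would multiply the figure by roughly the number of modeled positions.

```python
def free_transition_parameters(order, alphabet_size=4):
    """Free transition parameters of a homogeneous Markov chain of the given order.

    Each of alphabet_size**order contexts has a distribution over alphabet_size
    symbols; one probability per context is fixed by normalization.
    """
    contexts = alphabet_size ** order
    return contexts * (alphabet_size - 1)


if __name__ == "__main__":
    # Print the count for orders 0 through 5 to show the exponential growth.
    for k in range(6):
        print(k, free_transition_parameters(k))
```

Under this counting, a first-order model needs 12 free transition parameters, while a fifth-order model needs 3072.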

RESULTS

The proposed method for splice site detection consists of two stages: a first-order Markov model (MM1) is used in the first stage, and a support vector machine (SVM) with a polynomial kernel is used in the second stage. The MM1 serves as a pre-processing step for the SVM and takes DNA sequences as its input. It models the compositional features and dependencies of nucleotides around splice site regions in terms of probabilistic parameters. These probabilistic parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. Compared with other existing standard splice site detection methods, the proposed MM1-SVM model shows superior performance in all cases.
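The first stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a position-specific first-order Markov model estimated from aligned, equal-length sequences containing only A, C, G, and T, with add-one pseudocounts, and it encodes each sequence as the vector of its per-position conditional probabilities. The function names `fit_mm1` and `encode`, the pseudocount, and the toy sequences are all hypothetical.

```python
ALPHABET = "ACGT"


def fit_mm1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order Markov parameters.

    Returns (start, trans) where start[a] approximates P(s_0 = a) and
    trans[i][(a, b)] approximates P(s_i = b | s_{i-1} = a) for i >= 1.
    trans[0] is never used and stays uniform.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")

    # Initialize every count with the pseudocount so no probability is zero.
    start_counts = {a: pseudocount for a in ALPHABET}
    trans_counts = [
        {(a, b): pseudocount for a in ALPHABET for b in ALPHABET}
        for _ in range(length)
    ]
    for seq in sequences:
        start_counts[seq[0]] += 1
        for i in range(1, length):
            trans_counts[i][(seq[i - 1], seq[i])] += 1

    # Normalize counts into probabilities.
    total = sum(start_counts.values())
    start = {a: c / total for a, c in start_counts.items()}
    trans = []
    for counts in trans_counts:
        row_totals = {a: sum(counts[(a, b)] for b in ALPHABET) for a in ALPHABET}
        trans.append({(a, b): counts[(a, b)] / row_totals[a] for (a, b) in counts})
    return start, trans


def encode(seq, model):
    """Map a sequence to its vector of per-position probabilistic parameters."""
    start, trans = model
    features = [start[seq[0]]]
    for i in range(1, len(seq)):
        features.append(trans[i][(seq[i - 1], seq[i])])
    return features


if __name__ == "__main__":
    # Toy, made-up windows around a GT dinucleotide (illustrative only).
    training = ["AAGGTAAG", "CAGGTGAG", "AAGGTAAG", "CAGGTAAG"]
    model = fit_mm1(training)
    print(encode("CAGGTAAG", model))
```

In a second stage, feature vectors like these, computed for both true and false candidate sites, would be passed to an SVM with a polynomial kernel. One option, assumed here rather than taken from the paper, is scikit-learn's `sklearn.svm.SVC(kernel="poly")`.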

CONCLUSION

We proposed an effective pre-processing scheme for the SVM and applied it to the identification of splice sites. The result is a simple yet effective splice site detection method that shows better classification accuracy and computational speed than some more complex methods.
