Poly(A) motif prediction using spectral latent features from human DNA sequences.

Affiliation

College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Publication information

Bioinformatics. 2013 Jul 1;29(13):i316-25. doi: 10.1093/bioinformatics/btt218.

DOI: 10.1093/bioinformatics/btt218

PMID: 23813000

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3694652/

Abstract

MOTIVATION

Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.

RESULTS

We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. Meanwhile, our method makes ~30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before.
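The reductions quoted above (26, 15 and 35%) read most naturally as relative changes against the baseline's rates, not as differences in percentage points. The sketch below illustrates that arithmetic under that assumption. The confusion counts are hypothetical, chosen only so that the resulting percentages match the quoted figures; they are not taken from the paper, whose average is computed over 12 motif variants. The names `Confusion` and `relative_reduction` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts for a binary true-motif vs. false-motif classifier."""
    tp: int  # true motifs predicted as true
    fn: int  # true motifs predicted as false
    fp: int  # false motifs predicted as true
    tn: int  # false motifs predicted as false

    def false_negative_rate(self) -> float:
        return self.fn / (self.tp + self.fn)

    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)

    def error_rate(self) -> float:
        total = self.tp + self.fn + self.fp + self.tn
        return (self.fn + self.fp) / total


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline` (0.26 means 26% lower)."""
    return (baseline - new) / baseline


# Hypothetical counts for illustration only; NOT results from the paper.
baseline = Confusion(tp=820, fn=180, fp=220, tn=780)
proposed = Confusion(tp=847, fn=153, fp=143, tn=857)

for name in ("error_rate", "false_negative_rate", "false_positive_rate"):
    b = getattr(baseline, name)()
    p = getattr(proposed, name)()
    print(f"{name}: {b:.3f} -> {p:.3f} ({relative_reduction(b, p):.0%} lower)")
```

With these made-up counts, an error rate that falls from 0.200 to 0.148 is a 26% relative reduction even though the absolute change is only 5.2 percentage points, which is why the baseline rate matters when comparing such claims.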

AVAILABILITY

http://sfb.kaust.edu.sa/Pages/Software.aspx.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

Poly(A) motif prediction using spectral latent features from human DNA sequences.

Bioinformatics. 2013 Jul 1;29(13):i316-25. doi: 10.1093/bioinformatics/btt218.

An improved poly(A) motifs recognition method based on decision level fusion.

Comput Biol Chem. 2015 Feb;54:49-56. doi: 10.1016/j.compbiolchem.2014.12.001. Epub 2014 Dec 30.

Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences.

Bioinformatics. 2012 Jan 1;28(1):127-9. doi: 10.1093/bioinformatics/btr602. Epub 2011 Nov 15.

An in-silico method for prediction of polyadenylation signals in human sequences.

Genome Inform. 2003;14:84-93.

A novel genome-wide polyadenylation sites recognition system based on condition random field.

Annu Int Conf IEEE Eng Med Biol Soc. 2014;2014:4755-8. doi: 10.1109/EMBC.2014.6944687.

Discovering cis-regulatory RNAs in Shewanella genomes by Support Vector Machines.

PLoS Comput Biol. 2009 Apr;5(4):e1000338. doi: 10.1371/journal.pcbi.1000338. Epub 2009 Apr 3.

Metamotifs--a generative model for building families of nucleotide position weight matrices.

BMC Bioinformatics. 2010 Jun 25;11:348. doi: 10.1186/1471-2105-11-348.

Probabilistic models for semisupervised discriminative motif discovery in DNA sequences.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1309-17. doi: 10.1109/TCBB.2010.84.

Prediction of protein binding sites in protein structures using hidden Markov support vector machine.

BMC Bioinformatics. 2009 Nov 20;10:381. doi: 10.1186/1471-2105-10-381.

DISCOVER: a feature-based discriminative method for motif search in complex genomes.

Bioinformatics. 2009 Jun 15;25(12):i321-9. doi: 10.1093/bioinformatics/btp230.

Cited by

The Advances in Deep Learning Modeling of Polyadenylation Codes.

Wiley Interdiscip Rev RNA. 2025 May-Jun;16(3):e70017. doi: 10.1002/wrna.70017.

Accurate transcriptome-wide identification and quantification of alternative polyadenylation from RNA-seq data with APAIQ.

Genome Res. 2023 Apr;33(4):644-657. doi: 10.1101/gr.277177.122. Epub 2023 Apr 28.

Known sequence features explain half of all human gene ends.

NAR Genom Bioinform. 2023 Apr 5;5(2):lqad031. doi: 10.1093/nargab/lqad031. eCollection 2023 Jun.

Motif and conserved module analysis in DNA (promoters, enhancers) and RNA (lncRNA, mRNA) using AlModules.

Sci Rep. 2022 Oct 20;12(1):17588. doi: 10.1038/s41598-022-21732-0.

A Survey on Methods for Predicting Polyadenylation Sites from DNA Sequences, Bulk RNA-seq, and Single-cell RNA-seq.

Genomics Proteomics Bioinformatics. 2023 Feb;21(1):67-83. doi: 10.1016/j.gpb.2022.09.005. Epub 2022 Sep 24.

From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome.

Hum Genomics. 2022 Feb 18;16(1):7. doi: 10.1186/s40246-022-00376-1.

DeeReCT-APA: Prediction of Alternative Polyadenylation Site Usage Through Deep Learning.

Genomics Proteomics Bioinformatics. 2022 Jun;20(3):483-495. doi: 10.1016/j.gpb.2020.05.004. Epub 2021 Mar 2.

Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species.

PLoS Comput Biol. 2020 Nov 5;16(11):e1008297. doi: 10.1371/journal.pcbi.1008297. eCollection 2020 Nov.

Splice2Deep: An ensemble of deep convolutional neural networks for improved splice site prediction in genomic DNA.

Gene X. 2020 May 13;5:100035. doi: 10.1016/j.gene.2020.100035. eCollection 2020 Dec.

C-RNNCrispr: Prediction of CRISPR/Cas9 sgRNA activity using convolutional and recurrent neural networks.

Comput Struct Biotechnol J. 2020 Feb 12;18:344-354. doi: 10.1016/j.csbj.2020.01.013. eCollection 2020.

References

Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences.

Bioinformatics. 2013 Jun 1;29(11):1484. doi: 10.1093/bioinformatics/btt161. Epub 2013 Apr 15.

A complex immunodeficiency is based on U1 snRNP-mediated poly(A) site suppression.

EMBO J. 2012 Oct 17;31(20):4035-44. doi: 10.1038/emboj.2012.252. Epub 2012 Sep 11.

Ending the message: poly(A) signals then and now.

Genes Dev. 2011 Sep 1;25(17):1770-82. doi: 10.1101/gad.17268411.

Poly(ADP-ribose) regulates stress responses and microRNA activity in the cytoplasm.

Mol Cell. 2011 May 20;42(4):489-99. doi: 10.1016/j.molcel.2011.04.015.

Characterization and prediction of mRNA polyadenylation sites in human genes.

Med Biol Eng Comput. 2011 Apr;49(4):463-72. doi: 10.1007/s11517-011-0732-4. Epub 2011 Feb 1.

POLYAR, a new computer program for prediction of poly(A) sites in human sequences.

BMC Genomics. 2010 Nov 19;11:646. doi: 10.1186/1471-2164-11-646.

A classification-based prediction model of messenger RNA polyadenylation sites.

J Theor Biol. 2010 Aug 7;265(3):287-96. doi: 10.1016/j.jtbi.2010.05.015. Epub 2010 May 26.

Prediction of polyadenylation signals in human DNA sequences using nucleotide frequencies.

In Silico Biol. 2009;9(3):135-48.

POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors.

Bioinformatics. 2008 Jul 1;24(13):i6-14. doi: 10.1093/bioinformatics/btn170.

Accurate splice site prediction using support vector machines.

BMC Bioinformatics. 2007;8 Suppl 10(Suppl 10):S7. doi: 10.1186/1471-2105-8-S10-S7.
