College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA.
Bioinformatics. 2013 Jul 1;29(13):i316-25. doi: 10.1093/bioinformatics/btt218.
Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying the DNA sequence motifs that signal the addition of poly(A) tails is essential for improving genome annotation and for better understanding the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the nucleotide sequences surrounding candidate poly(A) motifs can, to a great extent, distinguish true motifs from false ones. A variety of sophisticated features have been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge.
We propose a novel machine-learning method for poly(A) motif prediction that combines generative learning (hidden Markov models) with discriminative learning (support vector machines). Generative learning provides a rich palette with which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be optimized directly. Here, we used hidden Markov models to fit the DNA sequence dynamics and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated the proposed method on a comprehensive human poly(A) dataset that consists of 14,740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (a random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26%, 15% and 35%, respectively. Our method also makes ~30% fewer erroneous predictions than the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, revealing a number of previously unreported characteristics of the regions surrounding true and false motifs.
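The abstract does not spell out the spectral algorithm, so the following is only a minimal sketch of how latent features might be extracted from an HMM without running EM. It assumes the standard observable-operator formulation of spectral HMM learning (Hsu, Kakade and Zhang), which may differ from the authors' own algorithm. The function names, the sliding-window moment estimates and the averaged-state feature vector are all illustrative assumptions, and the support vector machine stage is not shown.

```python
import numpy as np

ALPHABET = "ACGT"  # only these four bases are handled in this sketch
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def empirical_moments(sequences):
    """Estimate low-order moments from every trigram window (x1, x2, x3).

    P1[i]        ~ Pr[x1 = i]
    P21[i, j]    ~ Pr[x2 = i, x1 = j]
    P3x1[x, i, j] ~ Pr[x3 = i, x2 = x, x1 = j]
    Pooling sliding windows is an assumption (it treats the sequence
    statistics as stationary); it is not taken from the abstract.
    """
    n = len(ALPHABET)
    P1, P21, P3x1 = np.zeros(n), np.zeros((n, n)), np.zeros((n, n, n))
    for seq in sequences:
        ids = [INDEX[base] for base in seq]
        for x1, x2, x3 in zip(ids, ids[1:], ids[2:]):
            P1[x1] += 1
            P21[x2, x1] += 1
            P3x1[x2, x3, x1] += 1
    total = P1.sum()
    if total == 0:
        raise ValueError("need at least one sequence of length >= 3")
    return P1 / total, P21 / total, P3x1 / total


def spectral_hmm(P1, P21, P3x1, m):
    """Build the observable representation of an m-state HMM."""
    U = np.linalg.svd(P21)[0][:, :m]         # top-m left singular vectors
    b1 = U.T @ P1                            # initial state vector
    binf = np.linalg.pinv(P21.T @ U) @ P1    # normalisation vector
    right = np.linalg.pinv(U.T @ P21)
    # One m x m observable operator per symbol.
    B = np.stack([U.T @ P3x1[x] @ right for x in range(len(P1))])
    return b1, binf, B


def joint_probability(seq, model):
    """Pr[x1..xt] under the model: binf^T B_xt ... B_x1 b1."""
    b1, binf, B = model
    b = b1
    for base in seq:
        b = B[INDEX[base]] @ b
    return float(binf @ b)


def latent_features(seq, model):
    """Average of the normalised internal states along a sequence.

    Averaging is a hypothetical way to obtain a fixed-length vector
    that a downstream classifier could consume.
    """
    b1, binf, B = model
    b, states = b1, []
    for base in seq:
        b = B[INDEX[base]] @ b
        b = b / (binf @ b)                   # renormalise the state
        states.append(b)
    return np.mean(states, axis=0)
```

Under these assumptions, one would fit a model with `spectral_hmm(*empirical_moments(training_sequences), m)`, compute `latent_features` for the flanking sequence of each candidate motif, and hand the resulting vectors to a support vector machine. With moments estimated from limited data, the spectral estimates can be noisy and the implied probabilities are not guaranteed to be non-negative.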
Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx
Supplementary data are available at Bioinformatics online.