SVM2Motif--Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor.

Authors

Vidovic Marina M-C, Görnitz Nico, Müller Klaus-Robert, Rätsch Gunnar, Kloft Marius

Affiliations

Machine Learning Group, Technical University of Berlin, Berlin, Germany.

Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea.

Publication information

PLoS One. 2015 Dec 21;10(12):e0144782. doi: 10.1371/journal.pone.0144782. eCollection 2015.

DOI:10.1371/journal.pone.0144782

PMID:26690911

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4686957/

Abstract

Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks, but because of their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) make it possible to visualize the significance of position-specific subsequences. Although POIMs are a major step towards explaining trained SVM models, their size grows exponentially with the motif length, which makes manual inspection feasible only for comparatively small motif sizes, typically k ≤ 5. In this work, we extend POIMs by presenting a new machine-learning methodology, called motifPOIM, that extracts the truly relevant motifs underlying the predictions of a trained SVM model, regardless of their length and complexity. Our framework treats the motifs as free parameters in a probabilistic model, and fitting them can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address with an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
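
To make the exponential growth concrete: a POIM of degree k has one row for every possible k-mer over the four-letter DNA alphabet, which gives 4^k rows. The short Python sketch below is an illustration only and is not taken from the paper. The function name `poim_shape` and the sequence length of 100 are hypothetical, and the column count assumes one column per start position at which a k-mer fits entirely inside the sequence.

```python
DNA_ALPHABET_SIZE = 4  # A, C, G, T


def poim_shape(k, seq_len):
    """Return (rows, cols) of a degree-k POIM for sequences of length seq_len.

    Rows index all 4**k possible k-mers. Columns index start positions;
    counting only positions where the k-mer fits fully is an assumption
    made for this sketch.
    """
    if k < 1 or k > seq_len:
        raise ValueError("k must satisfy 1 <= k <= seq_len")
    return DNA_ALPHABET_SIZE ** k, seq_len - k + 1


if __name__ == "__main__":
    seq_len = 100  # hypothetical sequence length, chosen for illustration
    for k in (1, 3, 5, 8, 12):
        rows, cols = poim_shape(k, seq_len)
        print(f"k={k:2d}: {rows} rows x {cols} columns = {rows * cols} entries")
```

At k = 5 the matrix already has 4^5 = 1024 rows, and each further nucleotide multiplies the row count by four, which is consistent with the abstract's remark that manual inspection is practical only up to roughly k ≤ 5.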
