Sonnenburg Sören, Zien Alexander, Philips Petra, Rätsch Gunnar
Fraunhofer Institute FIRST, Department IDA, Kekuléstr. 7, 12489 Berlin, Germany.
Bioinformatics. 2008 Jul 1;24(13):i6-14. doi: 10.1093/bioinformatics/btn170.
At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a significant shortcoming of SVMs is that their learned decision rules are very hard for humans to understand and cannot easily be related to biological facts.
To make SVM-based sequence classifiers more accessible and useful, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant to the biological phenomena under investigation.
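To illustrate why overlaps between k-mers matter, the following is a minimal sketch and not the efficient algorithm proposed in the paper. It assumes, purely for illustration, that the importance of a k-mer at a position is the expected classifier score given that the k-mer occurs there, minus the overall expected score, with sequences drawn uniformly at random. The linear position-specific scoring function, the toy weights, and the names `score` and `importance` are all hypothetical. Expectations are computed by brute-force enumeration, which is only feasible for very short sequences.

```python
from itertools import product

ALPHABET = "ACGT"


def score(seq, weights, k):
    """Toy linear score: sum of the weights of the k-mer found at each position.

    `weights` maps (kmer, position) -> weight; missing entries count as 0.
    """
    return sum(weights.get((seq[i:i + k], i), 0.0)
               for i in range(len(seq) - k + 1))


def importance(weights, k, length, kmer, pos):
    """Assumed importance: E[s(X) | X has `kmer` at `pos`] - E[s(X)].

    X ranges over all sequences of the given length, each equally likely.
    """
    total, count = 0.0, 0            # all sequences
    cond_total, cond_count = 0.0, 0  # sequences containing kmer at pos
    for chars in product(ALPHABET, repeat=length):
        seq = "".join(chars)
        s = score(seq, weights, k)
        total += s
        count += 1
        if seq.startswith(kmer, pos):
            cond_total += s
            cond_count += 1
    return cond_total / cond_count - total / count


# Hypothetical weights: only two overlapping 2-mers carry raw weight.
weights = {("AC", 0): 1.0, ("CG", 1): 1.0}

for kmer, pos in [("AC", 0), ("TC", 0), ("AA", 0)]:
    raw = weights.get((kmer, pos), 0.0)
    imp = importance(weights, k=2, length=4, kmer=kmer, pos=pos)
    print(f"{kmer}@{pos}: raw weight {raw:+.3f}, importance {imp:+.3f}")
```

In this toy setting, "TC" at position 0 has zero raw weight but positive importance, because it is compatible with the weighted "CG" at position 1, whereas "AA" at position 0 rules that k-mer out and receives negative importance. This is the kind of overlap-induced correlation that raw feature weights do not reveal.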
All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.
Supplementary data are available at Bioinformatics online.