POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors.

Authors

Sonnenburg Sören, Zien Alexander, Philips Petra, Rätsch Gunnar

Affiliation

Fraunhofer Institute FIRST, Department IDA, Kekulèstr. 7, 12489 Berlin, Germany.

Publication information

Bioinformatics. 2008 Jul 1;24(13):i6-14. doi: 10.1093/bioinformatics/btn170.

DOI:10.1093/bioinformatics/btn170

PMID:18586746

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2718648/

Abstract

MOTIVATION

At the heart of many important bioinformatics problems, such as gene finding and function prediction, is the classification of biological sequences. Frequently the most accurate classifiers are obtained by training support vector machines (SVMs) with complex sequence kernels. However, a cumbersome shortcoming of SVMs is that their learned decision rules are very hard to understand for humans and cannot easily be related to biological facts.

RESULTS

To make SVM-based sequence classifiers more accessible and profitable, we introduce the concept of positional oligomer importance matrices (POIMs) and propose an efficient algorithm for their computation. In contrast to the raw SVM feature weighting, POIMs take into account the underlying correlation structure of k-mer features induced by overlaps of related k-mers. POIMs can be seen as a powerful generalization of sequence logos: they make it possible to capture and visualize sequence patterns that are relevant to the biological phenomena under investigation.
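
As an illustration only, the idea of accounting for overlaps can be sketched with a brute-force toy. The abstract does not state the formula, so this sketch assumes that the importance of an oligomer z at position j is the expected classifier score over sequences containing z at j, minus the overall expected score, under a uniform background over A, C, G, T. The names `score`, `poim_entry`, and `toy_weights` are hypothetical, and exhaustive enumeration stands in for the efficient algorithm the authors propose.

```python
from itertools import product

ALPHABET = "ACGT"

def score(seq, weights, k):
    """Toy linear scorer: sum the weights of the positional k-mers found in seq."""
    return sum(weights.get((seq[i:i + k], i), 0.0)
               for i in range(len(seq) - k + 1))

def poim_entry(weights, seq_len, k, z, j):
    """Assumed importance of oligomer z at position j (uniform background).

    Computed by brute force as mean score of sequences with z at j
    minus mean score of all sequences of length seq_len.
    """
    all_seqs = ["".join(p) for p in product(ALPHABET, repeat=seq_len)]
    with_z = [s for s in all_seqs if s[j:j + len(z)] == z]
    mean_all = sum(score(s, weights, k) for s in all_seqs) / len(all_seqs)
    mean_z = sum(score(s, weights, k) for s in with_z) / len(with_z)
    return mean_z - mean_all

# Hypothetical classifier: a single weighted feature, the 2-mer "AC" at position 1.
toy_weights = {("AC", 1): 1.0}

full = poim_entry(toy_weights, 4, 2, "AC", 1)     # the weighted 2-mer itself
overlap = poim_entry(toy_weights, 4, 2, "A", 1)   # 1-mer overlapping that 2-mer
unrelated = poim_entry(toy_weights, 4, 2, "A", 0) # 1-mer outside the 2-mer
```

In this toy, the raw weight table contains no 1-mer entries at all, yet `overlap` is positive because "A" at position 1 makes the weighted 2-mer "AC" more likely, while `unrelated` is zero. That is the kind of overlap-induced dependence the paragraph above describes.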

AVAILABILITY

All source code, datasets, tables and figures are available at http://www.fml.tuebingen.mpg.de/raetsch/projects/POIM.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
