MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics.

Author information

Computational Biology Unit, Department of Informatics, University of Bergen, P.O. Box 7803, 5020, Bergen, Norway.

Department of Biology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany.

Publication information

BMC Bioinformatics. 2021 May 7;22(1):234. doi: 10.1186/s12859-021-04143-2.

PMID:33962556

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8105988/

Abstract

BACKGROUND

Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. CREs include promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and are then combined to predict whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA that other researchers could readily use, and we tested SVM-MOCCA only with IUPAC motifs and PREs.
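The two-level idea described above (score each motif occurrence from its local sequence context, then combine the scores over a region) can be illustrated with a minimal Python sketch. This is not the MOCCA implementation: the function names, the forward-strand-only scan, the `flank` parameter and the plain sum used to combine scores are all assumptions made for illustration, and `scorer` stands in for a trained per-occurrence classifier.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regular expression."""
    return "".join(IUPAC[ch] for ch in motif.upper())

def find_occurrences(sequence, motif):
    """Return start positions of all matches, including overlapping ones.

    Only the forward strand is scanned (a simplifying assumption).
    """
    # A lookahead makes every match zero-width, so overlaps are reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]

def local_window(sequence, start, motif_len, flank):
    """Return the occurrence plus up to `flank` bases on each side."""
    lo = max(0, start - flank)
    hi = min(len(sequence), start + motif_len + flank)
    return sequence[lo:hi]

def score_region(sequence, motif, flank, scorer):
    """Sum a per-occurrence score over all occurrences in a region.

    `scorer` is a hypothetical stand-in for a trained occurrence-level
    model; summation is an assumed combination rule.
    """
    return sum(
        scorer(local_window(sequence, start, len(motif), flank))
        for start in find_occurrences(sequence, motif)
    )
```

For example, `score_region(seq, "GCNAT", 10, scorer)` would score every `GCNAT` occurrence in `seq` from a window extending 10 bases to either side and return the total.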

RESULTS

Here we present Motif Occurrence Combinatorics Classification Algorithms (MOCCA), an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics. MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first to apply the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and BEs beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results.
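For readers unfamiliar with the 4-spectrum baseline mentioned above, a k-spectrum representation describes a sequence by the counts of all overlapping k-mers (here k = 4, giving 4^4 = 256 features). The sketch below is a generic illustration in Python, not code from MOCCA; the function name and the choice to skip windows containing symbols other than A, C, G and T are assumptions.

```python
from itertools import product

def kmer_spectrum(sequence, k=4):
    """Return overlapping k-mer counts in lexicographic k-mer order.

    Windows containing symbols other than A, C, G or T (for example N)
    are skipped, which is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    return [counts[w] for w in kmers]
```

The resulting fixed-length vector can be passed to any standard classifier, such as an SVM or a Random Forest.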

CONCLUSION

MOCCA is a flexible and powerful suite of tools for modelling CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problem for which motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, as well as a mode that requires the user to specify only positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
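To make the distinction between IUPAC and PWM motifs concrete: an IUPAC motif either matches a site or does not, whereas a PWM assigns each site a score that is compared against a threshold. The following Python sketch shows generic PWM scanning under stated assumptions; it is not MOCCA's implementation or file format. The PWM is assumed to be a list of per-position dictionaries of log-odds scores, and the function name and `threshold` parameter are hypothetical.

```python
def pwm_scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least `threshold`.

    `pwm` is a list with one dict per motif position, mapping each of
    A, C, G and T to a log-odds score (an assumed representation).
    Only the forward strand is scanned.
    """
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Skip windows with ambiguous or unknown symbols.
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((i, score))
    return hits
```

With a made-up two-column matrix that favours `G` then `A`, `pwm_scan("GAGAT", toy_pwm, 3)` would report the two `GA` sites and ignore the other windows.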
