Computational Biology Unit, Department of Informatics, University of Bergen, P.O. Box 7803, 5020, Bergen, Norway.
Department of Biology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany.
BMC Bioinformatics. 2021 May 7;22(1):234. doi: 10.1186/s12859-021-04143-2.
Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which would be useful to other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs.
We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first to apply the hierarchical MOCCA approach to the prediction of BEs. Both models generalize significantly better to PREs and BEs than previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results.
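For readers unfamiliar with the 4-spectrum baseline named above, the following is a minimal sketch of a k-spectrum feature vector: the count of every possible length-k word over A, C, G and T in a sequence. It is an illustration only and is not taken from MOCCA; the function name `kmer_spectrum` is hypothetical, and details such as normalization or reverse-strand handling in the actual implementations may differ.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=4):
    """Count each possible k-mer over ACGT in seq (hypothetical helper).

    Returns a list of 4**k counts, ordered lexicographically by k-mer.
    Windows containing other characters (e.g. N) are simply not counted.
    """
    seq = seq.upper()
    # Slide a window of width k over the sequence and tally each word.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k words in a fixed order so vectors are comparable.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] for kmer in all_kmers]


features = kmer_spectrum("ACGTACGT", k=4)
print(len(features), sum(features))
```

A vector like this, computed per sequence, is the kind of input a Support Vector Machine or Random Forest classifier would be trained on in a spectrum-based baseline.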
MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences. MOCCA can be applied to any new CRE modelling problem for which motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires the user to specify only positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.
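As an illustration of what locating occurrences of an IUPAC motif involves, the sketch below expands the standard IUPAC nucleotide codes into a regular expression and reports every match position, including overlapping ones. It is not MOCCA's implementation: the names `iupac_to_regex` and `find_occurrences` are hypothetical, the example motif is arbitrary, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regex (hypothetical helper).

    The pattern is wrapped in a lookahead so overlapping matches are found.
    """
    body = "".join(IUPAC[c] for c in motif.upper())
    return re.compile("(?=(" + body + "))")


def find_occurrences(seq, motif):
    """Return 0-based start positions of the motif on the forward strand."""
    pattern = iupac_to_regex(motif)
    return [m.start() for m in pattern.finditer(seq.upper())]


print(find_occurrences("TTCGAGCGAA", "YGAGYG"))
```

Occurrence positions of this kind are the raw material for motif occurrence frequency features, and for methods that model the sequence around each individual occurrence.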