Computational Biology Unit, Department of Informatics, University of Bergen, Bergen, Norway.
Department of Biology, Humboldt-Universität zu Berlin, Berlin, Germany.
PLoS One. 2022 Sep 9;17(9):e0274338. doi: 10.1371/journal.pone.0274338. eCollection 2022.
Gene expression is regulated through cis-regulatory elements (CREs), among which are promoters, enhancers, Polycomb/Trithorax Response Elements (PREs), silencers and insulators. Computational prediction of CREs can be achieved using a variety of statistical and machine learning methods combined with different feature space formulations. Although Python packages for DNA sequence feature sets and for machine learning are available, no existing package facilitates the combination of DNA sequence feature sets with machine learning methods for the genome-wide prediction of candidate CREs. Here we present Gnocis, a Python package that streamlines the analysis and modelling of CRE sequences by providing extensible APIs and implementing the glue required for combining feature sets and models for genome-wide prediction. Gnocis implements a variety of base feature sets, including motif pair occurrence frequencies and the k-spectrum mismatch kernel. It integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. Gnocis additionally implements a broad suite of tools for the handling and preparation of sequence, region and curve data, which can be useful for general DNA bioinformatics in Python. We also present Deep-MOCCA, a neural network architecture inspired by SVM-MOCCA that achieves moderate to high generalization without prior motif knowledge. To demonstrate the use of Gnocis, we applied multiple machine learning methods, including a Convolutional Neural Network (CNN), to the modelling of D. melanogaster PREs, making this the first study to model PREs with CNNs. The models are readily adapted to new CRE modelling problems and to other organisms. To produce a high-performance, compiled package for Python 3, we implemented Gnocis in Cython. Gnocis can be installed from PyPI using pip, by running 'pip install gnocis'. The source code is available on GitHub, at https://github.com/bjornbredesen/gnocis.
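To make the feature-set idea concrete, the following is a minimal plain-Python sketch of a k-spectrum feature map with up to one mismatch, computed over sliding windows of a sequence. It does not use the Gnocis API: every function name here (kmer_spectrum, mismatch_spectrum, feature_vector, sliding_windows) is hypothetical and chosen for illustration, and the single-mismatch limit and the normalisation by window k-mer count are assumptions rather than details taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_spectrum(seq, k):
    """Count every length-k substring made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts

def mismatch_neighbours(kmer):
    """Yield the k-mer itself and every k-mer at Hamming distance 1."""
    yield kmer
    for i, original in enumerate(kmer):
        for c in ALPHABET:
            if c != original:
                yield kmer[:i] + c + kmer[i + 1:]

def mismatch_spectrum(seq, k):
    """Each observed k-mer credits itself and its one-mismatch neighbours."""
    counts = Counter()
    for kmer, n in kmer_spectrum(seq, k).items():
        for neighbour in mismatch_neighbours(kmer):
            counts[neighbour] += n
    return counts

def feature_vector(seq, k):
    """Fixed-length vector (4**k entries), scaled by the number of k-mer positions."""
    spectrum = mismatch_spectrum(seq, k)
    total = max(len(seq) - k + 1, 1)
    return [spectrum["".join(p)] / total for p in product(ALPHABET, repeat=k)]

def sliding_windows(seq, size, step):
    """Yield (start, window) pairs covering the sequence, as in a genome-wide scan."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]

if __name__ == "__main__":
    sequence = "ACGTACGTTTGAGAGAGCGC"  # toy sequence, not real data
    for start, window in sliding_windows(sequence, size=10, step=5):
        vector = feature_vector(window, k=2)
        print(start, window, len(vector))
```

Vectors of this kind could be passed to any classifier that accepts fixed-length numeric input, with the per-window scores then forming a curve along the genome.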