Faculty of Science, National Centre for Biomolecular Research, Masaryk University, Brno, Czechia.
Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.
BMC Genomics. 2022 Mar 31;23(1):248. doi: 10.1186/s12864-022-08414-x.
The recent big data revolution in Genomics, coupled with the emergence of Deep Learning as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field.
Here we present ENNGene, an Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, so that simple input such as genomic coordinates is sufficient. The network architecture is selected and fully customized by the user, from the number and types of layers to each layer's precise set-up. ENNGene then handles all steps of training and evaluating the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots, as well as TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of the attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art and improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein.
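To illustrate what a per-position attribution means, the following minimal sketch applies Integrated Gradients to a one-hot encoded sequence. It is not ENNGene's code: the toy logistic model, the all-zero baseline, the midpoint Riemann approximation, and every name in it (`one_hot`, `model`, `model_grad`, `integrated_gradients`) are assumptions chosen for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases such as N stay all-zero."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x


def model(x, w, b):
    """Hypothetical stand-in for a trained network: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + np.exp(-(np.sum(w * x) + b)))


def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return p * (1.0 - p) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum along the
    straight path from the baseline to the input."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += model_grad(baseline + a * (x - baseline), w, b)
    avg_grad /= steps
    return (x - baseline) * avg_grad


rng = np.random.default_rng(0)
seq = "ACGTTGCA"
x = one_hot(seq)
w = rng.normal(size=x.shape)   # random weights standing in for a trained model
b = 0.1
baseline = np.zeros_like(x)    # assumed baseline: an "empty" sequence

attr = integrated_gradients(x, baseline, w, b)
per_position = attr.sum(axis=1)  # one attribution value per sequence position

# Completeness check: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(attr.sum() - (model(x, w, b) - model(baseline, w, b)))
```

The `per_position` vector is the kind of quantity that can be plotted along the sequence; `completeness_gap` should be close to zero, which is a quick sanity check that the path integral has been approximated with enough steps.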
As the role of Deep Learning (DL) in big data analysis in the near future is indisputable, it is important to make it available to a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL, gaining better insight into the large amounts of data available in the field and extracting important information from them.