Zrimec Jan, Buric Filip, Kokina Mariia, Garcia Victor, Zelezniak Aleksej
Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden.
Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark.
Front Mol Biosci. 2021 Jun 10;8:673363. doi: 10.3389/fmolb.2021.673363. eCollection 2021.
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, and mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them improves our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, focusing first on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.