Brief Funct Genomics. 2022 Sep 16;21(5):357-375. doi: 10.1093/bfgp/elac009.
Transcription factors are key cellular components of gene expression control. Transcription factor binding sites are the locations where transcription factors specifically recognize DNA sequences, targeting gene-specific regions and recruiting transcription factors or chromatin regulators to fine-tune spatiotemporal gene regulation. As common proteins, transcription factors play an important role in many biological processes. With the rapid growth in the number of known protein sequences, effective prediction of protein structure and function has become an urgent problem. Current protein-DNA-binding site prediction methods are based on either traditional machine learning algorithms or deep learning algorithms. Early methods relied on traditional machine learning; in recent years, deep learning methods that predict protein-DNA-binding sites from sequence data have achieved remarkable success. Various statistical and machine learning methods for predicting the function of DNA-binding proteins have been proposed and continuously improved. Existing deep learning methods for predicting protein-DNA-binding sites fall roughly into three categories: convolutional neural networks (CNNs), recurrent neural networks (RNNs) and hybrid CNN-RNN networks. This review provides an overview of the computational and experimental methods currently applied to protein-DNA-binding site prediction. It introduces traditional machine learning and deep learning methods in terms of how existing learning frameworks process data and how the basic model frameworks differ. Existing methods in this field are relatively simple compared with those in natural language processing, computer vision, computer graphics and other fields. A summary of existing protein-DNA-binding site prediction methods will therefore help researchers better understand this field.
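To make concrete what predicting binding sites "from sequence data" with a CNN typically involves, the following is a minimal sketch, not a method taken from this review. It assumes the common practice of one-hot encoding DNA and sliding a filter along the sequence, as a first convolutional layer would. The function names, the "TATA" motif and the example sequence are hypothetical and chosen only for illustration; a real model would learn its filter weights from data.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def scan(encoded, kernel):
    """Slide a (width, 4) kernel along the sequence.

    This is a 'valid' cross-correlation, the operation a 1D
    convolutional layer applies with a single filter.
    """
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array(
        [np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)]
    )


# Hypothetical hand-written filter that rewards the made-up motif "TATA".
# In a trained CNN these weights would be learned, not set by hand.
motif_kernel = one_hot("TATA")

scores = scan(one_hot("GGCTATAAGC"), motif_kernel)

# Global max pooling: keep the strongest match and where it occurred.
best_position = int(np.argmax(scores))
best_score = float(scores.max())
```

Here `best_position` is the window start with the highest filter response. An RNN or a hybrid CNN-RNN model would consume the same one-hot input, or the per-position scores, as an ordered sequence instead of pooling it to a single value.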