Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Plus Program), Korea Advanced Institute of Science and Technology, 34141 Daejeon, Republic of Korea.
Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology, 34141 Daejeon, Republic of Korea.
Proc Natl Acad Sci U S A. 2021 Jan 12;118(2). doi: 10.1073/pnas.2021171118.
A transcription factor (TF) is a sequence-specific DNA-binding protein that modulates the transcription of a particular set of genes and thus regulates gene expression in the cell. TFs have commonly been predicted by analyzing sequence homology with the DNA-binding domains of already characterized TFs. As a result, TFs that show no homology with reported ones are difficult to predict. Here we report the development of a deep learning-based tool, DeepTFactor, that predicts whether a given protein is a TF. DeepTFactor uses a convolutional neural network to extract features of a protein. It showed high performance in predicting TFs of both eukaryotic and prokaryotic origins, with F1 scores of 0.8154 and 0.8000, respectively. Analysis of the gradients of the prediction score with respect to the input suggested that DeepTFactor detects DNA-binding domains and other latent features for TF prediction. DeepTFactor predicted 332 candidate TFs in Escherichia coli K-12 MG1655. Among them, 84 candidate TFs belong to the y-ome, which is a collection of genes that lack experimental evidence of function. We experimentally validated the DeepTFactor predictions by further characterizing the genome-wide binding sites of three predicted TFs: YqhC, YiaU, and YahB. Furthermore, we made available the list of 4,674,808 TFs predicted from 73,873,012 protein sequences in 48,346 genomes. DeepTFactor will serve as a useful tool for predicting TFs, which is necessary for understanding the regulatory systems of organisms of interest. We provide DeepTFactor as a stand-alone program, available at https://bitbucket.org/kaistsystemsbiology/deeptfactor.
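For readers unfamiliar with the metric reported above, the F1 score is the harmonic mean of precision and recall for the positive class (here, "is a TF"). The following is a minimal sketch of that calculation only. The function name `f1_score` and the toy labels are illustrative assumptions, not data or code from DeepTFactor, and the abstract does not state the decision threshold used to turn prediction scores into labels.

```python
def f1_score(y_true, y_pred):
    """Compute the F1 score for binary labels (1 = TF, 0 = non-TF).

    Illustrative helper, not part of DeepTFactor.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # Precision or recall is zero (or undefined); report F1 as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 4 true TFs, of which 3 are recovered,
# plus 1 non-TF wrongly predicted as a TF.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
```

In this toy case precision and recall are both 3/4, so the F1 score is 0.75. Because F1 ignores true negatives, it suits tasks such as TF prediction, where non-TF proteins typically far outnumber TFs.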