Cheng Huiling, Liu Lifen, Zhou Yuying, Deng Kaixuan, Ge Yuanxin, Hu Xuehai
College of Informatics, Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, Hubei, China.
Front Plant Sci. 2023 May 9;14:1175837. doi: 10.3389/fpls.2023.1175837. eCollection 2023.
Promoter tiling deletion via genome editing is an emerging approach that is gaining popularity in plants. Identifying the precise positions of core motifs within plant gene promoters is in great demand, yet these positions remain largely unknown. We previously developed TSPTFBS, a set of 265 transcription factor binding site (TFBS) prediction models, but it cannot meet this demand for identifying core motifs.
Here, we added 104 maize and 20 rice TFBS datasets and used DenseNet to build models on a large-scale dataset covering a total of 389 plant TFs. More importantly, we combined three biological interpretability methods (DeepLIFT, tiling deletion, and mutagenesis) to identify the potential core motifs of any given genomic region.
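The tiling deletion and mutagenesis ideas named above can be illustrated with a minimal Python sketch. It is not the TSPTFBS 2.0 implementation: `toy_score` is a hypothetical stand-in for a trained TF-binding model, the G-box motif `CACGTG` and the example sequence are invented for illustration, and masking a window with `N` instead of physically removing it is an assumption made here to keep the sequence length fixed.

```python
from typing import Callable, List, Tuple

MOTIF = "CACGTG"  # hypothetical core motif, used only by the toy scorer below


def toy_score(seq: str) -> float:
    """Stand-in for a trained model: best fractional match to MOTIF over all windows."""
    k = len(MOTIF)
    best = 0
    for i in range(len(seq) - k + 1):
        matches = sum(1 for a, b in zip(seq[i:i + k], MOTIF) if a == b)
        best = max(best, matches)
    return best / k


def tiling_deletion(seq: str, score: Callable[[str], float],
                    window: int = 4, step: int = 2) -> List[Tuple[int, float]]:
    """Mask each tile with N and record (tile start, drop in score vs. the intact sequence)."""
    ref = score(seq)
    drops = []
    for start in range(0, len(seq) - window + 1, step):
        masked = seq[:start] + "N" * window + seq[start + window:]
        drops.append((start, ref - score(masked)))
    return drops


def saturation_mutagenesis(seq: str, score: Callable[[str], float]) -> List[float]:
    """For each position, report the largest score drop caused by any single-base substitution."""
    ref = score(seq)
    effects = []
    for i, base in enumerate(seq):
        worst = 0.0
        for alt in "ACGT":
            if alt == base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            worst = max(worst, ref - score(mutated))
        effects.append(worst)
    return effects


# Invented example: the motif occupies positions 8..13 (0-based) of a poly-T background.
EXAMPLE_SEQ = "TTTTTTTT" + MOTIF + "TTTTTTTT"

if __name__ == "__main__":
    for start, drop in tiling_deletion(EXAMPLE_SEQ, toy_score):
        print(f"tile at {start}: score drop {drop:.2f}")
    effects = saturation_mutagenesis(EXAMPLE_SEQ, toy_score)
    print("sensitive positions:", [i for i, e in enumerate(effects) if e > 0])
```

In this toy setting, tiles and positions that overlap the planted motif are the only ones whose perturbation lowers the score, which is the signal used to nominate a candidate core motif. With a real model, `toy_score` would be replaced by a call that returns the predicted binding probability for the sequence.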
DenseNet not only achieved higher predictive performance than baseline methods such as LS-GKM and MEME for the above 389 TFs from Arabidopsis, maize, and rice, but also performed better in trans-species prediction for a total of 15 TFs from six other plant species. Motif analysis based on TF-MoDISco and global importance analysis (GIA) further reveals the biological implications of the core motifs identified by the three interpretability methods. Finally, we developed the TSPTFBS 2.0 pipeline, which integrates the 389 DenseNet-based models of TF binding with the above three interpretability methods.
TSPTFBS 2.0 is implemented as a user-friendly web server (http://www.hzau-hulab.com/TSPTFBS/) that can provide important references for choosing editing targets in any given plant promoter, and it has great potential to provide reliable editing targets for genetic screen experiments in plants.