Yang Tzu-Hsien, Yang Ya-Chiao, Tu Kai-Chi
Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
Comput Struct Biotechnol J. 2021 Dec 18;20:296-308. doi: 10.1016/j.csbj.2021.12.015. eCollection 2022.
Transcription regulation in metazoa is controlled by the binding of transcription factors (TFs) or regulatory proteins to specific modular DNA regulatory sequences called cis-regulatory modules (CRMs). Understanding the distribution of CRMs on a genomic scale is essential for constructing the metazoan transcriptional regulatory networks that help diagnose genetic disorders. While traditional reporter-assay CRM identification approaches can provide an in-depth understanding of the functions of some CRMs, these methods are usually costly and low-throughput. It is generally believed that reliable CRM predictions can be made by integrating diverse genomic data. Hence, researchers often resort first to computational algorithms for genome-wide CRM screening before carrying out specific experiments. However, existing methods for searching for potential CRMs are limited by low sensitivity, poor prediction accuracy, or long computation times caused by the combinatorial complexity of TF binding site (TFBS) composition. To overcome these obstacles, we designed a novel CRM identification pipeline called regCNN that considers the base-by-base local patterns in TF binding motifs and epigenetic profiles. On the test set, regCNN achieves an accuracy/auROC of 84.5%/92.5% in CRM identification. By further considering local patterns in epigenetic profiles and TF binding motifs, it achieves a 4.7% (92.5% - 87.8%) improvement in auROC over a pure multi-layer perceptron model based on average values. We also demonstrated that regCNN outperforms all currently available tools by at least 11.3% in auROC. Finally, regCNN is verified to be robust to its resizing window hyperparameter, which is used to handle the variable lengths of CRMs. The regCNN model can be downloaded at http://cobisHSS0.im.nuk.edu.tw/regCNN/.
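The contrast drawn above between base-by-base local patterns and average values can be illustrated with a minimal sketch. It assumes, purely for illustration, that a variable-length per-base signal is resized to a fixed window by linear interpolation; the abstract does not describe regCNN's actual resizing procedure, and the names resize_profile, mean_profile, and window are hypothetical.

```python
from typing import List, Sequence


def resize_profile(signal: Sequence[float], window: int) -> List[float]:
    """Resample a per-base signal onto `window` evenly spaced positions.

    Assumption: linear interpolation stands in for whatever resizing
    step regCNN really uses; this is not the authors' implementation.
    """
    n = len(signal)
    if n == 0 or window < 2:
        raise ValueError("need a non-empty signal and window >= 2")
    if n == 1:
        return [float(signal[0])] * window
    step = (n - 1) / (window - 1)  # spacing of the new positions in base units
    resized = []
    for i in range(window):
        pos = i * step
        left = min(int(pos), n - 2)  # index of the left neighbouring base
        frac = pos - left            # distance from the left neighbour
        resized.append(signal[left] * (1 - frac) + signal[left + 1] * frac)
    return resized


def mean_profile(signal: Sequence[float]) -> float:
    """Collapse a signal to its average, as an average-value feature would."""
    return sum(signal) / len(signal)


# Two hypothetical epigenetic signals over a 5-bp region with equal means.
peaked = [0, 0, 5, 0, 0]
flat = [1, 1, 1, 1, 1]

print(mean_profile(peaked), mean_profile(flat))  # the averages are identical
print(resize_profile(peaked, 3))                 # the peak shape is retained
print(resize_profile(flat, 3))                   # the flat shape is retained
```

In this toy case the two signals cannot be told apart by their averages, whereas the resized profiles still differ, which is the kind of local information a convolutional layer can pick up from fixed-length inputs.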