Lu Ruipeng, Rogan Peter K
Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada.
Biochemistry, University of Western Ontario, London, Ontario, N6A 5C1, Canada.
F1000Res. 2018 Dec 14;7:1933. doi: 10.12688/f1000research.17363.2. eCollection 2018.
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets using Machine Learning (ML).
Methods: Bray-Curtis Similarity was used to identify genes with correlated expression patterns across 53 tissues. TF targets from knockdown experiments were also analyzed by this approach to set up the ML framework. TFBSs were selected within DNase I-accessible intervals of corresponding promoter sequences using information theory-based position weight matrices (iPWMs) for each TF. Features from information-dense clusters of TFBSs were input to ML classifiers that predict these gene targets; the accuracy, specificity and sensitivity of each classifier were assessed. Mutations in TFBSs were analyzed to examine their impact on TFBS clustering and predict changes in gene regulation.
Results: The glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, was selected to test this approach. Two genes exhibited the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the best performance in detecting such genes, based on the area under the Receiver Operating Characteristic (ROC) curve. TF target gene prediction was confirmed using siRNA knockdown, which was more accurate than CRISPR/Cas9 inactivation. TFBS mutation analyses revealed that accurate target gene prediction required at least 1 information-dense TFBS cluster.
Conclusions: ML based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
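
The Bray-Curtis Similarity step named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition, 1 minus the Bray-Curtis dissimilarity, applied to non-negative expression vectors; the abstract does not state the exact formulation or normalization the authors used. The function names, gene labels, and expression values below are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Sequence, Tuple


def bray_curtis_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 1 - Bray-Curtis dissimilarity for two non-negative vectors.

    Assumption: the standard definition sum(|x_i - y_i|) / sum(x_i + y_i)
    is used for the dissimilarity; the source abstract gives no formula.
    """
    if len(x) != len(y):
        raise ValueError("expression vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        # Convention chosen here (an assumption): two all-zero profiles
        # are treated as identical.
        return 1.0
    diff = sum(abs(a - b) for a, b in zip(x, y))
    return 1.0 - diff / total


def rank_by_similarity(
    target: Sequence[float], profiles: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Rank genes by similarity of their expression profile to the target."""
    scores = [
        (gene, bray_curtis_similarity(target, expr))
        for gene, expr in profiles.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical expression values across 4 tissues (the study used 53).
    target_profile = [10.0, 20.0, 5.0, 8.0]
    candidates = {
        "GENE_A": [9.0, 21.0, 6.0, 7.0],   # hypothetical gene label
        "GENE_B": [1.0, 2.0, 40.0, 30.0],  # hypothetical gene label
    }
    for gene, score in rank_by_similarity(target_profile, candidates):
        print(f"{gene}\t{score:.3f}")
```

With non-negative inputs the score lies between 0 (no shared expression) and 1 (identical profiles), so genes can be ranked directly against a target profile such as that of NR3C1.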