Computational Biology Research Center, Advanced Industrial Science and Technology, Tokyo, Japan.
PLoS One. 2010 Aug 27;5(8):e11881. doi: 10.1371/journal.pone.0011881.
Identifying true transcription factor binding sites from sequence motif information (e.g., motif pattern, location, and combination) is an important problem in bioinformatics. We present PeakRegressor, a system that identifies binding motifs by combining DNA sequence data with ChIP-Seq data. PeakRegressor uses L1-norm log-linear regression to predict peak values from binding motif candidates. Our approach predicts the peak values of STAT1 and RNA Polymerase II with correlation coefficients as high as 0.65 and 0.66, respectively. Using PeakRegressor, we identified composite motifs for STAT1, as well as potential regulatory SNPs (rSNPs) involved in regulating the transcription levels of neighboring genes. In addition, we show that, among five regression methods, L1-norm log-linear regression achieves the best performance with respect to binding motif identification, biological interpretability, and computational efficiency.
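To make the regression step concrete, the following is a minimal sketch of L1-penalized regression on log-transformed peak values, fitted by coordinate descent. It is not the PeakRegressor implementation: it assumes that "log-linear" means a linear model for the logarithm of the peak value, that each candidate motif is encoded as a per-peak occurrence count, and that the penalty weight is fixed by hand. The function names, the penalty value, and the data (which are synthetic) are all hypothetical.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_log_linear(features, peaks, penalty=0.05, n_sweeps=100):
    """Minimize (1/2n) * sum((log y_i - b - x_i . w)^2) + penalty * sum(|w_j|)."""
    n, p = len(features), len(features[0])
    log_peaks = [math.log(y) for y in peaks]
    target_mean = sum(log_peaks) / n
    col_means = [sum(row[j] for row in features) / n for j in range(p)]
    # Center the features so the unpenalized intercept can be recovered at the end.
    centered = [[row[j] - col_means[j] for j in range(p)] for row in features]
    col_var = [sum(row[j] ** 2 for row in centered) / n for j in range(p)]
    weights = [0.0] * p
    residuals = [t - target_mean for t in log_peaks]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_var[j] == 0.0:
                continue  # a constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(row[j] * r for row, r in zip(centered, residuals)) / n
            rho += col_var[j] * weights[j]
            new_weight = soft_threshold(rho, penalty) / col_var[j]
            delta = new_weight - weights[j]
            if delta != 0.0:
                residuals = [r - row[j] * delta for row, r in zip(centered, residuals)]
                weights[j] = new_weight
    intercept = target_mean - sum(w * m for w, m in zip(weights, col_means))
    return intercept, weights


def predict_peaks(features, intercept, weights):
    """Map motif counts back to the original (non-log) peak scale."""
    return [math.exp(intercept + sum(x * w for x, w in zip(row, weights)))
            for row in features]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data (an assumption for illustration): 200 peaks, 6 candidate motifs,
# of which only the first two actually influence the peak value.
random.seed(0)
true_weights = [0.8, -0.5, 0.0, 0.0, 0.0, 0.0]
motif_counts = [[random.randint(0, 3) for _ in true_weights] for _ in range(200)]
peak_values = [
    math.exp(1.0 + sum(x * w for x, w in zip(row, true_weights)) + random.gauss(0, 0.1))
    for row in motif_counts
]

intercept, weights = fit_l1_log_linear(motif_counts, peak_values)
predicted = predict_peaks(motif_counts, intercept, weights)
correlation = pearson(predicted, peak_values)
```

In this toy setting the L1 penalty keeps the weights of uninformative motif candidates at or near zero, so the motifs with nonzero weights can be read off directly. That sparsity is the kind of property that makes an L1-penalized model convenient for motif identification and interpretation.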