Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA.
Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, USA.
Bioinformatics. 2019 Oct 15;35(20):3944-3952. doi: 10.1093/bioinformatics/btz198.
We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test statistic or P-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing.
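The idea of smoothing scores over sequence similarity, with a permutation-based threshold, can be illustrated with a small sketch. This is not the SArKS implementation: it assumes, for illustration only, that similarity is approximated by sorting all suffixes lexicographically (so that positions sharing a prefix become neighbours), that the kernel is a uniform window, and that the null distribution comes from shuffling per-sequence scores. The function names, toy sequences, and parameters (`half_width`, `n_perm`) are hypothetical.

```python
import random


def suffix_order(seqs):
    """Return (sequence index, offset) pairs sorted by the suffix starting there."""
    positions = [(i, j) for i, s in enumerate(seqs) for j in range(len(s))]
    return sorted(positions, key=lambda p: seqs[p[0]][p[1]:])


def smooth(order, scores, half_width):
    """Uniform-kernel average of per-sequence scores over neighbouring sorted suffixes."""
    vals = [scores[i] for i, _ in order]
    n = len(vals)
    out = []
    for r in range(n):
        lo, hi = max(0, r - half_width), min(n, r + half_width + 1)
        out.append(sum(vals[lo:hi]) / (hi - lo))  # window is truncated at the ends
    return out


def permutation_threshold(order, scores, half_width, n_perm=200, seed=0):
    """Largest smoothed value seen when scores are shuffled among sequences."""
    rng = random.Random(seed)
    shuffled = list(scores)
    best = float("-inf")
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        best = max(best, max(smooth(order, shuffled, half_width)))
    return best


if __name__ == "__main__":
    # Toy data (hypothetical): two high-scoring sequences share "CATACT".
    seqs = ["GGCATACTGG", "TTCATACTAA", "ACGTACGTAC", "TGTGTGTGCA"]
    scores = [1.0, 1.0, 0.0, 0.0]
    order = suffix_order(seqs)
    smoothed = smooth(order, scores, half_width=1)
    cutoff = permutation_threshold(order, scores, half_width=1, n_perm=50)
    for (i, j), value in zip(order, smoothed):
        flag = "*" if value >= cutoff else " "
        print(f"{flag} {value:.2f} seq{i}@{j} {seqs[i][j:j + 6]}")
```

On data this small the permutation cutoff is not informative; the sketch only shows how the sorted order, the smoothing window, and the shuffled null fit together.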
We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power.
SArKS is available at https://github.com/denniscwylie/sarks.
Supplementary data are available at Bioinformatics online.