Huang Haiyan, Kao Ming-Chih J, Zhou Xianghong, Liu Jun S, Wong Wing H
Department of Biostatistics, Harvard University, 655 Huntington Avenue, Boston, MA 02115, USA.
J Comput Biol. 2004;11(1):1-14. doi: 10.1089/106652704773416858.
High-level eukaryotic genomes present a particular challenge to the computational identification of transcription factor binding sites (TFBSs) because of their long noncoding regions and large numbers of repeat elements. This is evidenced by the noisy results generated by most current methods. In this paper, we present a p-value-based scoring scheme that uses probability generating functions to evaluate the statistical significance of potential TFBSs. Furthermore, we introduce the local genomic context into the model, so that candidate sites are evaluated both on their similarity to known binding sites and on their contrast with their respective local genomic contexts. We demonstrate that our approach is advantageous in the prediction of myogenin and MEF2 binding sites in the human genome. We also applied LMM to large-scale human binding site sequences in situ and found that, compared with current popular methods, LMM analysis can reduce false positive errors by more than 50% without compromising sensitivity. This improvement will be of importance to any subsequent algorithm that aims to detect regulatory modules based on known PSSMs.
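The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an independent (i.i.d.) base background rather than a Markov model, integer-valued PSSM scores, and a simple flanking window as the "local genomic context". The names `score_distribution`, `p_value`, `local_background`, and the toy matrix `TOY_PSSM` are hypothetical and introduced only for illustration. Multiplying the per-position generating functions, each of the form sum over bases of p(b) * x^score(b), gives the exact distribution of the total score, from which a p-value follows directly.

```python
from collections import defaultdict

def score_distribution(pssm, background):
    """Exact distribution of the total score of a random site.

    pssm: list of dicts mapping base -> integer score, one dict per position.
    background: dict mapping base -> probability (i.i.d. assumption).
    Each loop iteration multiplies in one per-position generating function,
    which is the same as convolving the score distributions.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for base, p in background.items():
                nxt[total + column[base]] += prob * p
        dist = dict(nxt)
    return dist

def p_value(dist, observed):
    """Probability that a random site scores at least `observed`."""
    return sum(p for score, p in dist.items() if score >= observed)

def local_background(sequence, start, width, flank, pseudocount=1.0):
    """Base frequencies in the flanks around a candidate site.

    The candidate itself (sequence[start:start + width]) is excluded, and a
    pseudocount keeps every base probability strictly positive.
    """
    left = sequence[max(0, start - flank):start]
    right = sequence[start + width:start + width + flank]
    counts = {base: pseudocount for base in "ACGT"}
    for ch in left + right:
        if ch in counts:  # skip N and other ambiguity codes
            counts[ch] += 1
    total = sum(counts.values())
    return {base: c / total for base, c in counts.items()}

# Toy two-position matrix with made-up integer scores (illustration only).
TOY_PSSM = [
    {"A": 2, "C": 0, "G": 0, "T": -1},
    {"A": 0, "C": 0, "G": 3, "T": -1},
]

if __name__ == "__main__":
    uniform = {base: 0.25 for base in "ACGT"}
    dist = score_distribution(TOY_PSSM, uniform)
    # Only "AG" reaches the top score of 5, so this is 0.25 * 0.25.
    print(p_value(dist, 5))  # 0.0625
```

To score a candidate against its own neighbourhood, one would pass the result of `local_background(sequence, start, width, flank)` as the `background` argument, so that the same PSSM score can receive different p-values in regions of different base composition.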