Zamdborg Leonid, Ma Ping
Department of Statistics, University of Illinois at Urbana-Champaign, Center for Biophysics and Computational Biology, Institute for Genomic Biology, IL, USA.
Nucleic Acids Res. 2009 Sep;37(16):5246-54. doi: 10.1093/nar/gkp554. Epub 2009 Jul 3.
Discovering which regulatory proteins, especially transcription factors (TFs), are active under certain experimental conditions and identifying the corresponding binding motifs is essential for understanding the regulatory circuits that control cellular programs. The experimental methods used for this purpose are laborious. Computational methods have proven extremely effective in identifying TF-binding motifs (TFBMs). In this article, we propose a novel computational method called MotifExpress for discovering active TFBMs. Existing methods either use only DNA sequence information or integrate sequence information with a single-sample measurement of gene expression; MotifExpress instead integrates DNA sequence information with gene expression measured in multiple samples. By selecting TFBMs that are significantly associated with gene expression, we can identify active TFBMs under specific experimental conditions and thus provide clues for the construction of regulatory networks. Compared with existing methods, MotifExpress substantially reduces the number of spurious results. Statistically, MotifExpress uses a penalized multivariate regression approach with a composite absolute penalty, which is highly stable and can effectively find the globally optimal set of active motifs. We demonstrate the excellent performance of MotifExpress by applying it to synthetic data and to real Saccharomyces cerevisiae data. MotifExpress is available at http://www.stat.illinois.edu/~pingma/MotifExpress.htm.
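The idea of selecting whole motifs with a grouped penalty in a multivariate regression can be illustrated with a small sketch. This is not the MotifExpress implementation: the abstract does not specify which member of the composite absolute penalty family is used, so the sketch assumes the simplest grouped case, an L1 sum of per-motif L2 norms (the group lasso), fitted by plain proximal gradient descent. The function names, the synthetic data, and the values of `lam` and `n_iter` are all hypothetical choices made for illustration.

```python
import math
import random


def group_soft_threshold(row, t):
    """Shrink a coefficient row; rows with L2 norm <= t become exactly zero."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm <= t:
        return [0.0] * len(row)
    scale = 1.0 - t / norm
    return [scale * v for v in row]


def fit_group_penalized(X, Y, lam, n_iter=300):
    """Proximal gradient for (1/2n)||Y - XB||_F^2 + lam * sum_j ||B[j]||_2.

    X: genes x motifs score matrix, Y: genes x samples expression matrix.
    Each row B[j] holds the effect of motif j in every sample, so zeroing
    a whole row drops that motif from all samples at once.
    """
    n, p, q = len(X), len(X[0]), len(Y[0])
    # trace(X^T X)/n bounds the largest eigenvalue of X^T X / n,
    # so its reciprocal is a safe (conservative) step size.
    step = n / sum(x * x for row in X for x in row)
    B = [[0.0] * q for _ in range(p)]
    for _ in range(n_iter):
        # Residual R = XB - Y, computed once per iteration.
        R = [[sum(X[i][j] * B[j][k] for j in range(p)) - Y[i][k]
              for k in range(q)] for i in range(n)]
        for j in range(p):
            grad = [sum(X[i][j] * R[i][k] for i in range(n)) / n
                    for k in range(q)]
            moved = [B[j][k] - step * grad[k] for k in range(q)]
            B[j] = group_soft_threshold(moved, step * lam)
    return B


# Synthetic example (assumed data): 8 candidate motifs, only 0 and 3 are active.
random.seed(0)
n_genes, n_motifs, n_samples = 100, 8, 4
X = [[random.gauss(0, 1) for _ in range(n_motifs)] for _ in range(n_genes)]
B_true = [[0.0] * n_samples for _ in range(n_motifs)]
B_true[0] = [1.5, -1.0, 2.0, 1.0]
B_true[3] = [-2.0, 1.0, 1.5, -1.0]
Y = [[sum(X[i][j] * B_true[j][k] for j in range(n_motifs)) + random.gauss(0, 0.1)
      for k in range(n_samples)] for i in range(n_genes)]

B_hat = fit_group_penalized(X, Y, lam=0.2)
# A motif counts as "active" if its coefficient row was not zeroed out.
active = [j for j, row in enumerate(B_hat) if any(v != 0.0 for v in row)]
print("motifs kept by the penalty:", active)
```

Grouping the coefficients by motif is what lets the multi-sample expression data vote jointly: a motif is kept or dropped across all samples together rather than sample by sample. In practice `lam` would be tuned, for example by cross-validation, rather than fixed as here.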