GrassRoots Biotechnology, Durham, North Carolina, United States of America.
PLoS One. 2012;7(7):e40373. doi: 10.1371/journal.pone.0040373. Epub 2012 Jul 5.
Transcription factors and the short, often degenerate DNA sequences they recognize are central regulators of gene expression, but their regulatory code is challenging to dissect experimentally. Thus, computational approaches have long been used to identify putative regulatory elements from the patterns in promoter sequences. Here we present a new algorithm "POWRS" (POsition-sensitive WoRd Set) for identifying regulatory sequence motifs, specifically developed to address two common shortcomings of existing algorithms. First, POWRS uses the position-specific enrichment of regulatory elements near transcription start sites to significantly increase sensitivity, while providing new information about the preferred localization of those elements. Second, POWRS forgoes position weight matrices for a discrete motif representation that appears more resistant to over-generalization. We apply this algorithm to discover sequences related to constitutive, high-level gene expression in the model plant Arabidopsis thaliana, and then experimentally validate the importance of those elements by systematically mutating two endogenous promoters and measuring the effect on gene expression levels. This provides a foundation for future efforts to rationally engineer gene expression in plants, a problem of great importance in developing biotech crop varieties.
BSD-licensed Python code is available at http://grassrootsbio.com/papers/powrs/.
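To make the two ideas in the abstract concrete (a discrete word-set motif and position-specific enrichment near the transcription start site), the following is a minimal illustrative sketch. It is not the POWRS code distributed at the URL above, and the abstract does not specify the scoring statistic POWRS uses; a one-sided hypergeometric test is substituted here as an assumption. The function names (`hits_in_window`, `hypergeom_tail`, `window_enrichment`), the toy sequences, and the window coordinates are all hypothetical.

```python
from math import comb  # requires Python 3.8+


def hits_in_window(promoter, words, start, end):
    """Return True if any word of the set lies entirely inside promoter[start:end]."""
    region = promoter[start:end]
    return any(word in region for word in words)


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N, K of which are hits."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def window_enrichment(foreground, background, words, start, end):
    """Count promoters with a hit in the window and test foreground enrichment.

    Assumption: a hypergeometric test stands in for whatever statistic
    POWRS actually uses, which the abstract does not describe.
    """
    fg_hits = sum(hits_in_window(p, words, start, end) for p in foreground)
    bg_hits = sum(hits_in_window(p, words, start, end) for p in background)
    total = len(foreground) + len(background)
    p_value = hypergeom_tail(fg_hits, len(foreground), fg_hits + bg_hits, total)
    return fg_hits, bg_hits, p_value


# Toy data (hypothetical): the motif is a discrete set of words, not a weight matrix.
words = {"TATAAA", "TATATA"}
foreground = ["AAAATATAAAGG", "CCCTATATAGGG", "GGGGTATAAACC"]
background = ["TATAAACCCCCC", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]

# Position-sensitive: only count hits inside the window [3, 10).
windowed = window_enrichment(foreground, background, words, 3, 10)
# Position-insensitive: count hits anywhere in the 12-base sequences.
whole = window_enrichment(foreground, background, words, 0, 12)
print(windowed, whole)
```

In this toy set, one background sequence carries a motif word outside the window, so restricting the count to the window removes that background hit and yields a smaller p-value than scanning whole sequences. This is only meant to show why positional information can increase sensitivity; the real algorithm also has to search over candidate word sets and window boundaries, which this sketch does not attempt.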