O'Connor Informatics Consulting LLC, Round Rock, TX 78681, USA.
Department of Genome Sciences, University of Washington, Seattle, WA 98195-5065.
Bioinformatics. 2020 Jun 1;36(12):3902-3904. doi: 10.1093/bioinformatics/btaa227.
Identifying the genes regulated by a given transcription factor (TF) (its 'target genes') is a key step in developing a comprehensive understanding of gene regulation. Previously, we developed a method (CisMapper) for predicting the target genes of a TF based solely on the correlation between a histone modification at the TF's binding site and the expression of the gene across a set of tissues or cell lines. That approach is limited to organisms for which extensive histone and expression data are available, and does not explicitly incorporate the genomic distance between the TF and the gene.
We present the T-Gene algorithm, which overcomes these limitations. It can be used to predict which genes are most likely to be regulated by a TF, and which of the TF's binding sites are most likely involved in regulating particular genes. T-Gene calculates a novel score that combines distance and histone/expression correlation, and we show that this score accurately predicts when a regulatory element bound by a TF is in contact with a gene's promoter, achieving median precision above 60%. T-Gene is easy to use via its web server or as a command-line tool, and can also make accurate predictions (median precision above 40%) based on distance alone when extensive histone/expression data are not available for the organism. T-Gene provides an estimate of the statistical significance of each of its predictions.
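To illustrate the general idea of scoring a link between a binding site and a gene from both distance and histone/expression correlation, a minimal sketch follows. It is not the T-Gene score, whose definition is not given in this abstract: the function names (`pearson`, `toy_link_score`), the exponential distance decay, the 100 kb `scale` parameter and the product used to combine the two terms are all assumptions made for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant signal
    return cov / math.sqrt(var_x * var_y)


def toy_link_score(site_pos, tss_pos, histone, expression, scale=100_000):
    """Hypothetical site-to-gene score (NOT the published T-Gene score).

    site_pos, tss_pos: genomic coordinates (bp) on the same chromosome.
    histone, expression: per-tissue histone signal at the binding site and
    per-tissue expression of the gene, in the same tissue order.
    scale: assumed decay length in bp for the distance term.
    """
    distance = abs(site_pos - tss_pos)
    proximity = math.exp(-distance / scale)  # 1.0 at distance 0, decays with distance
    correlation = abs(pearson(histone, expression))
    return proximity * correlation


# Toy data: one binding site, two candidate genes with identical
# correlation but different distances to the site.
histone = [1.0, 2.0, 3.0, 4.0]
expression = [2.0, 4.0, 6.0, 8.0]
near = toy_link_score(50_000, 60_000, histone, expression)
far = toy_link_score(50_000, 900_000, histone, expression)
```

With identical correlation, the nearer gene receives the higher toy score. Passing a constant `histone` or `expression` vector makes the correlation term zero, so a real distance-only mode would need a separate scoring path rather than this product.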
The T-Gene web server, source code, histone/expression data and genome annotation files are provided at http://meme-suite.org.
Supplementary data are available at Bioinformatics online.