Xie Yang, Pan Wei, Jeong Kyeong S, Khodursky Arkady
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA.
Stat Med. 2007 May 10;26(10):2258-75. doi: 10.1002/sim.2703.
Transcriptional control is a critical step in the regulation of gene expression. Understanding such control at the genomic level involves deciphering the mechanisms and structures of regulatory programmes and networks. A difficulty arises from the weak signal and high noise in the various sources of data, while most current approaches are limited to the analysis of a single source of data. A natural alternative is to improve statistical efficiency and power by a combined analysis of multiple sources of data. Here we propose a shrinkage method to combine genome-wide location data and gene expression data to detect the binding sites or target genes of a transcription factor. Specifically, a prior 'non-target' gene list is generated by analysing the expression data, and this information is then incorporated into the subsequent binding data analysis via a shrinkage method. There is a Bayesian justification for this shrinkage method. Both simulated and real data were used to evaluate the proposed method and to compare it with analysing binding data alone. In simulation studies, the proposed method gives higher sensitivity and a lower false discovery rate (FDR) in detecting the target genes. In the real data example, the proposed method can reduce the estimated FDR and increase the power to detect the previously known target genes of a broad transcription regulator, leucine responsive regulatory protein (Lrp), in Escherichia coli. This method can also be used to incorporate other information, such as gene ontology (GO), into microarray data analysis to detect differentially expressed genes.
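The idea of using a prior 'non-target' list to shrink binding evidence can be illustrated with a minimal sketch. The abstract does not give the actual test statistic, so everything here is an assumption made for illustration: the soft-thresholding rule, the t-like score, the data layout, the `amount` tuning parameter, and the function names `soft_threshold` and `shrunken_scores` are all hypothetical and should not be read as the authors' method.

```python
import math
import statistics


def soft_threshold(value, amount):
    """Shrink value toward zero by amount; values within amount of zero become 0."""
    return math.copysign(max(abs(value) - amount, 0.0), value)


def shrunken_scores(binding, non_targets, amount):
    """Return a t-like score per gene from replicate binding log-ratios.

    binding: dict mapping gene -> list of replicate log-ratios (assumed layout)
    non_targets: set of genes flagged as likely non-targets by expression data
    amount: how far to shrink the mean of flagged genes (assumed tuning parameter)
    """
    scores = {}
    for gene, reps in binding.items():
        mean = statistics.mean(reps)
        std_err = statistics.stdev(reps) / math.sqrt(len(reps))
        if gene in non_targets:
            # Only genes on the prior list have their mean pulled toward zero.
            mean = soft_threshold(mean, amount)
        scores[gene] = mean / std_err
    return scores


# Toy data, invented for illustration only.
binding = {
    "geneA": [1.9, 2.1, 2.0],  # strong signal, not on the prior list
    "geneB": [0.4, 0.6, 0.5],  # weak signal, on the prior non-target list
    "geneC": [0.4, 0.6, 0.5],  # same data as geneB, not on the prior list
}
scores = shrunken_scores(binding, non_targets={"geneB"}, amount=0.3)
```

In this toy setup, geneB and geneC have identical binding data, but geneB receives a smaller score because the expression-derived prior marks it as an unlikely target. Genes off the list are scored from the binding data alone.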