School of Computer Science, McGill University, Montreal, Quebec, Canada.
PLoS Comput Biol. 2010 Jul 8;6(7):e1000849. doi: 10.1371/journal.pcbi.1000849.
Allelic imbalance (AI) is a phenomenon in which the two alleles of a given gene are expressed at different levels in a given cell, either because of epigenetic inactivation of one of the two alleles or because of genetic variation in regulatory regions. Recently, Bing et al. described the use of genotyping arrays to assay AI at high resolution (approximately 750,000 SNPs across the autosomes). In this paper, we investigate computational approaches to analyze these data and identify genomic regions with AI in an unbiased and statistically robust manner. We propose two families of approaches: (i) a statistical approach based on z-score computations, and (ii) a family of machine learning approaches based on Hidden Markov Models. Each method is evaluated using previously published experimental data sets as well as permutation testing. When applied to whole-genome data from 53 HapMap samples, our approaches reveal that allelic imbalance is widespread (most expressed genes show evidence of AI in at least one of the 53 samples) and that most AI regions in a given individual are also found in at least a few other individuals. While many AI regions identified in the genome correspond to known protein-coding transcripts, others overlap with recently discovered long non-coding RNAs. We also observe that genomic regions with AI include not only complete transcripts with consistent differential expression levels, but also more complex patterns of allelic expression, such as alternative promoters and alternative 3' ends. The approaches developed not only shed light on the incidence and mechanisms of allelic expression, but will also help map the genetic causes of allelic expression and identify cases where this variation may be linked to disease.
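To make the z-score family of approaches more concrete, the following is a minimal illustrative sketch and not the method of the paper. It assumes that each SNP yields one numeric allelic-ratio value, that values are independent under the null hypothesis of balanced expression with a known mean and standard deviation, and that a region is scored by the mean over a fixed-size sliding window. The function names (`window_z_scores`, `permutation_threshold`), the window size, and the toy data are all hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean


def window_z_scores(values, null_mean, null_sd, window):
    """Return one z-score per sliding window of `window` consecutive SNPs.

    Assumption (not from the paper): under the null, per-SNP values are
    independent with mean `null_mean` and standard deviation `null_sd`,
    so a window mean has standard error null_sd / sqrt(window).
    """
    standard_error = null_sd / math.sqrt(window)
    scores = []
    for start in range(len(values) - window + 1):
        window_mean = mean(values[start:start + window])
        scores.append((window_mean - null_mean) / standard_error)
    return scores


def permutation_threshold(values, null_mean, null_sd, window,
                          n_perm=200, quantile=0.95, seed=0):
    """Estimate a significance threshold by shuffling SNP order.

    Each permutation destroys any spatial clustering of imbalanced SNPs;
    the largest |z| observed in each shuffled copy forms an empirical
    null distribution, and the requested quantile of it is returned.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        z = window_z_scores(shuffled, null_mean, null_sd, window)
        maxima.append(max(abs(s) for s in z))
    maxima.sort()
    return maxima[int(quantile * (n_perm - 1))]


# Hypothetical toy data: 50 SNPs, with SNPs 20..29 shifted away from balance.
values = [0.0] * 20 + [1.0] * 10 + [0.0] * 20
scores = window_z_scores(values, null_mean=0.0, null_sd=1.0, window=5)
threshold = permutation_threshold(values, null_mean=0.0, null_sd=1.0, window=5)
candidate_windows = [i for i, s in enumerate(scores) if abs(s) >= threshold]
```

In this sketch, windows whose absolute z-score reaches the permutation-derived threshold are reported as candidate AI regions. A Hidden Markov Model approach would instead model the sequence of SNPs with hidden balanced and imbalanced states, which avoids fixing a window size in advance.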