Crowley E M
Epidemiology Data Center, University of Pittsburgh, Pittsburgh, PA 15261, USA.
Biopolymers. 2001 Feb;58(2):165-74. doi: 10.1002/1097-0282(200102)58:2<165::AID-BIP50>3.0.CO;2-O.
A goal of the Human Genome Project is to determine the entire sequence of DNA (3 × 10^9 base pairs) found in human chromosomes. The massive amounts of data produced by this project require interpretation. A Bayesian model is developed for locating regulatory regions in a DNA sequence. Regulatory regions are areas of DNA to which specific proteins bind, controlling whether or not a gene is transcribed to produce templates for protein synthesis. Each human cell contains the same DNA sequence, so the particular function of a cell is determined by the genes that are transcribed in that cell. A hidden Markov chain is used to model whether each small interval of the DNA is in a regulatory region or not. This can be regarded as a changepoint problem in which the changepoints are the starts of regulatory or nonregulatory regions. The data consist of protein-binding elements, which are short subsequences, or "words," in the DNA sequence. Although these words can occur anywhere in the sequence, a larger number are expected in regulatory regions. Therefore, regulatory regions are detected by locating clusters of words. For a particular DNA sequence, the model automatically selects those words that best predict regions of interest. Markov chain Monte Carlo methods are used to explore the posterior distribution of the hidden Markov chain. The model is tested by means of simulations and applied to several DNA sequences.
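The idea of detecting regulatory regions as clusters of words can be illustrated with a minimal sketch. It is not the paper's model. The abstract does not specify the emission distribution or priors, so the sketch assumes Poisson word counts per fixed-width interval, a two-state chain with hand-picked parameters, and a fixed word list. It also swaps the Markov chain Monte Carlo exploration for the exact forward-backward recursion, which is only possible because all parameters are held fixed. The function names, rates, and transition probabilities are all hypothetical.

```python
import math


def count_words_per_interval(seq, words, width):
    """Count word occurrences (by start position) in consecutive intervals."""
    n_intervals = (len(seq) + width - 1) // width
    counts = [0] * n_intervals
    for w in words:
        if not w:
            continue  # an empty word would match everywhere
        start = seq.find(w)
        while start != -1:
            counts[start // width] += 1
            start = seq.find(w, start + 1)  # allow overlapping matches
    return counts


def poisson_pmf(k, rate):
    """Poisson probability of k events, computed via logs for stability."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def posterior_regulatory(counts, rate_bg=0.2, rate_reg=2.0,
                         p_stay=0.9, p_start_reg=0.5):
    """P(interval is regulatory | all counts) for a two-state hidden Markov chain.

    State 0 is nonregulatory, state 1 is regulatory. All parameters are
    assumed fixed; the paper instead places priors on them and samples.
    """
    n = len(counts)
    if n == 0:
        return []
    states = (0, 1)
    trans = ((p_stay, 1.0 - p_stay), (1.0 - p_stay, p_stay))
    init = (1.0 - p_start_reg, p_start_reg)
    emit = [[poisson_pmf(k, r) for r in (rate_bg, rate_reg)] for k in counts]

    # Forward pass, rescaled at each step to avoid underflow.
    prev = [init[s] * emit[0][s] for s in states]
    total = sum(prev)
    prev = [a / total for a in prev]
    alpha = [prev]
    for t in range(1, n):
        cur = [emit[t][s] * sum(prev[r] * trans[r][s] for r in states)
               for s in states]
        total = sum(cur)
        prev = [a / total for a in cur]
        alpha.append(prev)

    # Backward pass, also rescaled.
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(trans[s][r] * emit[t + 1][r] * beta[t + 1][r] for r in states)
             for s in states]
        total = sum(b)
        beta[t] = [x / total for x in b]

    # Combine and normalize to get the marginal posterior of state 1.
    post = []
    for t in range(n):
        g0 = alpha[t][0] * beta[t][0]
        g1 = alpha[t][1] * beta[t][1]
        post.append(g1 / (g0 + g1))
    return post
```

Given word counts such as `[0, 0, 0, 3, 4, 2, 3, 0, 0, 0]`, the posterior probability of the regulatory state is high for the middle intervals, where words cluster, and low at the ends. The changepoints correspond to the places where that probability crosses from low to high or back.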