Department of Electrical Engineering, University of Washington, Seattle, Washington, USA.
PLoS Comput Biol. 2010 Jul 8;6(7):e1000834. doi: 10.1371/journal.pcbi.1000834.
DNA in eukaryotes is packaged into a chromatin complex, the most basic element of which is the nucleosome. The precise positioning of the nucleosome cores allows for selective access to the DNA, and the mechanisms that control this positioning are important pieces of the gene expression puzzle. We describe a large-scale nucleosome pattern that jointly characterizes the nucleosome core and the adjacent linkers and is predominantly characterized by long-range oscillations in the mono-, di- and tri-nucleotide content of the DNA sequence, and we show that this pattern can be used to predict nucleosome positions in both Homo sapiens and Saccharomyces cerevisiae more accurately than previously published methods. Surprisingly, in both H. sapiens and S. cerevisiae, the most informative individual features are the mono-nucleotide patterns, although the inclusion of di- and tri-nucleotide features results in improved performance. Our approach combines a much longer pattern than has been previously used to predict nucleosome positioning from sequence (301 base pairs, centered at the position to be scored) with a novel discriminative classification approach that selectively weights the contributions from each of the input features. The resulting scores are relatively insensitive to local AT-content and can be used to accurately discriminate putative dyad positions from adjacent linker regions without requiring an additional dynamic programming step and without the attendant edge effects and assumptions about linker length modeling and overall nucleosome density. Our approach produces the best dyad-linker classification results published to date in H. sapiens, and outperforms two recently published models on a large set of S. cerevisiae nucleosome positions. Our results suggest that in both genomes, a comparable and relatively small fraction of nucleosomes are well-positioned and that these positions are predictable based on sequence alone.
We believe that the bulk of the remaining nucleosomes follow a statistical positioning model.
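
As a rough illustration of the input described above, the following Python sketch extracts the 301-base-pair window centered at a candidate position and summarizes its mono-, di- and tri-nucleotide content. It is a minimal sketch under stated assumptions, not the published feature set or classifier: the function names (`extract_window`, `kmer_frequencies`, `window_features`) are hypothetical, and using whole-window relative frequencies is a simplifying assumption, since the abstract does not specify how the features are encoded or weighted.

```python
from collections import Counter

WINDOW = 301          # window length in base pairs, as stated in the abstract
HALF = WINDOW // 2    # 150 bases on either side of the scored position


def extract_window(seq, center):
    """Return the 301-bp window centered at `center` (0-based).

    Returns None when the window would run past either end of `seq`.
    """
    start, end = center - HALF, center + HALF + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def kmer_frequencies(window, k):
    """Relative frequency of each overlapping k-mer in the window."""
    total = len(window) - k + 1
    counts = Counter(window[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def window_features(seq, center, ks=(1, 2, 3)):
    """Hypothetical feature summary: k-mer frequencies for k = 1, 2, 3.

    This whole-window summary is an illustrative assumption; it does not
    capture the positional oscillations the abstract describes.
    """
    window = extract_window(seq, center)
    if window is None:
        return None
    return {k: kmer_frequencies(window, k) for k in ks}
```

Calling `window_features(sequence, position)` for each candidate position would yield one feature dictionary per scored base; a discriminative classifier of the reader's choice could then be trained on such features, although the specific classifier used in the paper is not reproduced here.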