Multifactor dimensionality reduction analysis identifies specific nucleotide patterns promoting genetic polymorphisms.

Affiliations

Department of Pharmacology and Toxicology, Dartmouth Medical School, Hanover, NH, USA.

Computational Genetics Laboratory, Department of Genetics, Norris-Cotton Cancer Center, Dartmouth Medical School, Lebanon, NH, USA.

Publication information

BioData Min. 2009 Mar 30;2(1):2. doi: 10.1186/1756-0381-2-2.

DOI:10.1186/1756-0381-2-2

PMID:19331672

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2669078/

Abstract

BACKGROUND

The fidelity of DNA replication underlies both genetic evolution and the genomic instability that fosters disease. Single nucleotide polymorphisms (SNPs) constitute more than 80% of the genetic variation between individuals. A new theory of DNA replication fidelity has emerged in which selectivity is governed by base-pair geometry, through interactions between the selected nucleotide, the complementary strand, and the polymerase active site. We hypothesize that specific nucleotide combinations in the flanking regions of SNP fragments are associated with mutation.

RESULTS

We modeled the relationship between DNA sequence and observed polymorphisms using the novel multifactor dimensionality reduction (MDR) approach, which was originally developed to detect synergistic interactions between multiple SNPs that are predictive of disease susceptibility. As a pilot test of the hypothesis that flanking region patterns are associated with mutagenesis, we first assembled data from the Broad Institute (n = 2194). We then confirmed and expanded the inquiry using human SNPs within coding regions, together with their flanking sequences, collected from the National Center for Biotechnology Information (NCBI) database (n = 29967), and a control set of coding-region sequences not associated with SNP sites, randomly selected from the same database (n = 29967). In the Broad dataset we discovered seven flanking region pattern associations that reached a minimum significance level of p ≤ 0.05. In the larger NCBI dataset, significant models (p << 0.001) were detected for each SNP type examined. Importantly, the flanking region models were elongated or truncated depending on the nucleotide change. Additionally, nucleotide distributions at motif sites differed significantly according to the type of variation observed. The MDR approach effectively discerned specific sites within the flanking regions of observed SNPs and the nucleotide identities at those sites, supporting the collective contribution of these sites to SNP genesis.
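
To make the MDR idea concrete, the following is a minimal Python sketch of its core pooling step, not the implementation used in the study. It assumes SNP-flanking sequences ("cases") and control sequences are aligned strings of equal length, treats each flanking position as a factor, labels each nucleotide combination high-risk when its case:control ratio exceeds the overall ratio, and scores a set of positions by balanced accuracy. The function names, the toy sequences, and the omission of cross-validation and permutation testing are all simplifying assumptions.

```python
from collections import Counter
from itertools import combinations


def mdr_score(cases, controls, positions):
    """Balanced accuracy of the high/low-risk rule for one set of positions."""
    # Count how often each nucleotide combination occurs in each group.
    case_cells = Counter(tuple(s[p] for p in positions) for s in cases)
    ctrl_cells = Counter(tuple(s[p] for p in positions) for s in controls)
    threshold = len(cases) / len(controls)  # overall case:control ratio

    true_pos = true_neg = 0
    for cell in set(case_cells) | set(ctrl_cells):
        n_case, n_ctrl = case_cells[cell], ctrl_cells[cell]
        if n_case > threshold * n_ctrl:
            true_pos += n_case   # cell pooled as high-risk: its cases are hits
        else:
            true_neg += n_ctrl   # cell pooled as low-risk: its controls are hits

    sensitivity = true_pos / len(cases)
    specificity = true_neg / len(controls)
    return (sensitivity + specificity) / 2


def best_model(cases, controls, order):
    """Exhaustively score every combination of `order` positions."""
    n_positions = len(cases[0])
    scored = (
        (combo, mdr_score(cases, controls, combo))
        for combo in combinations(range(n_positions), order)
    )
    return max(scored, key=lambda item: item[1])


# Toy data (hypothetical): positions 0 and 3 matter only in combination.
toy_cases = ["ACGA", "AGCA", "TCGT", "TGCT"]
toy_controls = ["ACGT", "AGCT", "TCGA", "TGCA"]

print(mdr_score(toy_cases, toy_controls, (0,)))  # single position: 0.5
print(best_model(toy_cases, toy_controls, 2))    # ((0, 3), 1.0)
```

In this toy set no single position separates the two groups, while the pair of positions 0 and 3 separates them perfectly, which is the kind of nonlinear, multi-site pattern the method is designed to surface. A real analysis would estimate accuracy on held-out folds and assess significance by permutation.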

CONCLUSION

The present study represents the first use of this computational methodology for modeling nonlinear patterns in molecular genetics. MDR was able to identify distinct nucleotide patterning around sites of mutations dependent upon the observed nucleotide change. We discovered one flanking region set that included five nucleotides clustered around a specific type of SNP site. Based on the strongly associated patterns identified in this study, it may become possible to scan genomic databases for such clustering of nucleotides in order to predict likely sites of future SNPs, and even the type of polymorphism most likely to occur.
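
As an illustration of the proposed database scan, the sketch below searches a sequence for positions whose neighbours satisfy a flanking-region pattern. The motif shown is purely hypothetical and is not one of the patterns reported in the study; it is expressed as a mapping from offsets (relative to the candidate site) to allowed nucleotides, and the function name is likewise an assumption.

```python
def scan_for_motif(sequence, motif):
    """Return indices whose flanking bases satisfy every offset constraint.

    `motif` maps an offset relative to the candidate site (negative = 5' side,
    positive = 3' side) to a string of allowed nucleotides at that offset.
    """
    left = -min(min(motif), 0)   # bases needed upstream of the site
    right = max(max(motif), 0)   # bases needed downstream of the site
    hits = []
    for site in range(left, len(sequence) - right):
        if all(sequence[site + off] in allowed for off, allowed in motif.items()):
            hits.append(site)
    return hits


# Hypothetical pattern: G at -2, C or T at -1, G at +1.
HYPOTHETICAL_MOTIF = {-2: "G", -1: "CT", 1: "G"}

print(scan_for_motif("AGCAGTTGACGG", HYPOTHETICAL_MOTIF))  # [3, 6]
```

Sites returned by such a scan would only be candidates; whether they correspond to likely future SNPs depends on the strength of the underlying association.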
