Identifying disease-associated SNP clusters via contiguous outlier detection.

Affiliation

Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China.

Publication information

Bioinformatics. 2011 Sep 15;27(18):2578-85. doi: 10.1093/bioinformatics/btr424. Epub 2011 Jul 22.

DOI:10.1093/bioinformatics/btr424

PMID:21784794

Abstract

MOTIVATION

Although genome-wide association studies (GWAS) have identified many disease-susceptibility single-nucleotide polymorphisms (SNPs), these findings explain only a small portion of the genetic contribution to complex diseases, a gap known as the missing heritability. A possible explanation is that genetic variants with small effects have not been detected. The chance that a causal SNP will be directly genotyped is < 8%. The effects of its neighboring SNPs may be too weak to be detected because of the effect decay caused by imperfect linkage disequilibrium. Moreover, detecting a causal SNP with a small effect remains challenging even when it has been directly genotyped.

RESULTS

In order to increase the statistical power when detecting disease-associated SNPs with relatively small effects, we propose a method using neighborhood information. Since the disease-associated SNPs account for only a small fraction of the entire SNP set, we formulate this problem as Contiguous Outlier DEtection (CODE), which is a discrete optimization problem. In our formulation, we cast the disease-associated SNPs as outliers and further impose a spatial continuity constraint for outlier detection. We show that this optimization can be solved exactly using graph cuts. We also employ the stability selection strategy to control the false positive results caused by imperfect parameter tuning. We demonstrate its advantage in simulations and real experiments. In particular, the newly identified SNP clusters are replicable in two independent datasets.
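The idea of treating associated SNPs as outliers under a spatial continuity constraint can be illustrated with a small sketch. The energy below is an assumption made for illustration, not the formulation from the paper: each SNP i gets a binary label x_i, labeling it as an outlier costs (beta - score_i), and each label change between neighboring SNPs costs lam. The names `contiguous_outliers`, `scores`, `beta`, and `lam` are hypothetical. The paper solves its optimization with graph cuts; this sketch swaps in dynamic programming, which also gives an exact minimizer when the SNPs form a simple chain.

```python
from typing import List, Sequence


def contiguous_outliers(scores: Sequence[float], beta: float, lam: float) -> List[int]:
    """Exactly minimize an assumed chain energy over binary labels x_i:

        sum_i (beta - scores[i]) * x_i  +  lam * sum_i |x_i - x_{i+1}|

    scores: per-SNP association scores (e.g. -log10 p-values), in genomic order.
    beta:   assumed per-SNP threshold; scores above it favor the outlier label.
    lam:    assumed smoothness weight; larger values favor contiguous clusters.
    """
    n = len(scores)
    if n == 0:
        return []

    def unary(i: int, label: int) -> float:
        # Cost of giving SNP i the label; label 0 (background) is free.
        return (beta - scores[i]) if label == 1 else 0.0

    # cost[l] = lowest energy of any labeling of SNPs 0..i that ends in label l
    cost = [unary(0, 0), unary(0, 1)]
    back = []  # back[i-1][l] = best label of SNP i-1 given SNP i has label l
    for i in range(1, n):
        new_cost = [0.0, 0.0]
        ptr = [0, 0]
        for label in (0, 1):
            stay = cost[label]
            switch = cost[1 - label] + lam  # pay lam for a label change
            if stay <= switch:
                best, ptr[label] = stay, label
            else:
                best, ptr[label] = switch, 1 - label
            new_cost[label] = best + unary(i, label)
        cost = new_cost
        back.append(ptr)

    # Trace the optimal labeling back from the last SNP.
    label = 0 if cost[0] <= cost[1] else 1
    labels = [label]
    for ptr in reversed(back):
        label = ptr[label]
        labels.append(label)
    labels.reverse()
    return labels


# Toy scores: SNP 3 is below the threshold but sits between two strong SNPs.
toy_scores = [0.1, 0.2, 3.0, 1.2, 3.1, 0.3, 0.1]
independent = contiguous_outliers(toy_scores, beta=2.0, lam=0.0)
smoothed = contiguous_outliers(toy_scores, beta=2.0, lam=0.5)
```

With `lam=0.0` each SNP is thresholded independently, so the weak middle SNP is missed; with `lam=0.5` the continuity term pulls it into the cluster formed by its neighbors. Stability selection, as mentioned in the abstract, would rerun such a procedure on random subsamples of the data and keep only SNPs selected consistently, which is not shown here.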

AVAILABILITY

The software is available at: http://bioinformatics.ust.hk/CODE.zip.

CONTACT

eeyu@ust.hk

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
