Llinares-López Felipe, Grimm Dominik G, Bodenham Dean A, Gieraths Udo, Sugiyama Mahito, Rowan Beth, Borgwardt Karsten
Machine Learning and Computational Biology Lab, Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland; The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan; JST, PRESTO, Japan; and Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany.
Bioinformatics. 2015 Jun 15;31(12):i240-9. doi: 10.1093/bioinformatics/btv263.
Genetic heterogeneity, the phenomenon in which several distinct sequence variants give rise to the same phenotype, is of great interest in the analysis of complex phenotypes. Current approaches for finding regions in the genome that exhibit genetic heterogeneity suffer from at least one of two shortcomings: (i) they require an exact interval in the genome to be specified in advance and tested for genetic heterogeneity, potentially missing intervals of high relevance, or (ii) they face an enormous multiple hypothesis testing problem because of the large number of candidate intervals being tested, which results in either many false positives or a lack of power to detect true intervals.
Here, we present an approach that overcomes both problems: it allows one to automatically find all contiguous sequences of single nucleotide polymorphisms in the genome that are jointly associated with the phenotype. It also solves both the inherent computational efficiency problem and the statistical problem of multiple hypothesis testing, which are both caused by the huge number of candidate intervals. We demonstrate on Arabidopsis thaliana genome-wide association study data that our approach can discover regions that exhibit genetic heterogeneity and would be missed by single-locus mapping.
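To make the problem setting concrete, the following is a minimal brute-force sketch and not the authors' algorithm. It assumes details the abstract does not state: SNPs are binary-encoded (1 = minor allele present), the phenotype is binary, an interval is summarised per sample as the logical OR of its SNPs, and association is scored with Fisher's exact test. It applies a plain Bonferroni correction over all n(n+1)/2 candidate intervals, which is exactly the naive strategy whose cost and loss of power motivate the approach described above. The names `fisher_exact_two_sided` and `scan_intervals` are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n1, n2 = a + b, c + d          # row totals (carriers, non-carriers)
    k = a + c                      # first column total (cases)
    denom = comb(n1 + n2, k)

    def prob(x):
        # Hypergeometric probability of x carriers among the k cases.
        return comb(n1, x) * comb(n2, k - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, k - n2), min(k, n1)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def scan_intervals(genotypes, phenotype, alpha=0.05):
    """Test every contiguous SNP interval (assumed OR encoding, Bonferroni).

    genotypes: list of samples, each a list of 0/1 values, one per SNP.
    phenotype: list of 0/1 labels, one per sample.
    Returns a list of (start, end, p_value) for significant intervals.
    """
    n_snps = len(genotypes[0])
    n_tests = n_snps * (n_snps + 1) // 2   # number of candidate intervals
    threshold = alpha / n_tests            # naive Bonferroni threshold
    hits = []
    for start in range(n_snps):
        carrier = [0] * len(genotypes)
        for end in range(start, n_snps):
            # Extend the interval by one SNP: OR it into the carrier vector.
            carrier = [c | row[end] for c, row in zip(carrier, genotypes)]
            a = sum(1 for c, y in zip(carrier, phenotype) if c and y)
            b = sum(1 for c, y in zip(carrier, phenotype) if c and not y)
            c_ = sum(1 for c, y in zip(carrier, phenotype) if not c and y)
            d = sum(1 for c, y in zip(carrier, phenotype) if not c and not y)
            p = fisher_exact_two_sided(a, b, c_, d)
            if p <= threshold:
                hits.append((start, end, p))
    return hits


# Toy data (fabricated for illustration only): 10 cases, 10 controls.
# Half of the cases carry SNP 0, the other half carry SNP 1; SNP 2 is noise.
genotypes = ([[1, 0, 0]] * 5 + [[0, 1, 0]] * 5 +
             [[0, 0, 1]] * 5 + [[0, 0, 0]] * 5)
phenotype = [1] * 10 + [0] * 10
hits = scan_intervals(genotypes, phenotype)
print(hits)
```

In this toy example neither SNP 0 nor SNP 1 passes the corrected threshold on its own, while the interval covering both does, which is the kind of signal single-locus mapping would miss. The quadratic number of intervals makes this exhaustive scan impractical at genome scale.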
Our novel approach can contribute to the genome-wide discovery of intervals that are involved in the genetic heterogeneity underlying complex phenotypes.
The code can be obtained at: http://www.bsse.ethz.ch/mlcb/research/bioinformatics-and-computational-biology/sis.html.