Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC.
PLoS One. 2013 Aug 7;8(8):e71114. doi: 10.1371/journal.pone.0071114. Print 2013.
Advances in next-generation sequencing technologies have enabled the identification of multiple rare single nucleotide polymorphisms involved in diseases or traits. Several strategies for identifying rare variants that contribute to disease susceptibility have recently been proposed. An important feature of many of these statistical methods is the pooling or collapsing of multiple rare single nucleotide variants to achieve a reasonably high frequency and effect. However, if the pooled rare variants are associated with the trait in different directions, the pooling may weaken the signal and thereby reduce statistical power. In the present paper, we propose a backward support vector machine (BSVM)-based variant selection procedure to identify informative disease-associated rare variants. In the selection procedure, the rare variants are weighted and collapsed according to whether they are positively or negatively associated with the disease; the disease may in turn be associated with common variants and with rare variants that have protective, deleterious, or neutral effects. This nonparametric variant selection procedure is able to account for confounding factors and can also be adopted in other regression frameworks. The results of a simulation study and a data example show that the proposed BSVM approach is more powerful than four other approaches under the considered scenarios, while maintaining valid type I error rates.
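The cancellation effect described above, and the benefit of collapsing variants by direction of association, can be illustrated with a small toy example. This is only a sketch under stated assumptions and is not the BSVM procedure itself: the genotype matrix is made up, the helper names (`estimate_signs`, `burden`, `mean_difference`) are hypothetical, and the direction of each variant is estimated here by a simple case-minus-control difference in allele counts rather than by a support vector machine.

```python
from statistics import mean

def estimate_signs(genotypes, labels):
    """Assumed stand-in for direction estimation: +1, -1, or 0 per variant,
    taken from the case-minus-control difference in mean allele count."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    signs = []
    for j in range(len(genotypes[0])):
        diff = mean(g[j] for g in cases) - mean(g[j] for g in controls)
        signs.append((diff > 0) - (diff < 0))
    return signs

def burden(genotypes, signs):
    """Collapse each individual's variants into one signed score."""
    return [sum(s * x for s, x in zip(signs, g)) for g in genotypes]

def mean_difference(scores, labels):
    """Mean score in cases minus mean score in controls."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    return mean(case_scores) - mean(control_scores)

# Made-up data: rows are individuals, columns are four rare variants.
# Variants 0 and 1 appear only in cases, variants 2 and 3 only in controls.
genotypes = [
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0],  # cases
    [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0],  # controls
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Direction-blind pooling: every variant counts as +1, so signals cancel.
unsigned_diff = mean_difference(burden(genotypes, [1, 1, 1, 1]), labels)

# Direction-aware pooling: variants enter with their estimated sign.
signs = estimate_signs(genotypes, labels)
signed_diff = mean_difference(burden(genotypes, signs), labels)

print(signs)          # [1, 1, -1, -1]
print(unsigned_diff)  # 0.0
print(signed_diff)    # 1.5
```

In this toy data the direction-blind burden score has the same mean in cases and controls, whereas the signed score separates the two groups. Because the signs are estimated from the same data used to compare the groups, a real analysis would need a permutation scheme or similar safeguard to keep the type I error rate valid; that step is omitted here.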