Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA.

Author information

Jung Yoonsuh, Huang Jianhua Z, Hu Jianhua

Affiliations

Department of Statistics, University of Waikato, Private Bag 3105, Hamilton 3240, New Zealand.

Department of Statistics, Texas A&M University, College Station, TX, USA, and Special Term Professor at ISEM, Capital University of Economics and Business, Beijing, China.

Publication information

J Am Stat Assoc. 2014 Dec 1;109(508):1355-1367. doi: 10.1080/01621459.2014.928217.

DOI:10.1080/01621459.2014.928217

PMID:25642005

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC4310485/

Abstract

In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs compared with the sample size inhibits the application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit-transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.
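To make the model structure in the abstract concrete, here is a minimal Python sketch of a logistic ANOVA with a reduced-rank, L1-penalized interaction term. It is an illustration under stated assumptions, not the authors' implementation: the binary genotype coding, the symbols `X`, `a`, `beta`, `A`, `V`, and `lam`, and all function names are hypothetical, and the sketch only evaluates the penalized objective rather than running the Majorization-Minimization algorithm or the modified BIC selection.

```python
import numpy as np

def logit_means(X, a, beta, A, V):
    """Logit-scale means: theta[i, j] = x_i.a + beta[j] + x_i^T (A V^T)[:, j].

    X: (n, q) phenotype/clinical variables, a: (q,) their main effects,
    beta: (p,) SNP effects, A: (q, r) and V: (p, r) low-rank interaction factors.
    """
    return (X @ a)[:, None] + beta[None, :] + X @ (A @ V.T)

def penalized_nll(Y, theta, V, lam):
    """Bernoulli negative log-likelihood plus an L1 penalty on V."""
    # log(1 + exp(theta)) is evaluated stably via logaddexp.
    nll = np.sum(np.logaddexp(0.0, theta) - Y * theta)
    return nll + lam * np.abs(V).sum()

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Simulated example (all sizes and values are assumptions for illustration).
rng = np.random.default_rng(0)
n, p, q, r = 60, 100, 2, 1
# Column 0: disease status (0/1); column 1: one continuous clinical covariate.
X = np.column_stack([rng.integers(0, 2, n), rng.normal(size=n)])
a = np.array([0.3, -0.2])
beta = rng.normal(scale=0.5, size=p)
A = rng.normal(size=(q, r))
V = np.zeros((p, r))
V[:5, 0] = [1.5, -2.0, 1.0, -1.2, 0.8]  # only the first 5 SNPs are associated

theta = logit_means(X, a, beta, A, V)
# Assumed binary genotype coding (for example carrier vs non-carrier).
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-theta)))

# A zero row of V means that SNP has no interaction with any variable in X.
selected = np.flatnonzero(np.any(soft_threshold(V, 0.5) != 0, axis=1))
print("objective:", penalized_nll(Y, theta, V, lam=1.0))
print("SNPs with nonzero interaction loadings:", selected)
```

In this sketch the low rank `r` keeps the number of interaction parameters at roughly `(q + p) * r` instead of `q * p`, and the L1 penalty on `V` is what would drive the loadings of unassociated SNPs to exactly zero during fitting.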
