Yu Ying, Chen Siyuan, Jones Samantha Jean, Hoque Rawnak, Vishnyakova Olga, Brooks-Wilson Angela, McNeney Brad
Hum Hered. 2022 Jun 29. doi: 10.1159/000525650.
Increasingly, logistic regression methods for genetic association studies of binary phenotypes must be able to accommodate data sparsity, which arises from unbalanced case-control ratios and/or rare genetic variants. Sparseness leads to maximum likelihood estimators (MLEs) of log-OR parameters that are biased away from their null value of zero and to tests with inflated type 1 error rates. Different penalized-likelihood methods have been developed to mitigate sparse-data bias. We study penalized logistic regression using a class of log-F priors indexed by a shrinkage parameter m to shrink the biased MLE towards zero. For a given m, log-F-penalized logistic regression may be easily implemented using data augmentation and standard software.
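The data-augmentation idea can be sketched as follows. A log-F(m, m) prior density on a log-OR is proportional to a binomial likelihood with m/2 successes and m/2 failures, so the penalty can be imposed by appending pseudo-observations and fitting an ordinary weighted logistic regression. The sketch below is illustrative only: the function names `fit_weighted_logistic` and `logf_augment`, the toy 2x2 counts, and the choice m = 2 are assumptions, and the hand-written Newton-Raphson fit stands in for whatever standard software would be used in practice.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression by Newton-Raphson.

    X must already contain any intercept column; w holds per-row weights.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # score vector
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def logf_augment(X, y, m, penalized_col):
    """Append pseudo-data encoding a log-F(m, m) prior on one coefficient.

    Two pseudo-rows are added, one "case" and one "control", each with
    weight m/2. The penalized covariate is set to 1 and every other column,
    including the intercept, is set to 0.
    """
    pseudo = np.zeros((2, X.shape[1]))
    pseudo[:, penalized_col] = 1.0
    X_aug = np.vstack([X, pseudo])
    y_aug = np.concatenate([y, [1.0, 0.0]])
    w_aug = np.concatenate([np.ones(len(y)), [m / 2.0, m / 2.0]])
    return X_aug, y_aug, w_aug

# Hypothetical sparse data: 50 cases (3 carriers) and 50 controls (1 carrier).
g = np.array([1] * 3 + [0] * 47 + [1] * 1 + [0] * 49, dtype=float)
y = np.array([1] * 50 + [0] * 50, dtype=float)
X = np.column_stack([np.ones_like(g), g])  # columns: intercept, variant

# Unpenalized MLE versus log-F(2, 2)-penalized estimate of the variant log-OR.
beta_mle = fit_weighted_logistic(X, y, np.ones(len(y)))
X_aug, y_aug, w_aug = logf_augment(X, y, m=2.0, penalized_col=1)
beta_pen = fit_weighted_logistic(X_aug, y_aug, w_aug)

print("MLE log-OR:      ", beta_mle[1])
print("penalized log-OR:", beta_pen[1])
```

Because the pseudo-rows have a zero in the intercept column, only the variant coefficient is penalized; larger values of m put more weight on the pseudo-data and shrink the estimate more strongly towards zero.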
We propose a two-step approach to the analysis of a genetic association study: first, a set of variants that show evidence of association with the trait is used to estimate m; and second, the estimated m is used for log-F-penalized logistic regression analyses of all variants using data augmentation with standard software. Our estimate of m is the maximizer of a marginal likelihood obtained by integrating the latent log-ORs out of the joint distribution of the parameters and observed data. We consider two approximate approaches to maximizing the marginal likelihood: (i) a Monte Carlo EM algorithm (MCEM) and (ii) a Laplace approximation (LA) to each integral, followed by derivative-free optimization of the approximation.
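The quantities involved can be written out as a rough sketch. It makes a simplifying assumption that the abstract does not state: each of the K variants used to estimate m contributes a likelihood L_k(β_k) in its log-OR alone, with nuisance parameters such as the intercept ignored. The symbols K, L_k, h_k, and the mode β̃_k are notation introduced here for illustration.

```latex
% log-F(m, m) prior density on a log-OR \beta
f(\beta \mid m) = \frac{1}{B(m/2,\, m/2)} \, \frac{e^{m\beta/2}}{(1 + e^{\beta})^{m}}

% marginal likelihood of m, with the latent log-ORs integrated out
L(m) = \prod_{k=1}^{K} \int L_k(\beta_k)\, f(\beta_k \mid m)\, d\beta_k

% Laplace approximation to each integral, where
% h_k(\beta) = \log L_k(\beta) + \log f(\beta \mid m) and
% \tilde\beta_k maximizes h_k
\int e^{h_k(\beta)}\, d\beta \;\approx\; e^{h_k(\tilde\beta_k)} \sqrt{\frac{2\pi}{-h_k''(\tilde\beta_k)}}
```

Under this sketch, the LA approach replaces each integral in L(m) by its approximation and then maximizes the resulting function of m without derivatives, while MCEM treats the β_k as missing data and approximates the E-step by Monte Carlo sampling.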
We evaluate the statistical properties of our proposed two-step method and compare its performance to that of other shrinkage methods in a simulation study. The simulations suggest that the proposed log-F-penalized approach has lower bias and mean squared error than the other methods considered. We also illustrate the approach on data from a study of genetic associations with "super senior" cases and middle-aged controls.
DISCUSSION/CONCLUSION: We have proposed a method for single rare variant analysis with binary phenotypes by logistic regression penalized by log-F priors. Our method has the advantage of being easily extended to correct for confounding due to population structure and genetic relatedness through a data augmentation approach.