Gauran Iris Ivy M, Park Junyong, Lim Johan, Park DoHwan, Zylstra John, Peterson Thomas, Kann Maricel, Spouge John L
Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, Maryland 21250, U.S.A.
School of Statistics, University of the Philippines Diliman, Quezon City, 1101, Philippines.
Biometrics. 2018 Jun;74(2):458-471. doi: 10.1111/biom.12779. Epub 2017 Sep 22.
In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, since the latter are limited in how well they capture the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, in order to account for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, in which mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate the estimation of the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
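The abstract combines a zero-inflated count model with FDR control. The following minimal Python sketch illustrates that general idea only; it is not the authors' procedure. It assumes Consul's parametrization of the generalized Poisson distribution (the abstract does not state one), uses made-up null parameters `omega`, `theta`, and `lam` rather than an estimated empirical null, and swaps in the standard Benjamini-Hochberg step-up rule for whatever FDR procedure the article uses. All function names and the toy counts are hypothetical.

```python
import math

def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf (Consul's form), assuming theta > 0 and 0 <= lam < 1."""
    log_p = (math.log(theta) + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x - math.lgamma(x + 1))
    return math.exp(log_p)

def zigp_pmf(x, omega, theta, lam):
    """Zero-inflated GP: extra point mass omega at zero, GP with weight 1 - omega."""
    base = (1.0 - omega) * gp_pmf(x, theta, lam)
    return omega + base if x == 0 else base

def upper_tail_pvalue(x, omega, theta, lam):
    """P(X >= x) under the null ZIGP, computed as 1 - P(X <= x - 1)."""
    cdf_below = sum(zigp_pmf(k, omega, theta, lam) for k in range(x))
    return max(0.0, 1.0 - cdf_below)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return booleans marking hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its BH threshold.
        if pvalues[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

# Toy mutation counts per domain position (fabricated for illustration only).
counts = [0, 0, 1, 0, 2, 0, 1, 0, 0, 15, 0, 1, 0, 0, 3]
# Assumed null parameters; in practice these would be estimated from the data.
null_params = dict(omega=0.4, theta=0.8, lam=0.2)

pvals = [upper_tail_pvalue(c, **null_params) for c in counts]
significant = benjamini_hochberg(pvals, alpha=0.05)
print([i for i, flag in enumerate(significant) if flag])
```

With these toy numbers, only the position with a count of 15 is flagged. The sketch fixes the null parameters by hand, whereas the article estimates the empirical null from counts below a data-dependent cut-off and adds a screening stage, neither of which is reproduced here.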