Wang Yuanjia, Liang Baosheng, Tong Xingwei, Marder Karen, Bressman Susan, Orr-Urtreger Avi, Giladi Nir, Zeng Donglin
Department of Biostatistics, Mailman School of Public Health, 722 W. 168th Street, New York 10032, U.S.A.
School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China.
Biometrika. 2015 Sep 1;102(3):515-532. doi: 10.1093/biomet/asv030.
With an increasing number of causal genes discovered for complex human disorders, it is crucial to assess the genetic risk of disease onset for individuals who carry these causal mutations and to compare their distribution of age-at-onset with that of non-carriers. In many genetic epidemiological studies that aim to estimate the effect of a causal gene on disease, the age-at-onset is subject to censoring. In addition, the mutation carrier status of some individuals can be unknown, either because in-person ascertainment to collect DNA samples is costly or because older individuals have died. Instead, the probability that such an individual carries the mutation can be obtained from various sources. When mutation status is missing, the available data take the form of censored mixture data. Various methods have recently been proposed for risk estimation from such data, but none is efficient for estimating a nonparametric distribution. We propose a fully efficient sieve maximum likelihood estimation method, in which we estimate the logarithm of the hazard ratio between genetic mutation groups using B-splines, while applying nonparametric maximum likelihood estimation to the reference baseline hazard function. Our estimator can be calculated via an expectation-maximization algorithm, which is much faster than existing methods. We show that our estimator is consistent and semiparametrically efficient, and we establish its asymptotic distribution. Simulation studies demonstrate superior performance of the proposed method, which is applied to estimating the distribution of the age-at-onset of Parkinson's disease for carriers of mutations in the leucine-rich repeat kinase 2 gene.
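To make the structure of censored mixture data concrete, the following is a minimal sketch of an expectation-maximization iteration for such data. It is not the sieve estimator described in the abstract: as a loudly stated simplification, it assumes a constant (exponential) hazard in each mutation group in place of the B-spline log hazard ratio and the nonparametric baseline hazard. All function names, the simulated sample, and the parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def lik(rate, y, delta):
    """Exponential contribution: density if onset is observed (delta=1),
    survival probability if the observation is censored (delta=0)."""
    return (rate ** delta) * math.exp(-rate * y)


def loglik(rates, data):
    """Observed-data log-likelihood of the censored mixture."""
    lam0, lam1 = rates
    return sum(
        math.log(p * lik(lam1, y, d) + (1 - p) * lik(lam0, y, d))
        for y, d, p in data
    )


def em_step(rates, data):
    """One EM update of (non-carrier rate, carrier rate)."""
    lam0, lam1 = rates
    # E-step: posterior probability that each individual is a carrier,
    # given the prior carrier probability p and the observed (y, delta).
    w = []
    for y, d, p in data:
        num = p * lik(lam1, y, d)
        w.append(num / (num + (1 - p) * lik(lam0, y, d)))
    # M-step: weighted events divided by weighted time at risk.
    ev1 = sum(wi * d for wi, (y, d, p) in zip(w, data))
    t1 = sum(wi * y for wi, (y, d, p) in zip(w, data))
    ev0 = sum((1 - wi) * d for wi, (y, d, p) in zip(w, data))
    t0 = sum((1 - wi) * y for wi, (y, d, p) in zip(w, data))
    return (ev0 / t0, ev1 / t1)


def simulate(n, lam0, lam1, censor_at, seed):
    """Hypothetical sample: carrier probability is 0, 0.5 or 1; the true
    status is hidden and only (time, event indicator, probability) is kept."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        p = rng.choice([0.0, 0.5, 1.0])
        carrier = rng.random() < p
        t = rng.expovariate(lam1 if carrier else lam0)
        data.append((min(t, censor_at), int(t <= censor_at), p))
    return data


data = simulate(n=400, lam0=0.02, lam1=0.08, censor_at=40.0, seed=1)
rates = (0.05, 0.05)  # common starting value for both groups
history = [loglik(rates, data)]
for _ in range(50):
    rates = em_step(rates, data)
    history.append(loglik(rates, data))
print("estimated (non-carrier, carrier) rates:", rates)
```

In this toy setting each M-step has a closed form, and the observed-data log-likelihood stored in `history` should not decrease from one iteration to the next. The proposed method would replace the two constant rates with a nonparametric baseline hazard and a spline-based log hazard ratio.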