Ma Yanyuan, Wang Yuanjia
Department of Statistics, Texas A&M University, College Station, TX 77845.
Electron J Stat. 2012;6:710-737. doi: 10.1214/12-EJS690.
We study efficient nonparametric estimation of the distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples in which the sub-population identifiers are missing. Only the probability that each observation belongs to each sub-population is available. The problem arises in several biomedical settings, such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives, where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. In these studies, however, subjects' genotypes may not be directly observed, so the distribution of the trait outcome is a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators, which includes one type of nonparametric maximum likelihood estimator (NPMLE) as well as least squares and weighted least squares estimators. We identify the efficient estimator in this class, which reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, close inspection of two commonly used NPMLEs in these problems yields a surprising result: the NPMLE in one form is highly inefficient, while in the other form it is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.
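As an illustration of the least squares member of the class mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes the mixture model P(Y_i <= t) = q_i' F(t), where q_i is the known vector of membership probabilities for observation i and F(t) stacks the sub-population distribution functions. The function name `ols_mixture_cdf`, the two-component normal simulation, and the chosen probability vectors are all hypothetical choices made for the example.

```python
import numpy as np

def ols_mixture_cdf(y, q, grid):
    """Ordinary least squares estimate of sub-population CDFs (illustrative sketch).

    y    : (n,) observed outcomes
    q    : (n, p) known probabilities that each observation belongs to each sub-population
    grid : (m,) points t at which to estimate the CDFs
    Returns an (m, p) array whose column k estimates F_k at the grid points.
    """
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Indicator responses 1(Y_i <= t), one column per grid point.
    indicators = (y[:, None] <= grid[None, :]).astype(float)
    # Regress each indicator column on q: solves min ||q F(t) - 1(Y <= t)||^2.
    coef, *_ = np.linalg.lstsq(q, indicators, rcond=None)
    return coef.T

# Hypothetical simulated data: two sub-populations, N(0, 1) and N(2, 1).
rng = np.random.default_rng(0)
n = 4000
q_levels = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]])  # assumed probability vectors
q = q_levels[rng.integers(0, 3, size=n)]
labels = (rng.random(n) < q[:, 1]).astype(int)  # latent membership, never used by the estimator
y = rng.normal(loc=np.where(labels == 0, 0.0, 2.0), scale=1.0)

grid = np.array([0.0, 1.0, 2.0])
F_hat = ols_mixture_cdf(y, q, grid)
print(F_hat)  # rows: grid points; columns: sub-populations
```

The estimate at each point is an unconstrained regression coefficient, so it is not guaranteed to be monotone in t or to lie in [0, 1]. The sketch only illustrates how the distribution functions can be recovered from membership probabilities alone; it makes no attempt at the efficient estimator the abstract describes.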