Department of Health Sciences, University of Leicester, Leicester, UK.
Int J Epidemiol. 2011 Dec;40(6):1629-42. doi: 10.1093/ije/dyr149.
In a recent paper by Homer et al. (Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4:e1000167), a method for detecting whether a given individual is a contributor to a particular genomic mixture was proposed. This prompted grave concern about the public dissemination of aggregate statistics from genome-wide association studies. It is of clear scientific importance that such data be shared widely, but the confidentiality of study participants must not be compromised. The issue of what summary genomic data can safely be posted on the web can be addressed satisfactorily only once the theoretical underpinnings of the proposed method are clarified and its performance is evaluated in terms of its dependence on the underlying assumptions.
The original method raised a number of concerns and several alternatives have since been proposed, including a simple linear regression approach. In our proposed generalized estimating equation approach, we maintain the simplicity of the linear regression model but obtain inferences that are more robust to approximation of the variance/covariance structure and can accommodate linkage disequilibrium.
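To make the simple linear regression alternative concrete, the following is a minimal sketch on simulated data, not the generalized estimating equation approach proposed here. It assumes independent SNPs (no linkage disequilibrium), uses the true population allele frequencies as the reference, and regresses the candidate's deviation from the reference on the study's deviation from the reference across SNPs. The function name `membership_slope` and all simulation parameters are hypothetical choices made for illustration.

```python
import numpy as np


def membership_slope(candidate, study_freq, ref_freq):
    """OLS slope of (candidate - reference) on (study - reference) across SNPs.

    Under the simplifying assumptions of this sketch, the slope is expected
    to be near 1 for a study participant and near 0 for a non-participant.
    """
    y = candidate - ref_freq
    x = study_freq - ref_freq
    x_centered = x - x.mean()
    return float(np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered))


# Simulated data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_snps, n_study = 20_000, 100

# Reference allele frequencies, assumed known without error.
pop_freq = rng.uniform(0.1, 0.9, size=n_snps)

# Genotypes coded as allele fractions 0, 0.5, 1; SNPs drawn independently.
study_geno = rng.binomial(2, pop_freq, size=(n_study, n_snps)) / 2.0
study_freq = study_geno.mean(axis=0)  # the published aggregate statistic

member = study_geno[0]                        # a true participant
outsider = rng.binomial(2, pop_freq) / 2.0    # same ancestry, not in the study

slope_member = membership_slope(member, study_freq, pop_freq)
slope_outsider = membership_slope(outsider, study_freq, pop_freq)
print(f"participant slope: {slope_member:.2f}")
print(f"non-participant slope: {slope_outsider:.2f}")
```

The sketch treats SNPs as independent and the reference frequencies as exact; correlated SNPs or a mismatched reference population would distort the naive slope and its standard error, which is the kind of sensitivity a more robust variance/covariance treatment is meant to address.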
We affirm that, in principle, it is possible to determine that a 'candidate' individual has participated in a study, given a subset of aggregate statistics from that study. However, the methods depend critically on a number of key factors including: the ancestry of participants in the study; the absolute and relative numbers of cases and controls; and the number of single nucleotide polymorphisms.
Simple guidelines for publication that are based on a single criterion are therefore unlikely to suffice. In particular, 'directed' summary statistics should not be posted openly on the web but could be protected by an internet-based access check as proposed by the P3G Consortium et al. (Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 2009;5:e1000665).