Department of Computer Science, University of York, York, North Yorkshire, United Kingdom.
Genet Epidemiol. 2013 Jan;37(1):69-83. doi: 10.1002/gepi.21686. Epub 2012 Oct 3.
Large population biobanks of unrelated individuals have been highly successful in detecting common genetic variants affecting diseases of public health concern. However, they lack the statistical power to detect more modest gene-gene and gene-environment interaction effects, or the effects of rare variants, for which related individuals are ideally required. In reality, most large population studies will undoubtedly contain sets of undeclared relatives, or pedigrees. Although a crude measure of relatedness might sometimes suffice, a good estimate of the true pedigree would be much more informative if it could be obtained efficiently. Relatives are more likely to share longer haplotypes around disease susceptibility loci and are hence biologically more informative for rare variants than unrelated cases and controls. Distant relatives are arguably more useful for detecting variants with small effects because they are less likely to share masking environmental effects. Moreover, identifying relatives enables appropriate adjustment of statistical analyses that typically assume unrelatedness. We propose an integer linear programming optimisation approach to pedigree learning, adapted to find valid pedigrees by imposing appropriate constraints. Our method is not restricted to small pedigrees and is guaranteed to return a maximum likelihood pedigree. With additional constraints, we can also search for multiple high-probability pedigrees and thus account for the inherent uncertainty in any particular pedigree reconstruction. When all individuals are observed, the true pedigree is found very quickly compared with other methods. Extensions to more complex problems seem feasible.
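To make the idea of a constrained maximum likelihood pedigree concrete, the following is a minimal sketch and not the authors' implementation. It swaps the integer linear programming solver for plain exhaustive enumeration over the same kind of decision (one parent set per individual), so it is only usable on toy inputs. The function names `is_acyclic` and `best_pedigree`, the validity rules (at most one father and one mother, no individual is its own ancestor) and the log-likelihood scores are all illustrative assumptions; none of them are taken from the paper.

```python
from itertools import product


def is_acyclic(parents):
    """Return True if no individual is its own ancestor."""
    state = {}  # 1 = on the current path, 2 = fully explored

    def visit(v):
        if state.get(v) == 2:
            return True
        if state.get(v) == 1:
            return False  # we came back to a node on the current path
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok

    return all(visit(v) for v in parents)


def best_pedigree(sex, scores):
    """Exhaustively find the highest-scoring valid pedigree.

    sex:    dict individual -> 'M' or 'F'
    scores: dict (child, parent_tuple) -> hypothetical local log-likelihood;
            parent sets absent from `scores` are treated as not allowed.
    Returns (total_score, dict child -> parent_tuple).
    """
    people = sorted(sex)
    options = []
    for child in people:
        opts = []
        for (c, pset), s in scores.items():
            if c != child or child in pset or len(pset) > 2:
                continue
            sexes = [sex[p] for p in pset]
            if len(set(sexes)) != len(sexes):
                continue  # two fathers or two mothers is not a valid parent set
            opts.append((pset, s))
        options.append(opts)

    best = (float("-inf"), None)
    # Enumerate one parent set per individual (brute force, not an ILP solve).
    for choice in product(*options):
        parents = {c: ps for c, (ps, _) in zip(people, choice)}
        total = sum(s for _, s in choice)
        if total > best[0] and is_acyclic(parents):
            best = (total, parents)
    return best


# Toy data: all scores below are made up for illustration.
sex = {"A": "M", "B": "F", "C": "M"}
scores = {
    ("A", ()): -1.0,
    ("A", ("C",)): -0.5,   # tempting, but creates a cycle with C <- (A, B)
    ("B", ()): -1.0,
    ("C", ()): -3.0,
    ("C", ("A",)): -1.5,
    ("C", ("A", "B")): -0.75,
}
total, pedigree = best_pedigree(sex, scores)
print(total, pedigree)
```

In this toy input the highest-scoring unconstrained choice would make A a child of C and C a child of A, which is not a valid pedigree; the acyclicity check rules it out, so the search settles on the best valid alternative. An ILP formulation would encode the same requirements as linear constraints over binary variables rather than enumerating every combination.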