Hierarchical boosting: a machine-learning framework to detect and classify hard selective sweeps in human populations.

Author information

Pybus Marc, Luisi Pierre, Dall'Olio Giovanni Marco, Uzkudun Manu, Laayouni Hafid, Bertranpetit Jaume, Engelken Johannes

Affiliations

Institut de Biologia Evolutiva (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain.

Institut de Biologia Evolutiva (UPF-CSIC), Universitat Pompeu Fabra, Barcelona 08003, Spain; Department of Biology, Stanford University, Stanford, CA 94305, USA.

Publication information

Bioinformatics. 2015 Dec 15;31(24):3946-52. doi: 10.1093/bioinformatics/btv493. Epub 2015 Aug 26.

DOI:10.1093/bioinformatics/btv493

PMID:26315912

Abstract

MOTIVATION

Detecting positive selection in genomic regions is a recurrent topic in population genetic studies of natural populations. However, there is little consistency among the regions detected in genome-wide scans that use different tests and/or populations. Furthermore, few methods address the challenge of classifying selective events according to specific features such as age, intensity or state (completeness).

RESULTS

We have developed a machine-learning classification framework that exploits the combined ability of some selection tests to uncover different polymorphism features expected under the hard sweep model, while controlling for population-specific demography. As a result, we achieve high sensitivity toward hard selective sweeps while adding insights about their completeness (whether a selected variant is fixed or not) and age of onset. Our method also determines the relevance of the individual methods implemented so far to detect positive selection under specific selective scenarios. We calibrated and applied the method to three reference human populations from The 1000 Genomes Project to generate a genome-wide classification map of hard selective sweeps. This study improves the detection of selective sweeps by moving beyond the classical selection versus no-selection classification strategy, and offers an explanation for the lack of consistency observed among selection tests when applied to real data. Very few signals were observed in the African population studied, even though our method has higher sensitivity under this population's demography.
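To make the idea of classifying beyond selection versus no-selection more concrete, the following is a minimal sketch of a hierarchical cascade of binary classifiers, each combining several selection-test scores linearly. Everything specific in it is an assumption for illustration only: the feature names (`haplotype_score`, `sfs_score`, `differentiation_score`), the weights, the thresholds, the order of the decisions and the class labels are hypothetical and are not taken from the paper. In a real framework the weights would presumably be learned by boosting on simulated data under a population-specific demographic model, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Standardized scores from several selection tests for one genomic window.
# The keys used below are placeholders, not the tests used in the paper.
Scores = Dict[str, float]


@dataclass
class Node:
    """One binary decision: a linear combination of test scores vs. a threshold."""
    weights: Dict[str, float]
    threshold: float
    label_below: str
    label_above: str
    child_below: Optional["Node"] = None
    child_above: Optional["Node"] = None

    def score(self, scores: Scores) -> float:
        # Weighted sum of the test scores this node uses.
        return sum(w * scores[name] for name, w in self.weights.items())

    def classify(self, scores: Scores) -> str:
        above = self.score(scores) > self.threshold
        child = self.child_above if above else self.child_below
        if child is not None:
            # Hand the window down to the next, more specific decision.
            return child.classify(scores)
        return self.label_above if above else self.label_below


# Hypothetical hand-set parameters; a trained model would supply these.
age_complete = Node({"differentiation_score": 1.0}, 1.5,
                    "complete_recent", "complete_ancient")
age_incomplete = Node({"differentiation_score": 1.0}, 1.5,
                      "incomplete_recent", "incomplete_ancient")
completeness = Node({"haplotype_score": 1.0, "sfs_score": -0.5}, 0.0,
                    "complete", "incomplete",
                    child_below=age_complete, child_above=age_incomplete)
root = Node({"haplotype_score": 0.5, "sfs_score": 0.5,
             "differentiation_score": 0.5}, 1.0,
            "neutral", "sweep", child_above=completeness)

if __name__ == "__main__":
    window = {"haplotype_score": 2.5, "sfs_score": 0.5,
              "differentiation_score": 1.0}
    print(root.classify(window))
```

The cascade only reaches the completeness and age decisions for windows that first pass the sweep-versus-neutral decision, which is one plausible way to read "hierarchical" here. Because each node uses its own weights, a test that is informative for one scenario can carry little weight in another, which is consistent with the abstract's point that individual tests differ in relevance across selective scenarios.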

AVAILABILITY AND IMPLEMENTATION

The genome-wide results for three human populations from The 1000 Genomes Project and an R-package implementing the 'Hierarchical Boosting' framework are available at http://hsb.upf.edu/.
