Lisi Alessandro, Campbell Michael C
Department of Biological Sciences (Human and Evolutionary Biology Section), University of Southern California, Los Angeles, CA 90089, United States.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf171.
Statistical methods that measure the extent of haplotype homozygosity on chromosomes have been highly informative for identifying episodes of recent selection. For example, the integrated haplotype score (iHS) and the extended haplotype homozygosity (EHH) statistics detect long-range haplotype structure around derived and ancestral alleles indicative of classic and soft selective sweeps, respectively. However, to our knowledge, no publicly available method currently classifies alleles in genomic datasets as ancestral or derived for the purpose of quantifying the extent of haplotype homozygosity. Here, we introduce the Polaris package, which polarizes chromosomal variants into ancestral and derived alleles and creates corresponding genetic maps for analysis by selscan and HaploSweep, two versatile haplotype-based programs that perform scans for selection. With the input files generated by Polaris, selscan and/or HaploSweep can produce the appropriate sign (positive or negative) for outlier iHS statistics, enabling users to distinguish selection on derived alleles from selection on ancestral alleles. In addition, Polaris can convert the numerical output of these analyses into graphical representations of selective sweeps, which extends the functionality of the software.
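To make the idea of polarization concrete, the following is a minimal sketch of how a single biallelic site could be recoded so that 0 always denotes the ancestral allele and 1 the derived allele. It is not the Polaris implementation: the function name `polarize_site`, its arguments, and the choice to skip sites whose ancestral state is unknown or matches neither allele are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def polarize_site(
    ref: str,
    alt: str,
    ancestral: Optional[str],
    haplotypes: List[int],
) -> Optional[Tuple[str, str, List[int]]]:
    """Recode one biallelic site so that 0 = ancestral and 1 = derived.

    `haplotypes` holds phased alleles coded 0 (REF) or 1 (ALT).
    Returns (ancestral_allele, derived_allele, recoded_haplotypes), or
    None when the site cannot be polarized (hypothetical policy: skip it).
    """
    if ancestral is None:
        return None
    anc = ancestral.upper()
    if anc == ref.upper():
        # REF is already the ancestral allele: keep the coding unchanged.
        return ref, alt, list(haplotypes)
    if anc == alt.upper():
        # ALT is ancestral: swap the labels and flip every 0/1 call.
        return alt, ref, [1 - h for h in haplotypes]
    # The ancestral state matches neither allele: the site is ambiguous.
    return None


if __name__ == "__main__":
    # ALT (G) is ancestral here, so the haplotype coding is flipped.
    print(polarize_site("A", "G", "G", [0, 1, 1, 0]))
```

Under this coding, a haplotype-based scan can attribute extended homozygosity to the derived (1) or the ancestral (0) allele at each site, and that attribution is what gives the sign of an iHS outlier its meaning.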
To demonstrate the utility of our approach, we applied the Polaris package to chromosome 2 in the European Finnish, Middle Eastern Bedouin, and East African Maasai populations. More specifically, we examined the regulatory sequence in intron 13 of the MCM6 gene associated with lactase persistence (i.e. the ability to digest the lactose sugar present in fresh milk), a region of intense interest to human evolutionary geneticists. Our analyses showed that derived alleles at known enhancers of lactase expression sit on an extended haplotype background in the Finnish, Bedouin, and Maasai, consistent with a classic selective sweep model as determined by iHS and EHH statistics. Importantly, we were able to identify the target allele under selection immediately from the information generated by our software. We also compared outlier statistics across chromosome 2 in two datasets from these populations: (i) one containing polarized alleles generated with Polaris and (ii) one containing the unpolarized alleles of the original phased VCF file. We found an excess of outlier statistics on chromosome 2 in the unpolarized datasets, raising the possibility that a subset of these "hits" of selection may be unreliable. Overall, Polaris is a versatile package that enables users to efficiently explore, interpret, and report signals of recent selection in genomic datasets.
The Polaris package is free and open source, and is available on GitHub (https://github.com/alisi1989/Polaris) and Dropbox (https://www.dropbox.com/scl/fo/mlxizft5267vem9u62qkn/AAnM0qX923zPzQBlPX8iteM?rlkey=uezrp4t2waffpj0nmo1evr320&e=1&st=jaodccws&dl=0).