State Key Laboratory of Biocatalysis and Enzyme Engineering, Hubei Collaborative Innovation Center for Green Transformation of Bio-Resources, Hubei Key Laboratory of Industrial Biotechnology, School of Life Sciences, Hubei University, Wuhan, 430062, China.
School of Computer Science and Engineering, Guangdong Province Key Laboratory of Computational Science, and National Engineering Laboratory for Big Data Analysis and Application, Sun Yat-Sen University, Guangzhou, 510275, China.
BMC Bioinformatics. 2022 Jan 6;23(1):19. doi: 10.1186/s12859-021-04533-6.
A gene-specific sweep is a selection process in which an advantageous mutation, together with nearby neutral sites in a gene region, increases in frequency in the population. Gene-specific sweeps have been shown to play important roles in ecological differentiation or phenotypic divergence in microbial populations. Identifying them in microorganisms will therefore not only provide insights into evolutionary mechanisms but also reveal potential genetic markers associated with biological phenotypes. However, current methods were developed mainly for detecting selective sweeps in eukaryotic data with sparse genotypes and are not readily applicable to prokaryotic data. Furthermore, these methods have not sufficiently addressed several challenges, such as the low spatial resolution of sweep regions and the lack of consideration of the spatial distribution of mutations.
We propose a novel gene-centric and spatially aware approach for identifying gene-specific sweeps in prokaryotes and have implemented it in a Python tool, SweepCluster. Our method searches genotype datasets for gene regions with a high level of spatial clustering of pre-selected polymorphisms, assuming a null distribution model of neutral selection. Polymorphisms are pre-selected based on their genetic signatures, such as elevated population subdivision, excessive linkage disequilibrium, or significant phenotype association. Performance evaluation using simulated data showed that the sensitivity and specificity of the clustering algorithm in SweepCluster are both above 90%. Applying SweepCluster to two real datasets from the bacteria Streptococcus pyogenes and Streptococcus suis showed that pre-selection had a dramatic impact and significantly reduced uninformative signals. We validated our method using genotype data from Vibrio cyclitrophicus, the only available dataset of gene-specific sweeps in bacteria, and obtained a concordance rate of 78%. We note that the concordance rate may be underestimated because of differences in reference genomes and clustering strategies. Application to human genotype datasets showed that SweepCluster is also applicable to eukaryotic data and is able to recover 80% of a catalog of known sweep regions.
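The idea of testing gene regions for an excess of pre-selected polymorphisms against a null model can be illustrated with a minimal sketch. This is not SweepCluster's actual algorithm or API: it swaps in a simple hypergeometric enrichment test, and the function names (`hypergeom_sf`, `gene_enrichment`), the input layout, and the null assumption (pre-selected SNPs are a uniformly random subset of all SNPs) are all hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n SNPs are drawn from N total, K of which are pre-selected."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper, so
    # impossible combinations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def gene_enrichment(all_positions, selected_positions, genes):
    """Score each gene for an excess of pre-selected SNPs.

    Assumed inputs (hypothetical layout):
      all_positions      -- genome coordinates of every SNP in the dataset
      selected_positions -- subset of all_positions that passed pre-selection
      genes              -- dict mapping gene name to (start, end), inclusive
    Returns a dict: name -> (selected SNPs in gene, all SNPs in gene, p-value).
    """
    selected = set(selected_positions)
    N = len(all_positions)
    K = len(selected)
    results = {}
    for name, (start, end) in genes.items():
        in_gene = [p for p in all_positions if start <= p <= end]
        n = len(in_gene)
        k = sum(1 for p in in_gene if p in selected)
        results[name] = (k, n, hypergeom_sf(k, N, K, n))
    return results


if __name__ == "__main__":
    # Toy data: 20 evenly spaced SNPs, 4 pre-selected ones packed into geneA.
    snps = list(range(100, 2100, 100))
    chosen = [500, 600, 700, 800]
    regions = {"geneA": (450, 850), "geneB": (1000, 1500)}
    for gene, (k, n, p) in gene_enrichment(snps, chosen, regions).items():
        print(f"{gene}: {k}/{n} pre-selected SNPs, p = {p:.3g}")
```

A gene with a small p-value carries more pre-selected SNPs than expected by chance under this simplified null. A real analysis would also need to account for the distances between SNPs within and across genes and for multiple testing, neither of which this sketch attempts.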
SweepCluster is applicable to a broad category of datasets. It will be valuable for detecting gene-specific sweeps in diverse genotypic data and will provide novel insights into adaptive evolution.