Guo Xingyu, Qin Jie, Wang Shikai, Zhong Jincheng, Liu Li, Kangzhu Yixi, Lan Daoliang, Wang Jiabo
Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization, Ministry of Education and Sichuan Province, Southwest Minzu University, Chengdu 610041, China.
Key Laboratory of Combining Farming and Animal Husbandry of Ministry of Agriculture, Institute of Animal Husbandry, Heilongjiang Academy of Agricultural Sciences, Harbin 150028, China.
Int J Mol Sci. 2025 Jun 18;26(12):5797. doi: 10.3390/ijms26125797.
Whole-genome sequencing (WGS) technology has made significant progress in obtaining the genomic information of organisms and is now the primary way to uncover genetic variation. However, due to the complexity of the genome and technical limitations, large genome segments remain ungenotyped. Imputation is a useful strategy for predicting missing genotypes. The accuracy and computing speed of imputation software are important criteria that should inform future developments in genomic research. In this study, we used the K-Means algorithm, together with multithreading, to cluster reference individuals, reducing the number and increasing the length of haplotypes within each subpopulation. We named this strategy "KBeagle". In the comparison test, the KBeagle-imputed dataset (KID) identified more single-nucleotide polymorphism (SNP) loci associated with the specified traits than the Beagle-imputed dataset (BID), while also achieving much lower false discovery rates (FDRs) and Type I error rates at the same power of detection of association signals. We envision that the main application of KBeagle will be livestock sequencing studies of populations with strong genetic structure. In summary, we have developed an accurate and efficient imputation method that improves the imputation matching rate and shortens calculation time.
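The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the KBeagle implementation: it assumes genotypes are coded as 0/1/2 allele counts with no missing values at the markers used for clustering, it uses a deterministic farthest-point initialization chosen here only for reproducibility, and all function names and sample IDs are hypothetical. The imputation step and the multithreading are not shown.

```python
# Illustrative sketch only: K-Means grouping of reference individuals by genotype.
# Assumptions (not from the abstract): 0/1/2 genotype coding, complete data,
# farthest-point initialization, hypothetical function names and sample IDs.

def squared_distance(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style K-Means; returns one cluster label per individual."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each individual joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def split_by_cluster(sample_ids, labels):
    """Group sample IDs by cluster label, giving one subpopulation per cluster."""
    groups = {}
    for sid, lab in zip(sample_ids, labels):
        groups.setdefault(lab, []).append(sid)
    return groups

# Toy reference panel: six individuals, six SNPs, two clearly separated groups.
sample_ids = ["ref1", "ref2", "ref3", "ref4", "ref5", "ref6"]
genotypes = [
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [2, 2, 1, 2, 2, 2],
    [2, 2, 2, 2, 1, 2],
    [1, 2, 2, 2, 2, 2],
]

labels = kmeans(genotypes, k=2)
groups = split_by_cluster(sample_ids, labels)
print(groups)
```

In a workflow of this kind, each resulting group would then be passed to the imputation software as its own subpopulation. The choice of k and the set of markers used for clustering are left open here, since the abstract does not specify them.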