Yang Qimeng, Sun Jianfeng, Wang Xinyu, Wang Jiong, Liu Quanzhong, Ru Jinlong, Zhang Xin, Wang Sizhe, Hao Ran, Bian Peipei, Dai Xuelei, Gong Mian, Zhang Zhuangbiao, Wang Ao, Bai Fengting, Li Ran, Cai Yudong, Jiang Yu
Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, China.
Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK.
Nat Commun. 2025 Mar 11;16(1):2406. doi: 10.1038/s41467-025-57756-z.
Structural variations (SVs) are diverse forms of genetic alteration and drive a wide range of human diseases. Accurately genotyping SVs from short-read sequencing data, particularly those occurring in repetitive genomic regions, remains challenging. Here, we introduce SVLearn, a machine-learning approach for genotyping bi-allelic SVs. It uses a dual-reference strategy to engineer a curated set of genomic, alignment, and genotyping features based on a reference genome together with an allele-based alternative genome. Using 38,613 human-derived SVs, we show that SVLearn significantly outperforms four state-of-the-art tools, with precision improvements of up to 15.61% for insertions and 13.75% for deletions in repetitive regions. On two additional sets of 121,435 cattle SVs and 113,042 sheep SVs, SVLearn generalizes well to genotyping SVs across species, with a weighted genotype concordance score of up to 90%. Notably, SVLearn enables accurate genotyping of SVs at low sequencing coverage, with accuracy comparable to that at 30× coverage. Our studies suggest that SVLearn can accelerate the understanding of associations between genome-scale, high-quality genotyped SVs and diseases across multiple species.
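The abstract reports a weighted genotype concordance score but does not define it. As a rough illustration only, the sketch below assumes the definition commonly used in SV genotyping benchmarks: the unweighted mean of the per-class concordance over the truth genotype classes 0/0, 0/1, and 1/1, so that the abundant 0/0 class does not dominate the score. The function name `weighted_genotype_concordance` and the toy genotypes are hypothetical and are not taken from SVLearn.

```python
from collections import defaultdict

# Bi-allelic genotype classes; the class set is an assumption for this sketch.
GENOTYPE_CLASSES = ("0/0", "0/1", "1/1")


def weighted_genotype_concordance(truth, called):
    """Mean of per-class concordance over the truth genotype classes.

    Assumed definition (not confirmed by the abstract): for each genotype
    class present in the truth set, compute the fraction of its sites whose
    called genotype matches, then average those fractions.
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")

    total = defaultdict(int)    # sites per truth genotype class
    correct = defaultdict(int)  # correctly called sites per class
    for t, c in zip(truth, called):
        total[t] += 1
        if t == c:
            correct[t] += 1

    # Only classes that actually occur in the truth set contribute.
    per_class = [correct[g] / total[g] for g in GENOTYPE_CLASSES if total[g] > 0]
    if not per_class:
        raise ValueError("no sites with a recognised genotype class")
    return sum(per_class) / len(per_class)


# Toy data: six 0/0 sites, two 0/1 sites, two 1/1 sites.
truth = ["0/0"] * 6 + ["0/1"] * 2 + ["1/1"] * 2
called = ["0/0"] * 6 + ["0/1", "0/0"] + ["0/1", "0/1"]

wgc = weighted_genotype_concordance(truth, called)
plain_accuracy = sum(t == c for t, c in zip(truth, called)) / len(truth)
print(f"weighted genotype concordance: {wgc:.2f}")
print(f"plain accuracy: {plain_accuracy:.2f}")
```

On this toy input the plain accuracy is inflated by the correctly called 0/0 sites, while the class-averaged score penalises the missed 1/1 calls, which is why a class-balanced metric is often preferred when most sites are homozygous reference.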