Ma Walfred, Chaisson Mark
bioRxiv. 2025 Apr 13:2024.08.11.607269. doi: 10.1101/2024.08.11.607269.
Copy number variant (CNV) genes are important in evolution and disease, yet sequence variation in CNV genes remains a blind spot in large-scale studies. We present ctyper, a method that leverages pangenomes to produce allele-specific copy numbers with locally phased variants from next-generation sequencing (NGS) reads. Benchmarked on 3,351 CNV genes, including HLA, SMN, and CYP2D6, and on 212 challenging medically relevant (CMR) genes that are poorly mapped by NGS, ctyper captures 96.5% of phased variants with ≥99.1% correctness of copy number on CNV genes and 94.8% of phased variants on CMR genes. Using alignment-free algorithms, ctyper requires 1.5 hours per genome on a single CPU. The results improve prediction of gene expression compared to known expression quantitative trait loci (eQTL) variants. Allele-specific expression analysis identified divergent expression in 7.94% of paralogs and tissue-specific biases in 4.68% of paralogs. We found reduced expression of SMN-2 due to SMN1 conversion, potentially affecting spinal muscular atrophy, and increased expression of translocated duplications of AMY2B. Overall, ctyper enables biobank-scale genotyping of CNV and CMR genes.
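To illustrate what "alignment-free" copy-number estimation can look like in general, the following is a minimal sketch that counts k-mers shared between reads and a gene sequence. It is not ctyper's algorithm, which the abstract does not describe in detail. The function names, the `depth_per_copy` calibration parameter, and the toy sequences are all assumptions made for illustration, and reverse-complement k-mers are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def estimate_copy_number(reads, gene_seq, k, depth_per_copy):
    """Estimate the copy number of gene_seq from reads without aligning them.

    depth_per_copy is the expected k-mer count contributed by a single gene
    copy. It is a hypothetical calibration value supplied by the caller.
    """
    targets = set(kmers(gene_seq, k))
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in targets:
                counts[km] += 1
    # The median over target k-mers is robust to a few noisy k-mers;
    # k-mers never seen in the reads contribute a count of zero.
    typical = median(counts[km] for km in targets)
    return round(typical / depth_per_copy)


# Toy example: six reads that each span the whole gene, with an assumed
# depth of two k-mer observations per gene copy.
toy_gene = "ACGTTGCAAGGCTA"
toy_reads = [toy_gene] * 6
toy_estimate = estimate_copy_number(toy_reads, toy_gene, k=5, depth_per_copy=2)
```

In this toy input every gene k-mer is observed six times, so with an assumed per-copy depth of two the estimate is three copies. A real genotyper would also need to handle sequencing errors, k-mers shared between paralogs, and both DNA strands.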