PacBio, Menlo Park, CA, USA.
Genomic Medicine Center, Children's Mercy Kansas City, Kansas City, MO, USA; UMKC School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA; Department of Pediatrics, Children's Mercy Kansas City, Kansas City, MO, USA.
Am J Hum Genet. 2023 Feb 2;110(2):240-250. doi: 10.1016/j.ajhg.2023.01.001. Epub 2023 Jan 19.
Spinal muscular atrophy, a leading cause of early infant death, is caused by bi-allelic mutations of SMN1. Sequence analysis of SMN1 is challenging due to high sequence similarity with its paralog SMN2. Both genes have variable copy numbers across populations. Furthermore, without pedigree information, it is currently not possible to identify silent carriers (2+0) with two copies of SMN1 on one chromosome and zero copies on the other. We developed Paraphase, an informatics method that identifies full-length SMN1 and SMN2 haplotypes, determines the gene copy numbers, and calls phased variants using long-read PacBio HiFi data. The SMN1 and SMN2 copy-number calls by Paraphase are highly concordant with orthogonal methods (99.2% for SMN1 and 100% for SMN2). We applied Paraphase to 438 samples across 5 ethnic populations to conduct a population-wide haplotype analysis of these highly homologous genes. We identified major SMN1 and SMN2 haplogroups and characterized their co-segregation through pedigree-based analyses. We identified two SMN1 haplotypes that form a common two-copy SMN1 allele in African populations. Testing positive for these two haplotypes in an individual with two copies of SMN1 gives a silent carrier risk of 88.5%, which is significantly higher than the currently used marker (1.7%-3.0%). Extending beyond simple copy-number testing, Paraphase can detect pathogenic variants and enable potential haplotype-based screening of silent carriers through statistical phasing of haplotypes into alleles. Future analysis of larger population data will allow identification of more diverse haplotypes and genetic markers for silent carriers.
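The silent-carrier problem described above can be illustrated with a minimal sketch. It assumes the SMN1 haplotypes have already been assigned to the two chromosomes, which is the step that requires pedigree information or haplotype-based phasing. The function name and labels below are hypothetical and are not part of Paraphase; treating any configuration with two or more copies on one chromosome and none on the other as a silent carrier is an assumption that extends the 2+0 definition.

```python
def classify_smn1_status(copies_per_chromosome):
    """Label an individual from the SMN1 copy count on each chromosome.

    Hypothetical helper for illustration only; not Paraphase's API.
    """
    # Order the two counts so the larger one comes first.
    high, low = sorted(copies_per_chromosome, reverse=True)
    if high == 0:
        return "no SMN1 copies (0+0)"
    if low == 0:
        if high == 1:
            return "carrier (1+0)"
        # Assumption: two or more copies on one chromosome, none on the other.
        return f"silent carrier ({high}+0)"
    return f"non-carrier ({high}+{low})"


# Both individuals have a total SMN1 copy number of 2,
# so a copy-number test alone cannot tell them apart.
for counts in [(1, 1), (2, 0)]:
    print(sum(counts), classify_smn1_status(counts))
```

Both example individuals carry two SMN1 copies in total, which is why copy-number testing alone cannot separate a 1+1 non-carrier from a 2+0 silent carrier.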