Marin Wesley M, Augusto Danillo G, Wade Kristen J, Hollenbach Jill A
Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, CA, United States.
Department of Biological Sciences, University of North Carolina Charlotte, Charlotte, NC, United States.
bioRxiv. 2023 Jul 19:2023.07.18.549551. doi: 10.1101/2023.07.18.549551.
The complement component 4 (C4) gene locus, composed of the C4A and C4B genes and located on chromosome 6, encodes C4 protein, a key intermediate in the classical and lectin pathways of the complement system. The complement system is an important modulator of immune system activity and is also involved in the clearance of immune complexes and cellular debris. The C4 gene locus exhibits copy number variation, with each composite gene varying between 0 and 5 copies per haplotype. C4 genes also vary in size depending on the presence of the HERV retrovirus in intron 9, denoted L for long-form and S for short-form; the HERV insertion modulates expression and is found in both C4A and C4B. Additionally, the human blood group antigens Rodgers and Chido are located on the C4 protein, with the Rodgers epitope generally found on C4A protein and the Chido epitope generally found on C4B protein. C4 copy number variation has been implicated in numerous autoimmune and pathogenic diseases. Despite the central role of C4 in immune function and regulation, high-throughput genomic sequence analysis of C4 variants has been impeded by the high degree of sequence similarity and complex genetic variation exhibited by these genes. To investigate C4 variation using genomic sequencing data, we have developed a novel bioinformatic pipeline for comprehensive, high-throughput characterization of human C4 sequence from short-read sequencing data, named C4Investigator. Using paired-end targeted or whole genome sequence data as input, C4Investigator determines gene copy number for overall C4 as well as for C4A and C4B. Additionally, C4Investigator reports the full overall C4 aligned sequence, enabling nucleotide-level analysis of C4. To demonstrate the utility of this workflow, we have analyzed C4 variation in the 1000 Genomes Project dataset, showing that the C4 genes are highly poly-allelic, with many variants that have the potential to impact C4 protein function.