Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco, San Francisco, California, USA.
Department of Biological Sciences, University of North Carolina Charlotte, Charlotte, North Carolina, USA.
HLA. 2024 Jan;103(1):e15273. doi: 10.1111/tan.15273. Epub 2023 Oct 29.
The complement component 4 gene locus, composed of the C4A and C4B genes and located on chromosome 6, encodes complement component 4 (C4) proteins, key intermediates in the classical and lectin pathways of the complement system. The complement system is an important modulator of immune system activity and is also involved in the clearance of immune complexes and cellular debris. The C4A and C4B genes exhibit copy number variation, with each gene varying between 0 and 5 copies per haplotype. C4A and C4B genes also vary in size depending on the presence of a human endogenous retrovirus (HERV) insertion in intron 9, denoted C4(L) for the long form and C4(S) for the short form; the insertion affects expression and is found in both C4A and C4B. Additionally, the human blood group antigens Rodgers and Chido are located on the C4 protein, with the Rodgers epitope generally found on C4A protein and the Chido epitope generally found on C4B protein. C4A and C4B copy number variation has been implicated in numerous autoimmune and pathogenic diseases. Despite the central role of C4 in immune function and regulation, high-throughput genomic sequence analysis of C4A and C4B variants has been impeded by the high degree of sequence similarity and complex genetic variation exhibited by these genes. To investigate C4 variation using genomic sequencing data, we have developed a novel bioinformatic pipeline, named C4Investigator, for comprehensive, high-throughput characterization of human C4A and C4B sequences from short-read sequencing data. Using paired-end targeted or whole genome sequence data as input, C4Investigator determines the overall C4 gene copy number, as well as the copy numbers of C4A, C4B, C4(Rodger), C4(Ch), C4(L), and C4(S). Additionally, C4Investigator reports the full aligned C4A and C4B sequences, enabling nucleotide-level analysis. To demonstrate the utility of this workflow, we have analyzed C4A and C4B variation in the 1000 Genomes Project data set, showing that these genes are highly poly-allelic, with many variants that have the potential to impact C4 protein function.
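The copy number categories named in the abstract overlap: each C4 gene copy is either C4A or C4B and, independently, either long or short. The sketch below illustrates one way such calls could be sanity-checked after the fact. It is not C4Investigator's code or output format; the class `C4CopyNumbers`, its field names, and the diploid bound derived from the stated 0 to 5 copies per haplotype are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: upper bound per gene per haplotype, taken from the abstract's
# stated range of 0 to 5 copies; a diploid sample carries two haplotypes.
MAX_PER_HAPLOTYPE = 5
MAX_PER_SAMPLE = 2 * MAX_PER_HAPLOTYPE


@dataclass(frozen=True)
class C4CopyNumbers:
    """Hypothetical per-sample (diploid) summary of C4 copy number calls."""
    c4a: int       # copies of C4A
    c4b: int       # copies of C4B
    c4_long: int   # copies carrying the HERV insertion, C4(L)
    c4_short: int  # copies lacking the HERV insertion, C4(S)

    @property
    def total(self) -> int:
        """Overall C4 copy number, taken as C4A + C4B."""
        return self.c4a + self.c4b

    def is_consistent(self) -> bool:
        """Check that the A/B and L/S partitions describe the same gene copies."""
        counts = (self.c4a, self.c4b, self.c4_long, self.c4_short)
        if any(n < 0 for n in counts):
            return False
        if self.c4a > MAX_PER_SAMPLE or self.c4b > MAX_PER_SAMPLE:
            return False
        # Every copy is either long or short, so L + S must equal A + B.
        return self.c4_long + self.c4_short == self.total


# Example: two C4A and two C4B copies, three of them long-form.
sample = C4CopyNumbers(c4a=2, c4b=2, c4_long=3, c4_short=1)
print(sample.total, sample.is_consistent())
```

A check of this kind only flags internally contradictory calls; it says nothing about whether the individual copy numbers are correct.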