Gómez-González Paula Josefina, Grabowska Anna D, Tientcheu Leopold D, Tsolaki Anthony G, Hibberd Martin L, Campino Susana, Phelan Jody E, Clark Taane G
Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, United Kingdom.
Department of Biophysics, Physiology and Pathophysiology, Medical University of Warsaw, Warsaw, Poland.
Front Microbiol. 2023 Oct 9;14:1244319. doi: 10.3389/fmicb.2023.1244319. eCollection 2023.
Around 10% of the coding potential of Mycobacterium tuberculosis is constituted by two poorly understood gene families, the pe and ppe loci, thought to be involved in host-pathogen interactions. Their repetitive nature and high GC content have hindered sequence analysis, leading to their exclusion from whole-genome studies. Understanding the genetic diversity of the pe/ppe families is essential to facilitate their potential translation into tools for tuberculosis prevention and treatment.
To investigate the genetic diversity of the 169 pe/ppe genes, we performed a sequence analysis across 73 long-read assemblies representing seven different lineages of M. tuberculosis and BCG. Individual pe/ppe gene alignments were extracted, and diversity and conservation across the different lineages were studied.
The pe/ppe genes were classified into three groups based on the level of protein sequence conservation relative to H37Rv. More than 50% were conserved, with indels in two sub-families being major drivers of structural variation. Gene rearrangements, such as duplications and gene fusions, were observed among these genes. Inter-lineage diversity revealed lineage-specific SNPs and indels.
The high level of gene conservation, together with the lineage-specific findings, suggests that these genes are phylogenetically informative. However, structural variants and gene rearrangements differing from the reference were also identified, with potential implications for pathogenicity. Overall, improving our knowledge of these complex gene families may provide insights into pathogenicity and inform the development of much-needed tools for tuberculosis control.