Grant Avery R, Johnson Kevin P, Stanley Edward L, Baldwin-Brown James, Kolenčík Stanislav, Allen Julie M
Department of Biology, University of Nevada, Reno, Reno, NV, USA.
Illinois Natural History Survey, Prairie Research Institute, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
Bioinform Biol Insights. 2024 Jun 9;18:11779322241257991. doi: 10.1177/11779322241257991. eCollection 2024.
Nucleotide base composition plays an influential role in the molecular mechanisms involved in gene function, phenotype, and amino acid composition. GC content (proportion of guanine and cytosine in DNA sequences) shows a high level of variation within and among species. Many studies measure GC content in a small number of genes, which may not be representative of genome-wide GC variation. One challenge when assembling extensive genomic data sets for these studies is the significant amount of resources (monetary and computational) associated with data processing, and many bioinformatic tools have not been optimized for resource efficiency. Using a high-performance computing (HPC) cluster, we manipulated resources provided to the targeted gene assembly program, automated target restricted assembly method (aTRAM), to determine an optimal way to run the program to maximize resource use. Using our optimal assembly approach, we assembled and measured GC content of all of the protein-coding genes of a diverse group of parasitic feather lice. Of the 499 426 genes assembled across 57 species, feather lice were GC-poor (mean GC = 42.96%) with a significant amount of variation within and between species (GC range = 19.57%-73.33%). We found a significant correlation between GC content and standard deviation per taxon for overall GC and GC3, which could indicate selection for G and C nucleotides in some species. Phylogenetic signal of GC content was detected in both GC12 and GC3. This research provides a large-scale investigation of GC content in parasitic lice, laying the foundation for understanding the basis of variation in base composition across species.
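The GC content measure defined in the abstract, and the per-taxon mean and standard deviation it refers to, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `gc_content` and `summarize_taxon`, the choice to ignore ambiguous bases such as N, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev


def gc_content(seq: str) -> float:
    """Return the percentage of G and C among unambiguous bases (A, C, G, T).

    Assumption: ambiguous characters (e.g. N) and gaps are excluded
    from the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


def summarize_taxon(genes: dict[str, str]) -> tuple[float, float]:
    """Return (mean GC %, sample standard deviation of GC %) across genes.

    `genes` maps a hypothetical gene identifier to its nucleotide sequence.
    With fewer than two genes the standard deviation is reported as 0.0.
    """
    values = [gc_content(s) for s in genes.values()]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

For example, `summarize_taxon({"gene1": "ATGC", "gene2": "GGCC"})` averages a 50% GC gene and a 100% GC gene for one taxon; repeating this per species gives the per-taxon means and standard deviations whose relationship the abstract describes.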