Compression rates of microbial genomes are associated with genome size and base composition.

Authors

Bohlin Jon, Pettersson John H-O

Affiliations

Norwegian Institute of Public Health, Domain for Infection Control, Section for Modeling and Bioinformatics, Oslo, Norway.

Zoonosis Science Center, Clinical Microbiology, Department of Medical Sciences, University of Uppsala, 751 85, Uppsala, Sweden.

Publication information

Genomics Inform. 2024 Oct 10;22(1):16. doi: 10.1186/s44342-024-00018-z.

Abstract

BACKGROUND

The degree to which a string of symbols can be compressed reveals important details about its complexity. For instance, incompressible strings are random and carry a low information potential, while the opposite is true for highly compressible strings. We explore to what extent microbial genomes are amenable to compression, as they vary considerably with respect to both size and base composition. Microbial genome sizes range from fewer than 100,000 base pairs in symbionts to more than 10 million in soil-dwellers. Genomic base composition, often summarized as genomic AT or GC content because adenine and thymine occur at similar frequencies, as do cytosine and guanine, also varies substantially; the most extreme microbial genomes have an AT content below 25% or above 85%. Base composition determines the frequency of DNA words (oligonucleotides, that is, short runs of multiple nucleotides) and may therefore also influence compressibility. Using 4,713 RefSeq genomes and generalized additive models, we examined the association between compressibility, measured with both a DNA-based (MBGC) and a general-purpose (ZPAQ) compression algorithm, and genome size, AT content, and genomic oligonucleotide usage variance (OUV).
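The two simplest quantities in this paragraph, AT content and compressibility, can be illustrated with a short sketch. This is not the study's pipeline: Python's standard-library `zlib` is used only as a stand-in for a general-purpose compressor (the study used ZPAQ and MBGC), and the function names and synthetic sequences are assumptions made for illustration.

```python
import random
import zlib


def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return at / total


def compression_ratio(seq: str) -> float:
    """Compressed size divided by original size in bytes (lower = more redundant).

    zlib is an assumption of this sketch, standing in for a
    general-purpose compressor such as ZPAQ.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("empty sequence")
    return len(zlib.compress(raw, 9)) / len(raw)


# Two synthetic sequences of equal length: one random, one highly repetitive.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(20_000))
repetitive_seq = "ATGCGTAC" * 2_500

ratio_random = compression_ratio(random_seq)
ratio_repetitive = compression_ratio(repetitive_seq)
```

The repetitive sequence should compress far better than the random one, which mirrors the link between redundancy and compressibility described above. Real genomes would be read from FASTA files rather than generated.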

RESULTS

We find that genome size (p < 0.001) and OUV (p < 0.001) are both strongly associated with genome redundancy for both types of file compressor. The DNA-based MBGC compressor improved compression by approximately 3% on average relative to ZPAQ. Moreover, MBGC detected a significant (p < 0.001) difference in compression ratio between AT-poor and AT-rich genomes that was not detected with ZPAQ.
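The abstract does not define OUV, so the following sketch assumes one simple reading: the variance of the differences between observed k-mer frequencies and the frequencies expected from base composition alone (independent bases). The function name `ouv`, the default `k = 4`, and the synthetic sequences are all assumptions of the sketch, not the authors' implementation.

```python
import random
from collections import Counter
from itertools import product
from statistics import pvariance


def ouv(seq: str, k: int = 4) -> float:
    """Variance of (observed - expected) k-mer frequencies.

    Expected frequencies assume independent bases drawn from the
    sequence's own base composition; this definition is an assumption.
    """
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in "ACGT")
    n_bases = sum(base_counts.values())
    if n_bases == 0:
        raise ValueError("sequence contains no A, C, G or T")
    base_freq = {b: base_counts[b] / n_bases for b in "ACGT"}

    # Count overlapping k-mers, skipping any that contain ambiguous bases.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    kmers = [w for w in kmers if set(w) <= set("ACGT")]
    if not kmers:
        raise ValueError("sequence is shorter than k or has no valid k-mers")
    observed = Counter(kmers)
    total = len(kmers)

    deviations = []
    for word in map("".join, product("ACGT", repeat=k)):
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        deviations.append(observed[word] / total - expected)
    return pvariance(deviations)


rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(20_000))
repetitive_seq = "ATGCGTAC" * 2_500

ouv_random = ouv(random_seq)
ouv_repetitive = ouv(repetitive_seq)
```

Under this definition, the repetitive sequence uses only a handful of tetranucleotides and so deviates strongly from its base-composition expectation, while the random sequence stays close to it.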

CONCLUSION

As lack of compressibility is equivalent to randomness, our findings suggest that smaller and AT-rich genomes may have accumulated more random mutations on average than larger and AT-poor genomes, which, in turn, were significantly more redundant. Moreover, we find that OUV is a strong proxy for genome compressibility in microbial genomes. The ZPAQ compressor agreed with the MBGC compressor, albeit with poorer performance, except with respect to the compressibility of AT-rich versus AT-poor/GC-rich genomes.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e3c6/11468749/5c9d8f89d9f4/44342_2024_18_Fig1_HTML.jpg
