School of Computer Science, University of Waterloo, Waterloo, ON, Canada.
Department of Biology, University of Western Ontario, London, ON, Canada.
Sci Rep. 2023 Sep 26;13(1):16105. doi: 10.1038/s41598-023-42518-y.
This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of [Formula: see text] extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, [Formula: see text]. The supervised learning resulted in high accuracies for taxonomic classifications at [Formula: see text], and medium to medium-high accuracies for environment category classifications of the same datasets at [Formula: see text]. For [Formula: see text], our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
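As an illustration of the genomic signature described above, the following is a minimal sketch of a k-mer frequency vector computed from a DNA fragment. The function name `kmer_frequency_vector`, the lexicographic ordering of k-mers, the normalization to relative frequencies, and the skipping of windows with ambiguous bases are all assumptions made for this example; the abstract does not specify them, and the values of k used in the study are not reproduced here.

```python
from itertools import product


def kmer_frequency_vector(sequence, k):
    """Return relative k-mer frequencies of a DNA sequence (hypothetical helper).

    Assumptions, not taken from the paper: k-mers are ordered
    lexicographically over A, C, G, T; windows containing any other
    symbol (e.g. N) are skipped; counts are normalized to sum to 1.
    """
    alphabet = "ACGT"
    # Fix one position per possible k-mer so vectors are comparable across genomes.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)

    seq = sequence.upper()
    # Slide a window of width k over the fragment, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows with ambiguous bases
            counts[index[kmer]] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Toy fragment; a real input would be a 500 kbp stretch of a genome.
vec = kmer_frequency_vector("ACGTACGTNACGT", k=2)
```

The resulting vector has 4^k entries, so it can be passed directly as a fixed-length feature vector to a classifier or a clustering algorithm.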