Annotation-free prediction of microbial dioxygen utilization.

Affiliations

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA.

Division of Geological & Planetary Sciences, California Institute of Technology, Pasadena, California, USA.

Publication information

mSystems. 2024 Oct 22;9(10):e0076324. doi: 10.1128/msystems.00763-24. Epub 2024 Sep 4.

DOI:10.1128/msystems.00763-24

PMID:39230322

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11494890/

Abstract

Aerobes require dioxygen (O₂) to grow; anaerobes do not. However, nearly all microbes (aerobes, anaerobes, and facultative organisms alike) express enzymes whose substrates include O₂, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O₂ utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O₂-utilizing enzymes, for example. These effects permit high-quality prediction of O₂ utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content (e.g., triplets of amino acids) perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O₂ gradient in the Black Sea, we found quantitative correspondence between local chemistry (O₂:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or "sense," pivotal features of the chemical environment using DNA sequencing data.

Importance

We now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology, an organism's capacity to engage metabolically with its environment, may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without annotating genes, e.g., by considering only the amino acids in protein sequences. Annotation-free algorithms enable rapid characterization of natural samples, highlighting quantitative correspondence between sequences and local O₂ levels in a data set from the Black Sea. This example suggests that DNA sequencing might be repurposed as a multi-pronged chemical sensor, estimating concentrations of O₂ and other key facets of complex natural settings.
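
To make the idea of amino acid trimer features concrete, the following is a minimal sketch of how overlapping trimers in a set of protein sequences could be turned into a fixed-length frequency vector for a classifier. It is an illustration only, not the authors' pipeline: the function names `trimer_frequencies` and `feature_vector`, the restriction to the 20 standard residues, and the choice to skip trimers containing other characters are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VALID = set(AMINO_ACIDS)


def trimer_frequencies(proteins):
    """Relative frequencies of overlapping amino acid trimers.

    `proteins` is an iterable of protein sequences (strings).
    Trimers containing any non-standard character are skipped.
    """
    counts = Counter()
    for seq in proteins:
        seq = seq.upper()
        for i in range(len(seq) - 2):
            trimer = seq[i:i + 3]
            if set(trimer) <= VALID:
                counts[trimer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trimer: n / total for trimer, n in counts.items()}


def feature_vector(proteins):
    """Fixed-order vector of 20**3 = 8000 trimer frequencies."""
    freqs = trimer_frequencies(proteins)
    return [freqs.get("".join(t), 0.0)
            for t in product(AMINO_ACIDS, repeat=3)]


if __name__ == "__main__":
    # Toy input; real genomes would contribute thousands of proteins.
    toy_proteins = ["MKVLA", "MKV"]
    print(trimer_frequencies(toy_proteins))
    print(len(feature_vector(toy_proteins)))
```

A vector like this, computed per genome, could then be passed to any standard classifier trained on genomes labeled as aerobe, anaerobe, or facultative; the choice of classifier is left open here because the abstract does not specify it.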
