A pangenome analysis of ESKAPE bacteriophages: the underrepresentation may impact machine learning models.
Author information
Lee Jeesu, Hunter Branden, Shim Hyunjin
Affiliations
Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, Republic of Korea.
Department of Biology, California State University, Fresno, CA, United States.
Publication information
Front Mol Biosci. 2024 Jun 21;11:1395450. doi: 10.3389/fmolb.2024.1395450. eCollection 2024.
Bacteriophages are the most prevalent biological entities in the biosphere. However, limitations in both medical relevance and sequencing technologies have led to a systematic underestimation of the genetic diversity within phages. This underrepresentation not only creates a significant gap in our understanding of phage roles across diverse biosystems but also introduces biases in computational models reliant on these data for training and testing. In this study, we focused on publicly available genomes of bacteriophages infecting high-priority ESKAPE pathogens to show the extent and impact of this underrepresentation. First, we demonstrate a stark underrepresentation of ESKAPE phage genomes within the public genome and protein databases. Next, a pangenome analysis of these ESKAPE phages reveals extensive sharing of core genes among phages infecting the same host. Furthermore, genome analyses and clustering highlight close nucleotide-level relationships among the ESKAPE phages, raising concerns about the limited diversity within current public databases. Lastly, we uncover a scarcity of unique lytic phages and phage proteins with antimicrobial activities against ESKAPE pathogens. This comprehensive analysis of the ESKAPE phages underscores the severity of underrepresentation and its potential implications. This lack of diversity in phage genomes may restrict the resurgence of phage therapy and cause biased outcomes in data-driven computational models due to incomplete and unbalanced biological datasets.
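The notion of core genes shared among phages infecting the same host can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `classify_gene_families`, the `core_fraction` parameter, and the toy gene-family names are all hypothetical, and real pangenome tools first cluster proteins into families, which is assumed here to have been done already.

```python
from collections import Counter


def classify_gene_families(genomes, core_fraction=1.0):
    """Split gene families into core and accessory sets.

    genomes: dict mapping a genome name to the set of gene-family
        identifiers found in that genome (assumed pre-clustered).
    core_fraction: minimum fraction of genomes that must carry a family
        for it to count as core (1.0 means strict core; illustrative default).
    """
    if not genomes:
        return set(), set()
    # Count in how many genomes each gene family occurs.
    counts = Counter(
        family for families in genomes.values() for family in set(families)
    )
    threshold = core_fraction * len(genomes)
    core = {family for family, n in counts.items() if n >= threshold}
    accessory = set(counts) - core
    return core, accessory


# Toy, made-up data: three phages of one hypothetical host.
toy_phages = {
    "phage_A": {"capsid", "terminase", "endolysin"},
    "phage_B": {"capsid", "terminase", "integrase"},
    "phage_C": {"capsid", "terminase", "endolysin", "holin"},
}

core, accessory = classify_gene_families(toy_phages)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

When the genomes in a database are close relatives, most families land in the core set, which is one way the limited diversity described above would show up in such an analysis.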