An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems.

Affiliations

ICAR-Indian Agricultural Research Institute, New Delhi 110012, India.

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

Genes (Basel). 2023 May 14;14(5):1082. doi: 10.3390/genes14051082.

DOI:10.3390/genes14051082

PMID:37239442

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10218686/

Abstract

Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify microbial communities with high accuracy in many ecosystems. Conventionally, rule-based binning techniques classify contigs or scaffolds by either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify newly identified unknown microbes. The clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train the MLAs and develop prediction models for classifying unknown metagenomic sequences. Metagenomic datasets from samples collected from the Ganga (Kanpur and Farakka) and the Yamuna (Delhi) rivers in India were used for clustering and for training the MLA models. The performance of the MLAs was evaluated by 10-fold cross-validation. The Random Forest model outperformed the other learning algorithms considered. The proposed method can be used to annotate metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. The source code of an offline predictor with the best prediction model is available on GitHub (https://github.com/Nalinikanta7/metagenomics).
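To make the composition-based binning step concrete, the following is a minimal sketch of clustering scaffolds by k-mer frequency with K-Means. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the features, the number of clusters, or how the K-Means iterations are organized, so the tetranucleotide profile (`k=4`), the plain Lloyd-style loop, the farthest-first initialization (chosen here only to keep the result deterministic), the toy sequences, and the function names `kmer_profile` and `kmeans` are all hypothetical.

```python
from itertools import product


def kmer_profile(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iter=100):
    """Lloyd-style K-Means with deterministic farthest-first initialization."""
    # Assumption: start from the first point, then repeatedly add the point
    # that lies farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < n_clusters:
        far = max(range(len(points)),
                  key=lambda i: min(sq_dist(points[i], c) for c in centroids))
        centroids.append(list(points[far]))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(n_clusters),
                          key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy scaffolds (hypothetical): two AT-rich and two GC-rich sequences.
scaffolds = [
    "ATATATATATATATAT",
    "TATATATATATATATA",
    "GCGCGCGCGCGCGCGC",
    "CGCGCGCGCGCGCGCG",
]
profiles = [kmer_profile(s, k=4) for s in scaffolds]
labels = kmeans(profiles, n_clusters=2)
print(labels)  # → [0, 0, 1, 1]
```

In a workflow like the one the abstract describes, the resulting cluster labels would then be annotated (for example by a BLAST search) and used as training targets for a classifier such as Random Forest. That supervised step and the 10-fold cross-validation are not shown here.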

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00c2/10218686/21254b67504f/genes-14-01082-g001.jpg
