An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems.

Affiliations

ICAR-Indian Agricultural Research Institute, New Delhi 110012, India.

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

Genes (Basel). 2023 May 14;14(5):1082. doi: 10.3390/genes14051082.

Abstract

Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify microbial communities with high accuracy in many ecosystems. Conventionally, rule-based binning techniques are used to classify contigs or scaffolds based on either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify the newly identified unknown microbes. Clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train the MLAs and develop prediction models for classifying unknown metagenomic sequences. We used metagenomic datasets of samples collected from the Ganga (Kanpur and Farakka) and Yamuna (Delhi) rivers in India for clustering and for training the MLA models, and evaluated the performance of the MLAs by 10-fold cross-validation. The Random Forest model outperformed the other learning algorithms considered. The proposed method can be used to annotate metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. Source code for an offline predictor with the best prediction model is available at https://github.com/Nalinikanta7/metagenomics.
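The composition-based binning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature choice (dinucleotide frequencies), the plain Lloyd-style K-Means with a deterministic farthest-point initialization, the toy sequences, and the function names `kmer_profile`, `squared_distance` and `kmeans` are all assumptions made for illustration. The paper's "iterative" K-Means variant and its actual features may differ.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequencies over the ACGT alphabet (hypothetical feature choice)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, max_iter=100):
    """Plain Lloyd-style K-Means; returns one cluster label per point."""
    # Deterministic farthest-point initialization (an assumption, chosen
    # so that the toy example below is reproducible without a random seed).
    centroids = [list(points[0])]
    while len(centroids) < n_clusters:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy scaffolds (made up for illustration): two AT-rich, two GC-rich.
scaffolds = [
    "ATATTAATATTTAAATATAT",
    "TTAATATAATTATATAAATT",
    "GCGCCGGCGGCCGCGCCGCG",
    "CCGGCGCGGCCGGCGCGGCC",
]
profiles = [kmer_profile(s) for s in scaffolds]
labels = kmeans(profiles, n_clusters=2)
print(labels)
```

In a workflow like the one the abstract outlines, the resulting bins would then be annotated (for example by similarity search) and the labeled feature vectors used to train and cross-validate a classifier; those steps are left out of this sketch.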

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/00c2/10218686/21254b67504f/genes-14-01082-g001.jpg
