ICAR-Indian Agricultural Research Institute, New Delhi 110012, India.
ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.
Genes (Basel). 2023 May 14;14(5):1082. doi: 10.3390/genes14051082.
Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify microbial communities with high accuracy in many ecosystems. Conventionally, rule-based binning techniques classify contigs or scaffolds by either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify newly identified unknown microbes. Clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train the MLAs and develop prediction models for classifying unknown metagenomic sequences. Metagenomic datasets from samples collected from the Ganga (Kanpur and Farakka) and Yamuna (Delhi) rivers in India were used for clustering and for training the MLA models. The performance of the MLAs was evaluated by 10-fold cross-validation. The Random Forest-based model outperformed the other learning algorithms considered. The proposed method can be used to annotate metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. Source code for an offline predictor with the best prediction model is available at https://github.com/Nalinikanta7/metagenomics.
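The composition-based binning step described above can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the linked repository). It assumes that each scaffold is summarised by its normalised k-mer frequencies and that a plain K-Means loop groups those profiles. The function names (`kmer_profile`, `kmeans`), the choice of k = 2, the farthest-point seeding and the toy sequences are all hypothetical and chosen only for illustration.

```python
import itertools


def kmer_profile(seq, k=2):
    """Return normalised k-mer frequencies over the ACGT alphabet."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, max_iter=100):
    """Plain K-Means with deterministic farthest-point seeding (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < n_clusters:
        # Next seed: the point farthest from its nearest existing centroid.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy scaffolds (made up): two AT-rich and two GC-rich sequences.
sequences = ["ATATATATATTA", "TTAATATATAAT", "GCGCGGCCGCGC", "CCGGCGCGCGGC"]
profiles = [kmer_profile(s) for s in sequences]
labels, centroids = kmeans(profiles, n_clusters=2)
print(labels)
```

In the workflow the abstract describes, clusters such as these would next be annotated (there, with BLAST), and the annotated sequences would serve as labelled training data for the classifiers.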