Madival Sharanbasappa D, Mishra Dwijesh Chandra, Sharma Anu, Kumar Sanjeev, Maji Arpan Kumar, Budhlakoti Neeraj, Sinha Dipro, Rai Anil
Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi- 110012, India.
Division of Computer Applications, ICAR-IASRI, New Delhi- 110012, India.
Curr Genomics. 2022 Nov 18;23(5):353-368. doi: 10.2174/1389202923666220928150100.
One major challenge in binning metagenomic data is the limited availability of reference datasets, as only about 1% of the total microbial population has been cultured so far. This has motivated the use of unsupervised binning methods, which do not require any reference dataset.
To develop a deep clustering-based binning approach for Metagenomics data and to evaluate results with suitable measures.
In this study, a deep learning-based approach is used for binning metagenomic data. The results are validated on different datasets using features such as tetra-nucleotide frequency (TNF), hexa-nucleotide frequency (HNF), and GC content. A convolutional autoencoder is used for feature extraction, and the K-means clustering method is used for binning.
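The composition features named above can be illustrated with a minimal sketch. It assumes plain (non-canonical) k-mer counting over the four bases, with k=4 for TNF and k=6 for HNF; whether the study merges reverse-complement k-mers or normalizes differently is not stated here, so those choices and all function names are hypothetical. The convolutional autoencoder and K-means steps are not shown.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequencies in lexicographic k-mer order.

    k=4 gives a 256-dimensional TNF vector, k=6 a 4096-dimensional HNF vector.
    Assumption: reverse complements are NOT merged.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other ambiguity codes
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def contig_features(seq):
    """Hypothetical per-contig feature vector: TNF followed by GC content."""
    return kmer_frequencies(seq, k=4) + [gc_content(seq)]
```

Each contig would be mapped to one such vector before being passed to the feature extractor; swapping `k=4` for `k=6` yields the HNF variant.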
In most cases, the Silhouette index and Rand index exceed 0.5 and 0.8, respectively, indicating that the proposed approach gives satisfactory results. The performance of the developed approach is compared with current methods and tools using benchmark low-complexity simulated and real metagenomic datasets. It performs better than unsupervised methods and on par with semi-supervised methods.
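For readers unfamiliar with the two evaluation measures, the following is a minimal sketch of how they can be computed. It assumes the plain (unadjusted) Rand index and a Euclidean-distance Silhouette index averaged over all points; the exact variants used in the study are not specified here, and the function names are illustrative only.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which the two labelings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def silhouette_index(points, labels):
    """Mean silhouette score; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

The Rand index needs reference labels, so it applies to simulated or otherwise labeled datasets, whereas the Silhouette index is computed from the feature vectors and predicted bins alone.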
An unsupervised deep learning-based approach for binning has been proposed, and the developed method shows promising results on various datasets. It is a novel approach to the lack of reference data in metagenomic binning.