CNN_FunBar: Advanced Learning Technique for Fungi ITS Region Classification.

Affiliation

Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

Genes (Basel). 2023 Mar 3;14(3):634. doi: 10.3390/genes14030634.

DOI:10.3390/genes14030634

PMID:36980906

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10048311/

Abstract

Identifying fungal species from metagenomic data is a highly challenging task. The internal transcribed spacer (ITS) region is a potential DNA marker for predicting fungal taxonomy. Computational approaches, especially deep learning algorithms, are highly efficient at pattern recognition and classification of large datasets compared with in silico techniques such as BLAST and with machine learning methods. In this study, we present CNN_FunBar, a convolutional neural network-based approach for classifying fungal ITS sequences from the UNITE+INSDC reference datasets. The effects of convolution kernel size, number of filters, k-mer size, degree of diversity and category-wise frequency of ITS sequences on the classification performance of CNN models were assessed at all taxonomic levels (species, genus, family, order, class and phylum). CNN models achieved >93% average accuracy at all levels when classifying ITS sequences from balanced datasets with 500 sequences per category and 6-mer frequency features. The comparative study showed that CNN_FunBar can outperform machine learning-based algorithms (SVM, KNN, Naïve-Bayes and Random Forest) as well as existing fungal taxonomy prediction software (funbarRF, Mothur, RDP Classifier and SINTAX). The present study will be helpful for fungal taxonomy classification using large metagenomic datasets.
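
The abstract names 6-mer frequency features as the input representation but does not describe how they are computed. As a minimal sketch of what such a feature vector could look like, the Python function below counts overlapping k-mers in a sequence and normalizes the counts to frequencies. The function name `kmer_frequencies`, the lexicographic ordering of k-mers, the normalization by the number of counted windows, and the choice to skip windows containing ambiguous bases (such as N) are all assumptions made for illustration, not details taken from CNN_FunBar.

```python
from itertools import product


def kmer_frequencies(seq, k=6, alphabet="ACGT"):
    """Return a normalized k-mer frequency vector for one DNA sequence.

    Hypothetical helper: the ordering, normalization and handling of
    ambiguous bases are illustrative assumptions, not the paper's method.
    """
    seq = seq.upper()
    # Enumerate all len(alphabet)**k possible k-mers in lexicographic order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # skip windows with bases outside the alphabet
            counts[position] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]
```

With k = 6 and the four-letter DNA alphabet, each sequence maps to a fixed-length vector of 4^6 = 4096 values regardless of the ITS sequence length, which is what makes such features convenient as input to a classifier.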