School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China.
National Engineering Research Center of Cereal Fermentation and Food Biomanufacturing, State Key Laboratory of Food Science and Technology, School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae157.
Microbial community analysis studies the composition and function of microbial communities. Microbial species annotation is crucial to revealing the complex ecological functions of microorganisms in environmental, ecological and host interactions. Widely used methods can suffer from inaccurate species-level annotation and from time and memory constraints, and as sequencing technology advances and sequencing costs decline, annotation methods with higher classification quality become critical. We therefore processed 16S rRNA gene sequences into k-mer sets and used a trained DNABERT model to generate word vectors. We also designed a parallel network structure consisting of deep and shallow modules to extract the semantic and detailed features of 16S rRNA gene sequences. Our method accurately and rapidly classifies bacterial sequences from the SILVA database at the genus and species levels. The database is characterized by long sequences (1500 base pairs), a large number of sequences (428,748 reads) and high sequence similarity. At the species level, the method is nearly 20% more accurate than the currently popular naive Bayes-based QIIME 2 annotation method, and its top-5 results differ from those of BLAST methods by <2%. In summary, our multi-module deep learning approach overcomes the limitations of existing methods, providing an efficient and accurate solution for microbial species labeling and more reliable data support for microbiology research and application.
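The k-mer preprocessing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the value of k, the stride, or how the RNA alphabet is handled, so the choices below (k = 6, overlapping k-mers with stride 1, mapping U to T) are assumptions for illustration only, and `seq_to_kmers` is a hypothetical helper rather than the authors' code.

```python
def seq_to_kmers(sequence: str, k: int = 6) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1).

    Assumptions (not taken from the abstract): k defaults to 6, and
    RNA-alphabet input is normalized by mapping U to T.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper().replace("U", "T")
    # A sequence of length n yields n - k + 1 k-mers; none if n < k.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Short illustrative fragment (20 nucleotides), not a real database record.
fragment = "AGAGTTTGATCCTGGCTCAG"
tokens = seq_to_kmers(fragment, k=6)

print(len(tokens))           # 15
print(" ".join(tokens[:3]))  # AGAGTT GAGTTT AGTTTG
```

Joining the k-mers with spaces gives a "sentence" of tokens that a k-mer-based language model can embed. At the 1500 base pair length reported for the database, a stride-1 split with k = 6 would produce 1495 tokens per sequence.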