Binning Metagenomic Contigs Using Contig Embedding and Decomposed Tetranucleotide Frequency.

Author information

Fu Long, Shi Jiabin, Huang Baohua

Affiliations

School of Computer and Electronic Information, Guangxi University, Nanning 530004, China.

Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning 530004, China.

Publication information

Biology (Basel). 2024 Sep 24;13(10):755. doi: 10.3390/biology13100755.

Abstract

Metagenomic binning is a crucial step in metagenomic research: it groups the genome sequences that belong to the same microbial species into separate bins. Most existing methods ignore the semantic information of contigs and do not process tetranucleotide frequency effectively, so the features extracted for binning are both insufficient and overly complex, and the binning results are poor. To address these problems, we propose CedtBin, a metagenomic binning method based on contig embedding and decomposed tetranucleotide frequency. First, an improved BERT model is used to learn embedding representations of the contigs. Second, the tetranucleotide frequencies are decomposed using a non-negative matrix factorization (NMF) algorithm. The two features are then concatenated and passed to the clustering algorithm for binning. Because the DBSCAN clustering algorithm is sensitive to its input parameters, and to avoid the drawbacks of setting them manually, we also propose Annoy-DBSCAN, an algorithm that adaptively determines the DBSCAN parameters. It combines Approximate Nearest Neighbors Oh Yeah (Annoy) with a grid search strategy to find the optimal DBSCAN parameters. On simulated and real datasets, CedtBin achieves better binning results than mainstream methods and reconstructs more genomes, indicating that the proposed method is effective.
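
To make the second step more concrete, the following is a minimal sketch of computing tetranucleotide frequencies and decomposing them with NMF. It is not the CedtBin implementation. The abstract names NMF but gives no solver or settings, so the Lee-Seung multiplicative update rule, the rank of 2, the iteration count, and the toy contigs are all assumptions made for illustration. All function names are hypothetical.

```python
import itertools
import random

# All 256 tetranucleotides in a fixed order; the index defines the feature column.
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequency(contig):
    """Return a 256-dimensional vector of normalized 4-mer counts for one contig."""
    counts = [0.0] * len(KMERS)
    seq = contig.upper()
    for i in range(len(seq) - 3):
        # Windows containing N or any other non-ACGT symbol are skipped.
        idx = KMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=200, seed=0, eps=1e-12):
    """Factor non-negative V (n x m) as W (n x rank) times H (rank x m).

    Uses multiplicative updates for the squared Frobenius loss (an assumed
    solver; the source does not say which one the authors use).
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps) for j in range(m)]
             for r in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + eps) for r in range(rank)]
             for i in range(n)]
    return w, h


# Toy contigs (made up for illustration): two AT/CG-mixed, two GC-rich.
contigs = [
    "ACGTACGTACGTACGTAAAC",
    "ACGTACGTACGTTTACGTAC",
    "GGGGCCCCGGGGCCCCGGGC",
    "GGCCGGGGCCCCGGGGCCCC",
]
tnf = [tetranucleotide_frequency(c) for c in contigs]
w, h = nmf(tnf, rank=2)  # each row of w is a 2-dimensional reduced TNF feature
```

In this sketch, each row of `w` plays the role of the decomposed tetranucleotide feature for one contig. In a pipeline like the one the abstract describes, such a row would be concatenated with the contig's embedding vector before clustering.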

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88fc/11505167/7aad274f92b8/biology-13-00755-g001.jpg
