Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples.

Author information

Song Kai

Affiliation

School of Mathematics and Statistics, Qingdao University, Qingdao, China.

Publication information

Front Microbiol. 2021 May 21;12:664560. doi: 10.3389/fmicb.2021.664560. eCollection 2021.

Abstract

Metagenomes can be considered mixtures of viral, bacterial, and eukaryotic DNA sequences. Mining viral sequences from metagenomes could shed light on virus-host relationships and expand viral databases. Current alignment-based methods are poorly suited to identifying viral sequences in metagenomic data because most assembled metagenomic contigs are short and contain few or no predicted genes, and most metagenomic viral genes are dissimilar to known viral genes. In this study, I developed a Markov model-based method, VirMC, to identify viral sequences from metagenomic data. VirMC uses Markov chains to model sequence signatures and constructs a scoring model, based on a likelihood test, to distinguish viral from bacterial sequences. Compared with two other state-of-the-art viral sequence-prediction methods, VirFinder and PPR-Meta, VirMC outperformed VirFinder and performed similarly to PPR-Meta on short contigs (shorter than 400 bp). VirMC outperformed both VirFinder and PPR-Meta in identifying viral sequences in metagenomic samples contaminated with eukaryotic sequences. VirMC also showed better performance in assembling viral genome sequences from metagenomic data, based on filtering out potential bacterial reads. Applying VirMC to human gut metagenomes from healthy subjects and patients with type-2 diabetes (T2D) revealed that viral contigs could help distinguish healthy from diseased status. This alignment-free method complements gene-based alignment approaches and will significantly improve the precision of viral sequence identification.
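
The scoring idea described in the abstract (Markov chains fitted to sequence signatures, then a likelihood-based comparison of a viral and a bacterial model) can be illustrated with a minimal sketch. This is not the VirMC implementation: the chain order `k`, the add-one pseudocount, the function names, and the toy training sequences are all assumptions chosen for illustration, and the abstract does not state which order or threshold VirMC uses.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(seqs, k=2, pseudocount=1.0):
    """Estimate log P(next base | previous k bases) from training sequences.

    k and pseudocount are illustrative assumptions, not values from the paper.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            ctx, nxt = s[i - k:i], s[i]
            # Skip windows containing ambiguous bases such as N.
            if set(ctx + nxt) <= set(ALPHABET):
                counts[ctx][nxt] += 1
    model = {}
    for ctx_tuple in product(ALPHABET, repeat=k):
        ctx = "".join(ctx_tuple)
        total = sum(counts[ctx].values()) + pseudocount * len(ALPHABET)
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in ALPHABET
        }
    return model


def log_likelihood(seq, model, k=2):
    """Sum the log transition probabilities along seq under one model."""
    seq = seq.upper()
    ll = 0.0
    for i in range(k, len(seq)):
        ctx, nxt = seq[i - k:i], seq[i]
        if ctx in model and nxt in model[ctx]:
            ll += model[ctx][nxt]
    return ll


def llr_score(seq, viral_model, bacterial_model, k=2):
    """Log-likelihood ratio: positive values favor the viral model."""
    return log_likelihood(seq, viral_model, k) - log_likelihood(seq, bacterial_model, k)


# Toy training data (hypothetical; real models would be trained on reference genomes).
viral_model = train_markov(["ATATATATATATATAT", "TATATATATATA"])
bacterial_model = train_markov(["GCGCGCGCGCGCGC", "CGCGCGCGCGCG"])

for contig in ["ATATATAT", "GCGCGCGC"]:
    label = "viral-like" if llr_score(contig, viral_model, bacterial_model) > 0 else "bacterial-like"
    print(contig, label)
```

In a read-filtering setting such as the one the abstract mentions, reads scoring below a chosen threshold would be discarded as potentially bacterial before assembly; the threshold of zero used here is again only an assumption.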

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6d4b/8175635/704c8b09bdad/fmicb-12-664560-g001.jpg
