Department of Electrical Engineering, City University of Hong Kong, Kowloon, China.
Microbiome. 2023 Aug 17;11(1):183. doi: 10.1186/s40168-023-01615-w.
Bacterial strains of the same species can exhibit different biological properties, making strain-level composition analysis an important step in understanding the dynamics of microbial communities. Metagenomic sequencing has become the major means of probing the microbial composition of host-associated and environmental samples. Although many composition analysis tools exist, they are not optimized for the challenges of strain-level analysis: highly similar strain genomes and the presence of multiple strains of one species in a sample. This work therefore aims to provide a higher-resolution and more accurate strain-level analysis tool for short reads.
In this work, we present a new strain-level composition analysis tool named StrainScan, which employs a novel tree-based k-mer indexing structure to balance strain identification accuracy against computational complexity. We tested StrainScan extensively on a large number of simulated and real sequencing datasets and benchmarked it against popular strain-level analysis tools, including Krakenuniq, StrainSeeker, Pathoscope2, Sigma, StrainGE, and StrainEst. The results show that StrainScan achieves higher accuracy and resolution than these state-of-the-art tools on strain-level composition analysis. It improves the F1 score by 20% when identifying multiple strains at the strain level.
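The idea of a hierarchical k-mer index can be illustrated with a toy two-level search: first decide which cluster of similar strains is present, then tell the cluster members apart using only the k-mers that differ between them. The sketch below is not StrainScan's implementation. The function names, the pre-assigned clusters, the `min_frac` threshold, and the example sequences are all hypothetical, and the sketch ignores reverse complements, sequencing errors, and abundance estimation.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(clusters, k):
    """Build a two-level index from {cluster: {strain: genome}} (hypothetical layout).

    Level 1 keeps the k-mers shared by every strain in a cluster.
    Level 2 keeps, per strain, the k-mers that are not shared by the whole cluster.
    """
    cluster_core = {}
    strain_specific = {}
    for cname, strains in clusters.items():
        sets = {name: kmers(genome, k) for name, genome in strains.items()}
        core = set.intersection(*sets.values())
        cluster_core[cname] = core
        if len(sets) == 1:
            # A single-strain cluster has nothing to distinguish; use all its k-mers.
            strain_specific[cname] = dict(sets)
        else:
            strain_specific[cname] = {n: ks - core for n, ks in sets.items()}
    return cluster_core, strain_specific


def fraction_found(wanted, seen):
    """Fraction of the wanted k-mers that occur in the sample (0.0 if none wanted)."""
    return len(wanted & seen) / len(wanted) if wanted else 0.0


def identify(reads, index, k, min_frac=0.9):
    """Return the sorted names of strains whose cluster and strain k-mers are present."""
    cluster_core, strain_specific = index
    seen = set()
    for read in reads:
        seen |= kmers(read, k)
    hits = []
    for cname, core in cluster_core.items():
        # Level 1: skip whole clusters whose shared k-mers are absent.
        if fraction_found(core, seen) < min_frac:
            continue
        # Level 2: only now compare the few k-mers that separate close strains.
        for sname, specific in strain_specific[cname].items():
            if fraction_found(specific, seen) >= min_frac:
                hits.append(sname)
    return sorted(hits)


# Toy reference set: A1 and A2 differ by a single base; B1 is unrelated.
clusters = {
    "A": {"A1": "ACGTACGGTCAATGCC", "A2": "ACGTACGGTCTATGCC"},
    "B": {"B1": "TTGGCCAAGGTTCCAA"},
}
index = build_index(clusters, k=5)
single = identify(["ACGTACGGTC", "ACGGTCAATGCC"], index, k=5)
mixed = identify(["ACGTACGGTCAATGCC", "ACGTACGGTCTATGCC"], index, k=5)
```

Skipping absent clusters at the first level is what keeps the search cheap, while the second level concentrates on the small set of k-mers that actually separates near-identical genomes, including the case where two strains of one cluster are present together.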
By using a novel k-mer indexing structure, StrainScan provides strain-level analysis at higher resolution than existing tools, allowing it to return more informative strain composition results for a single sample or across multiple samples. StrainScan takes short reads and a set of reference strains as input, and its source code is freely available at https://github.com/liaoherui/StrainScan .