GSK, Siena, Italy.
Present address: Department of Experimental Oncology, European Institute of Oncology, Milan, Italy.
BMC Bioinformatics. 2019 Nov 22;20(Suppl 9):347. doi: 10.1186/s12859-019-2887-1.
Multi-locus sequence typing (MLST) is a standard typing technique used to assign a sequence type (ST) to a bacterial isolate. When whole genome sequencing (WGS) output is available for a sample, the ST can be assigned directly by processing the read set. Current approaches rely on read mapping against the MLST loci (SRST2), k-mer distribution (stringMLST), selective assembly (GRAbB) or whole genome assembly (BIGSdb) followed by a BLASTn sequence query. Here we present STRAIN (ST Reduced Assembly IdentificatioN), an R package that implements a hybrid strategy, combining assembly and mapping of the reads, to assign the ST to an isolate starting from its read sets.
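As a rough illustration of what assigning an ST involves once the alleles at each locus are known, the sketch below looks up a complete allele profile in a profile table. It is not STRAIN's implementation or API: the locus names, the three-locus scheme, the profile table and the `assign_st` function are all hypothetical, chosen only to keep the example short.

```python
from typing import Dict, Optional, Tuple

# Hypothetical three-locus scheme, used only for illustration.
LOCI: Tuple[str, ...] = ("locusA", "locusB", "locusC")

# Hypothetical profile table: allele numbers (in LOCI order) -> sequence type.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1): 1,
    (1, 2, 1): 2,
    (3, 2, 4): 7,
}


def assign_st(alleles: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the ST for a complete allele profile.

    Returns None when any locus has no known allele number (for example
    a new allele) or when the combination is not in the profile table.
    """
    profile = tuple(alleles.get(locus) for locus in LOCI)
    if any(allele is None for allele in profile):
        return None
    return PROFILES.get(profile)


if __name__ == "__main__":
    print(assign_st({"locusA": 1, "locusB": 2, "locusC": 1}))
    print(assign_st({"locusA": 1, "locusB": None, "locusC": 1}))
```

In this sketch a missing allele number stands in for a new allele, which is why the lookup returns no ST in that case; the hard part that the tools above address is calling the alleles from raw reads in the first place.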
Analysis of 540 publicly accessible Illumina read sets showed STRAIN to be more accurate than SRST2, stringMLST and GRAbB at assigning correct alleles and identifying new alleles. STRAIN correctly assigned 3666 out of 3780 alleles (capability to identify correct alleles 97%) and, when presented with samples containing new alleles, identified them in 3730 out of 3780 STs (capability to identify new alleles 98.7%). On the same dataset, the other tested tools achieved a lower capability to identify correct alleles (from 28.5 to 96.9%) and a lower capability to identify new alleles (from 1.1 to 97.1%).
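The reported percentages follow directly from the counts given above, as the short check below shows. The only names introduced here are the hypothetical helper `percent` and the variable names. That 3780 corresponds to 540 read sets times seven loci is an inference from the numbers, not something the abstract states.

```python
def percent(part: int, total: int) -> float:
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


READ_SETS = 540   # read sets analysed (from the text)
TOTAL = 3780      # denominator used for both metrics (from the text)

# 3780 / 540 is exactly 7, consistent with a seven-locus scheme (an inference).
loci_per_read_set = TOTAL / READ_SETS

correct_alleles = percent(3666, TOTAL)  # capability to identify correct alleles
new_alleles = percent(3730, TOTAL)      # capability to identify new alleles

if __name__ == "__main__":
    print(loci_per_read_set, correct_alleles, new_alleles)
```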
STRAIN is a new, accurate method for assigning alleles and the ST to an isolate by processing the raw read output of WGS. STRAIN can also retrieve new allele sequences when they are present. Its capability to identify correct and new STs/alleles, evaluated on a benchmark dataset, is higher than that of other existing methods. STRAIN is designed for single allele typing as well as MLST. Its implementation in R makes allele and ST assignment simple and direct, and easy to integrate into wider pipelines of downstream bioinformatics analyses.