Laboratorio de Farmacogenómica, Instituto Nacional de Medicina Genómica (INMEGEN), Ciudad de México, México.
Instituto de Investigaciones Biológicas, Universidad Veracruzana, Xalapa, Veracruz, México.
PLoS One. 2021 Oct 26;16(10):e0258774. doi: 10.1371/journal.pone.0258774. eCollection 2021.
Next-generation sequencing (NGS) is widely used to investigate genomic variation. Several studies have analyzed the genetic variation of Mycobacterium tuberculosis directly in sputum samples, without prior culture, using target enrichment methodologies for NGS. Alignment programs are generally run with default parameters, and the resulting alignments are assumed to contain only Mycobacterium reads. However, in clinical samples, variants of the microorganism of interest can be confounded by a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the success of read cleaning is never verified, because rigorous controls are lacking to identify and remove reads from other sputum microorganisms that are genetically similar to M. tuberculosis. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points that identify and eliminate non-M. tuberculosis reads, to obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and performs a mapping that yields the highest percentage of Mycobacterium reads while minimizing reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, applying mapping-quality control, realignment, recalibration, hard-filtering, and a post-filter to increase the reliability of the reported variants. Using default mapping parameters, we identified reads from contaminant bacteria such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a loss of 1.7% of Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during variant calling, increasing the report's reliability.
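The edit-distance criterion mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which tool applied the threshold, so the sketch assumes the aligner writes the standard SAM `NM:i:` tag (edit distance to the reference) and simply keeps mapped reads whose value is at most three. The function names and the `max_nm` parameter are hypothetical.

```python
def edit_distance_from_sam(line):
    """Return the NM (edit distance) tag of a SAM alignment line, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional tags start at column 12.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[len("NM:i:"):])
    return None


def is_mapped(line):
    """True when the SAM FLAG does not have the 0x4 (segment unmapped) bit set."""
    flag = int(line.split("\t")[1])
    return not (flag & 0x4)


def filter_by_edit_distance(sam_lines, max_nm=3):
    """Keep header lines and mapped reads with edit distance <= max_nm.

    Assumption: max_nm=3 mirrors the "genomic edit distance of three"
    reported in the abstract; reads lacking an NM tag are discarded.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        if not is_mapped(line):
            continue
        nm = edit_distance_from_sam(line)
        if nm is not None and nm <= max_nm:
            kept.append(line)
    return kept
```

A stricter threshold discards more off-target reads but also more genuine Mycobacterium reads, which is the trade-off behind the reported 98.8% off-target removal and 1.7% read loss.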