de Oliveira Raul Vitor Ferreira, Garrido Leandro Maza, Padilla Gabriel
Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo (USP), São Paulo, SP, 05508-900, Brazil.
Braz J Microbiol. 2025 Mar;56(1):79-89. doi: 10.1007/s42770-024-01598-2. Epub 2025 Jan 15.
Despite meticulous precautions, contamination of genomic DNA samples is not uncommon, and it can significantly compromise the analysis of microbial whole-genome sequencing data and all subsequent analyses. Thanks to advances in software and bioinformatics techniques, it is now possible to address this issue and avoid losing the entire dataset from a whole-genome sequencing run contaminated with the DNA of another bacterium. In this study, the sequencing reads from Streptomyces sp. BRB040, generated on the HiSeq System platform (Illumina Inc., San Diego, USA), were found to be contaminated with DNA from Bacillus licheniformis. To eliminate the contamination, a combination of tools available on the Galaxy platform and other web-based resources (MeDuSa and BLAST) was used. The contaminated reads were treated as a metagenome to isolate the genome of the contaminating organism: they were assembled with metaSPAdes, resulting in a large scaffold of 4.187 Mb, which was identified as Bacillus licheniformis. Once the contaminating organism had been identified, its genome was used as a filter, and sequencing reads that aligned to it with Bowtie 2 were removed. After the contaminating reads were removed, a new assembly was performed with Unicycler, yielding 117 contigs with a total size of 7.9 Mb. The completeness of this genome, assessed with BUSCO, was 95.9%. We also used an alternative tool (BBduk) to remove contaminating reads; the resulting Unicycler assembly had 85 contigs, a total size of 8.3 Mb, and a completeness of 99.5%. These results were better than those obtained with SPAdes, which produced less complete genomes than Unicycler (maximum completeness of 97.8%) and was unable to adequately assemble the reads decontaminated with BBduk.
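The filtering idea described above (use the contaminant genome as a reference and discard reads that match it) can be illustrated with a minimal Python sketch. This is not the implementation of Bowtie 2 or BBduk: it is a toy k-mer matcher in the spirit of k-mer-based filtering, and the function names (`build_index`, `split_reads`), the k-mer length, and the `min_hits` threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Iterator[str]:
    """Yield strand-independent k-mers, skipping any with ambiguous bases."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            # The lexicographically smaller of a k-mer and its reverse
            # complement represents both strands with one key.
            yield min(kmer, reverse_complement(kmer))


def build_index(contaminant: str, k: int) -> Set[str]:
    """Collect every canonical k-mer of the contaminant sequence."""
    return set(canonical_kmers(contaminant, k))


def split_reads(
    reads: Iterable[Tuple[str, str]],
    index: Set[str],
    k: int,
    min_hits: int = 1,  # hypothetical threshold, not a tool default
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (name, sequence) reads into (clean, flagged) lists.

    A read is flagged when at least `min_hits` of its canonical k-mers
    occur in the contaminant index.
    """
    clean, flagged = [], []
    for name, seq in reads:
        hits = sum(1 for kmer in canonical_kmers(seq, k) if kmer in index)
        (flagged if hits >= min_hits else clean).append((name, seq))
    return clean, flagged
```

In practice the contaminant sequence would be the scaffold recovered from the metagenome-style assembly, and only the clean reads would be passed on to the assembler. Real tools additionally handle paired-end reads, quality scores, and mismatches, which this sketch ignores.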
When compared with the uncontaminated BRB040 genome, which has a total size of 8.2 Mb and a completeness of 99.8%, the assembly of the reads decontaminated with BBduk gave the better result, with a completeness only 0.3 percentage points below that of the reference. Genome mining of both genomes with antiSMASH 7.0 revealed 24 biosynthetic gene clusters (BGCs) in the BBduk-derived assembly, the same number as in the control BRB040 assembly. The in silico decontamination process therefore allows genome mining of BGCs despite the loss of nucleotides. These findings show that contamination can be effectively removed from a genome using readily available online tools, while preserving a dataset suitable for extracting valuable insights into the secondary metabolism of the target organism. This approach is particularly beneficial when resequencing the samples is not immediately feasible.
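The comparison with the reference can be made explicit with a short calculation. The sketch below uses only the figures reported in this abstract; the dictionary layout and the helper name `gaps_to_reference` are assumptions made for illustration, and the completeness gap is expressed in percentage points.

```python
# Figures as reported in the abstract (sizes in Mb, completeness in %).
REFERENCE = {"size_mb": 8.2, "completeness": 99.8}
ASSEMBLIES = {
    "Bowtie 2 + Unicycler": {"contigs": 117, "size_mb": 7.9, "completeness": 95.9},
    "BBduk + Unicycler": {"contigs": 85, "size_mb": 8.3, "completeness": 99.5},
}


def gaps_to_reference(assembly: dict, reference: dict) -> dict:
    """Return the size difference (Mb) and completeness gap (percentage points)."""
    return {
        "size_diff_mb": round(assembly["size_mb"] - reference["size_mb"], 1),
        "completeness_gap_points": round(
            reference["completeness"] - assembly["completeness"], 1
        ),
    }


for label, stats in ASSEMBLIES.items():
    print(label, gaps_to_reference(stats, REFERENCE))
```

On these figures, the BBduk-derived assembly is 0.1 Mb larger than the reference and 0.3 points less complete, while the Bowtie 2-derived assembly is 0.3 Mb smaller and 3.9 points less complete.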