Decontamination of DNA sequences from a Streptomyces genome for optimal genome mining.

Author information

de Oliveira Raul Vitor Ferreira, Garrido Leandro Maza, Padilla Gabriel

Affiliation

Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo (USP), São Paulo, SP, 05508-900, Brazil.

Publication information

Braz J Microbiol. 2025 Mar;56(1):79-89. doi: 10.1007/s42770-024-01598-2. Epub 2025 Jan 15.

DOI:10.1007/s42770-024-01598-2

PMID:39812972

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11885714/

Abstract

Despite meticulous precautions, contamination of genomic DNA samples is not uncommon. It can significantly compromise the analysis of microbial whole-genome sequencing data and thus affect all downstream analyses. Thanks to advances in software and bioinformatics techniques, it is now possible to address this issue and avoid discarding the entire dataset from a whole-genome sequencing run that is contaminated with the DNA of another bacterium. In this study, the sequencing reads from Streptomyces sp. BRB040, generated on the HiSeq System platform (Illumina Inc., San Diego, USA), were found to be contaminated with DNA from Bacillus licheniformis. To remove the contamination, a combination of tools available on the Galaxy platform and other web-based resources (MeDuSa and BLAST) was used. The contaminated reads were treated as a metagenome in order to isolate the genome of the contaminating organism: they were assembled with metaSPAdes, yielding a large scaffold of 4.187 Mb that was identified as Bacillus licheniformis. This genome was then used as a filter, and sequencing reads that aligned to it with Bowtie 2 were removed. The remaining reads were reassembled with Unicycler, yielding 117 contigs with a total size of 7.9 Mb and a BUSCO completeness of 95.9%. An alternative tool, BBduk, was also used to remove contaminating reads; the resulting Unicycler assembly comprised 85 contigs with a total size of 8.3 Mb and a completeness of 99.5%. These results were better than those obtained with SPAdes, which produced less complete genomes than Unicycler (at most 97.8% completeness) and was unable to adequately assemble the BBduk-decontaminated data. Compared with the uncontaminated BRB040 genome, which has a total size of 8.2 Mb and a completeness of 99.8%, the assembly of the BBduk-decontaminated reads gave the better result of the two decontamination routes, with completeness only 0.3 percentage points below the reference. Genome mining with antiSMASH 7.0 identified 24 biosynthetic gene clusters (BGCs) in both the BBduk-derived assembly and the control BRB040 assembly. The in silico decontamination process therefore allows genome mining of BGCs despite the loss of some nucleotides. These findings show that contamination can be effectively removed from a genome using readily available online tools, while preserving a dataset suitable for extracting valuable insights into the secondary metabolism of the target organism. This approach is particularly beneficial when resequencing the samples is not immediately feasible.
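The filtering idea described in the abstract (use the contaminant genome as a reference and discard reads that match it) can be illustrated with a minimal Python sketch. This is a toy k-mer filter under stated assumptions, not the Galaxy workflow used in the study and not the internal algorithm of BBduk or Bowtie 2. All function names, the sequences, and the choice of `k = 8` are hypothetical and chosen only for illustration; real tools work on FASTQ files and typically use much longer k-mers or full alignment.

```python
from typing import Iterator, List, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every k-mer of seq (upper-cased), skipping k-mers that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield kmer


def build_contaminant_index(reference: str, k: int) -> Set[str]:
    """Collect the k-mers of both strands of the contaminant reference."""
    index = set(kmers(reference, k))
    index.update(kmers(reverse_complement(reference), k))
    return index


def is_contaminated(read: str, index: Set[str], k: int, min_hits: int = 1) -> bool:
    """Flag a read that shares at least min_hits k-mers with the contaminant."""
    hits = 0
    for kmer in kmers(read, k):
        if kmer in index:
            hits += 1
            if hits >= min_hits:
                return True
    return False


def filter_pairs(
    pairs: List[Tuple[str, str]], index: Set[str], k: int, min_hits: int = 1
) -> List[Tuple[str, str]]:
    """Keep only read pairs in which neither mate matches the contaminant."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if not is_contaminated(r1, index, k, min_hits)
        and not is_contaminated(r2, index, k, min_hits)
    ]


# Toy data (hypothetical): an AT-rich "contaminant" and GC-rich "target" reads.
K = 8
contaminant = "ATGAAAGTTCTTAATGCAGGAATTACAGATAAAGCTTTAGAA"
target_1 = "GCCGGCGACCTCGGCGTCGACGCCGTCGAGCTG"
target_2 = "CGGCACCGGCTCGACCGGCGAGCCGAAGGGC"
target_3 = "GTCGCCGGCCTGCTCGGCGGCCTGACCGCC"
contam_fwd = contaminant[12:35]                      # read from the forward strand
contam_rev = reverse_complement(contaminant[0:25])   # read from the reverse strand

index = build_contaminant_index(contaminant, K)
pairs = [
    (target_1, target_2),      # clean pair
    (target_3, contam_fwd),    # one contaminated mate
    (contam_rev, contam_fwd),  # both mates contaminated
]
clean_pairs = filter_pairs(pairs, index, K)
print(f"kept {len(clean_pairs)} of {len(pairs)} read pairs")
```

Dropping a pair when either mate matches keeps the paired-end files in sync for the downstream assembler, at the cost of losing some target reads. That trade-off is consistent with the abstract's remark that decontamination entails some loss of nucleotides.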
