Castanet: a pipeline for rapid analysis of targeted multi-pathogen genomic data.

Author affiliations

Nuffield Department of Medicine, Peter Medawar Building for Pathogen Research, University of Oxford, Oxfordshire OX1 3SY, United Kingdom.

Radcliffe Department of Medicine, University of Oxford, West Wing John Radcliffe Hospital, Oxfordshire OX3 9DU, United Kingdom.

Publication information

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae591.

DOI:10.1093/bioinformatics/btae591

PMID:39360992

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11494375/

Abstract

MOTIVATION

Target enrichment strategies generate genomic data from multiple pathogens in a single process, greatly improving sensitivity over metagenomic sequencing and enabling cost-effective, high-throughput surveillance and clinical applications. However, uptake by research and clinical laboratories is constrained by an absence of computational tools that are specifically designed for the analysis of multi-pathogen enrichment sequence data. Here we present an analysis pipeline, Castanet, for use with multi-pathogen enrichment sequencing data. Castanet is designed to work with short-read data produced by existing targeted enrichment strategies, but can be readily deployed on any BAM file generated by another methodology. Also included are an optional graphical interface and installer script.

RESULTS

In addition to genome reconstruction, Castanet reports method-specific metrics that enable quantification of capture efficiency, estimation of pathogen load, differentiation of low-level positives from contamination, and assessment of sequencing quality. Castanet can be used as a traditional end-to-end pipeline for consensus generation, but its strength lies in the ability to process a flexible, pre-defined set of pathogens of interest directly from multi-pathogen enrichment experiments. In our tests, Castanet consensus sequences were accurate reconstructions of reference sequences, including in instances where multiple strains of the same pathogen were present. Castanet performs effectively on standard computers and can process the entire output of a 96-sample enrichment sequencing run (50M reads) using a single batch process command, in <2 h.

AVAILABILITY AND IMPLEMENTATION

Source code is freely available under the GPL-3 license at https://github.com/MultipathogenGenomics/castanet, implemented in Python 3.10 and supported on Ubuntu Linux 22.04. The data underlying this article are available in the European Nucleotide Archive, at https://www.ebi.ac.uk/ena/browser/view/PRJEB77004.
