US Department of Agriculture, Agricultural Research Service, US Meat Animal Research Center, 844 Rd 313, PO Box 165, Clay Center, NE, 68933, USA.
US Department of Agriculture, Agricultural Research Service, Eastern Regional Research Center, 600 East Mermaid Lane, Wyndmoor, PA, 19038, USA.
Sci Rep. 2024 Jun 10;14(1):13257. doi: 10.1038/s41598-024-63832-z.
Salmonella enterica and Escherichia coli are major food-borne human pathogens, and their genomes are routinely sequenced for clinical surveillance. Computational pipelines designed for analyzing pathogen genomes should both utilize the most current information from annotation databases and increase the coverage of these databases over time. We report the development of the GEA pipeline to analyze large batches of E. coli and S. enterica genomes. The GEA pipeline takes paired Illumina raw read files as input, assembles them, and annotates the resulting assemblies. Alternatively, assemblies can be provided as input and annotated directly. The pipeline provides predictive genome annotations for E. coli and S. enterica, with a focus on the Center for Genomic Epidemiology tools, and reports annotation results as a tab-delimited text file. It is designed for large-scale E. coli and S. enterica genome assembly and characterization using the Center for Genomic Epidemiology command-line tools and high-performance computing. Large-scale annotation is demonstrated by an analysis of more than 14,000 Salmonella genome assemblies. Testing the GEA pipeline on E. coli raw reads demonstrates reproducibility across multiple compute environments, and computational resource usage is optimized on high-performance computers.
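The two input modes described in the abstract (paired raw reads that are assembled and then annotated, or assemblies that are annotated directly) can be illustrated with a minimal Python sketch. This is not the GEA pipeline's own code or interface. The file-naming convention (`<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`), the assembly extensions, and the function names `pair_reads` and `plan_batch` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention for paired Illumina reads:
#   <sample>_R1.fastq[.gz] and <sample>_R2.fastq[.gz]
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")

# Assumed (hypothetical) extensions for pre-built assemblies.
ASSEMBLY_SUFFIXES = (".fasta", ".fa")


def pair_reads(filenames):
    """Group FASTQ file names into (R1, R2) pairs keyed by sample name."""
    mates = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            mates[match["sample"]][match["mate"]] = name
    # Keep only samples for which both mates were found.
    return {
        sample: (found["1"], found["2"])
        for sample, found in sorted(mates.items())
        if len(found) == 2
    }


def plan_batch(filenames):
    """Split a batch into read pairs to assemble and assemblies to annotate directly."""
    pairs = pair_reads(filenames)
    assemblies = sorted(f for f in filenames if f.endswith(ASSEMBLY_SUFFIXES))
    return pairs, assemblies
```

Under these assumptions, a sample with only one mate file is left out of the assembly plan rather than being passed downstream as an incomplete pair.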