Key Laboratory of Molecular Virology and Immunology, Institut Pasteur of Shanghai, Center for Biosafety Mega-Science, Chinese Academy of Sciences, Shanghai, 200031, China.
Bio-Med Big Data Center, Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, 200031, China.
BMC Med Genomics. 2021 Dec 14;14(Suppl 6):289. doi: 10.1186/s12920-021-01138-z.
Virus screening and viral genome reconstruction are crucial for rapidly identifying viral pathogens, tracing their source, and understanding pathogenesis when a viral outbreak occurs. Next-generation sequencing (NGS) provides an efficient and unbiased way to identify viral pathogens in host-associated and environmental samples without prior knowledge of the pathogen. Although software for these tasks is available, data analysis still requires manual operation. A mature pipeline is urgently needed when thousands of samples must be rapidly screened for viral pathogens and have their viral genomes reconstructed.
In this paper, we present a rapid and accurate workflow that screens metagenomic sequencing data for viral pathogens and other components and uses a reference-based assembler to reconstruct viral genomes. We tested the workflow on several metagenomic NGS datasets, including a SARS-CoV-2 patient sample, pangolin tissues, and Middle East Respiratory Syndrome (MERS)-infected cells. The workflow demonstrated high accuracy and efficiency in identifying target viruses from large-scale NGS metagenomic data. It handled a broad range of NGS dataset sizes, from small (kb) to large (100 Gb), and each task took from a few minutes to a few hours to complete. The workflow also automatically generates reports with visualized feedback (e.g., metagenomic data quality statistics, host and viral sequence compositions, details about each identified viral pathogen and its coverage, and viral pathogen sequences reassembled against their closest references).
Overall, our system enables rapid screening and identification of viral pathogens from metagenomic data, providing an important tool to support viral pathogen research during a pandemic. The visualized report covers everything from raw sequence quality to the reconstructed viral sequence, which allows non-specialists to screen their own samples for viruses (Additional file 1).
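The per-virus coverage mentioned in the report can be illustrated with a minimal sketch. It is not the workflow's actual implementation: the function name `coverage_stats`, the 0-based half-open interval convention, and the choice of breadth and mean depth as summary statistics are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def coverage_stats(ref_length: int,
                   alignments: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (breadth, mean_depth) of a reference genome.

    Hypothetical helper, not part of the published workflow.
    Each alignment is a 0-based, half-open interval [start, end)
    on the reference; intervals are clipped to the reference bounds.
    """
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")

    # Difference array: +1 where a read starts, -1 where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(ref_length, end)
        if start >= end:
            continue  # empty or fully out-of-range alignment
        diff[start] += 1
        diff[end] -= 1

    covered = 0      # positions with depth >= 1
    total_depth = 0  # sum of depth over all positions
    depth = 0
    for i in range(ref_length):
        depth += diff[i]  # running sum gives depth at position i
        if depth > 0:
            covered += 1
        total_depth += depth

    return covered / ref_length, total_depth / ref_length


if __name__ == "__main__":
    # Two overlapping reads on a 10 bp reference.
    breadth, mean_depth = coverage_stats(10, [(0, 5), (3, 8)])
    print(f"breadth={breadth:.2f} mean_depth={mean_depth:.2f}")
```

Breadth (the fraction of reference positions covered by at least one read) indicates how much of a genome could be reconstructed, while mean depth indicates how well supported each position is. A real pipeline would typically take these intervals from an alignment file rather than a hand-written list.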