Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia.
Department of Bioinformatics, School of Life Sciences, Indonesia International Institute for Life Sciences, Jakarta 13210, Indonesia.
Genes (Basel). 2022 Jul 26;13(8):1330. doi: 10.3390/genes13081330.
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is a newly emerged virus, well known as the cause of the worldwide Coronavirus Disease 2019 (COVID-19) pandemic. Major advances in Next Generation Sequencing (NGS) followed the first release of a full-length SARS-CoV-2 genome on 10 January 2020, with the hope of turning the tide against the worsening pandemic. Previous studies of respiratory virus characterization required mapping raw sequences to the human genome in the downstream bioinformatics pipeline, in line with metagenomic principles. Illumina, a major player in NGS, responded by releasing guidelines for an improved enrichment kit, the Respiratory Virus Oligo Panel (RVOP), which is based on a hybridization capture method capable of capturing targeted respiratory viruses, including SARS-CoV-2, thereby allowing raw sequence data to be mapped directly to the SARS-CoV-2 genome in the downstream bioinformatics pipeline. Consequently, two bioinformatics pipelines emerged, and no previous study has benchmarked them against each other. This study aims to gain insight into Illumina's target enrichment workflow by applying two different bioinformatics pipelines, named 'Fast Pipeline' and 'Normal Pipeline', to SARS-CoV-2 strains isolated from Yogyakarta and Central Java, Indonesia. Overall, both pipelines worked well in characterizing the SARS-CoV-2 samples, including identifying the major studied nucleotide substitutions and amino acid mutations. More reads mapped to the SARS-CoV-2 genome in Fast Pipeline, and this was found to contribute to its higher coverage depth and higher number of identified variations (SNPs, insertions, and deletions). Fast Pipeline is ultimately well suited to situations where time is a critical factor. Normal Pipeline, on the other hand, requires more time because it maps reads to the human genome. Certain limitations were identified in the pipeline algorithms, and it is highly recommended that future studies design the pipeline within an integrated framework, for instance by using Nextflow, a workflow framework, to combine all scripts into one fully integrated pipeline.
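The structural difference between the two pipelines can be illustrated with a toy sketch. It is not the study's implementation: exact substring matching stands in for a real read aligner, the sequences are invented, and all function names are hypothetical. It also assumes that Normal Pipeline discards reads that map to the human genome before mapping the remainder to SARS-CoV-2, which the abstract does not spell out.

```python
from typing import List

# Invented toy sequences; real references are a ~30 kb viral genome
# and a ~3 Gb human genome.
viral_ref = "ATGGCTAGCTTACGGA"
human_ref = "CCCTAGCTTGGGAAAT"
reads = ["ATGGCT", "TAGCTT", "TTACGG", "GGGAAA"]


def maps_to(read: str, reference: str) -> bool:
    """Toy stand-in for an aligner: a read maps if it occurs exactly."""
    return read in reference


def fast_pipeline(reads: List[str], viral_ref: str) -> List[str]:
    """Map every read directly to the viral reference."""
    return [r for r in reads if maps_to(r, viral_ref)]


def normal_pipeline(reads: List[str], human_ref: str, viral_ref: str) -> List[str]:
    """Assumed order: drop reads that map to human, then map the rest to virus."""
    non_host = [r for r in reads if not maps_to(r, human_ref)]
    return [r for r in non_host if maps_to(r, viral_ref)]


def coverage_depth(mapped: List[str], viral_ref: str) -> List[int]:
    """Per-position depth: how many mapped reads cover each reference base."""
    depth = [0] * len(viral_ref)
    for read in mapped:
        start = viral_ref.find(read)  # first exact occurrence of the read
        for i in range(start, start + len(read)):
            depth[i] += 1
    return depth


if __name__ == "__main__":
    fast = fast_pipeline(reads, viral_ref)
    normal = normal_pipeline(reads, human_ref, viral_ref)
    print("Fast Pipeline mapped reads:  ", len(fast))
    print("Normal Pipeline mapped reads:", len(normal))
    print("Fast depth:  ", coverage_depth(fast, viral_ref))
    print("Normal depth:", coverage_depth(normal, viral_ref))
```

In this toy data, the read `TAGCTT` occurs in both references, so it is kept by the direct mapping but removed by the host-filtering step. This is only one conceivable way the mapped-read counts of the two approaches could diverge; the sketch makes no claim about the cause observed in the study. The extra pass over the much larger human reference is also where the additional runtime of a Normal-style pipeline would come from.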