An Improved Sequencing-Based Bioinformatics Pipeline to Track the Distribution and Clonal Architecture of Proviral Integration Sites.

Author information

Rosewick Nicolas, Hahaut Vincent, Durkin Keith, Artesi Maria, Karpe Snehal, Wayet Jérôme, Griebel Philip, Arsic Natasa, Marçais Ambroise, Hermine Olivier, Burny Arsène, Georges Michel, Van den Broeke Anne

Affiliations

Laboratory of Experimental Hematology, Institut Jules Bordet, Université Libre de Bruxelles (ULB), Brussels, Belgium.

Unit of Animal Genomics, GIGA, Université de Liège (ULiège), Liège, Belgium.

Publication information

Front Microbiol. 2020 Oct 20;11:587306. doi: 10.3389/fmicb.2020.587306. eCollection 2020.

DOI:10.3389/fmicb.2020.587306

PMID:33193242

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7606357/

Abstract

The combined application of linear amplification-mediated PCR (LAM-PCR) protocols with next-generation sequencing (NGS) has had a large impact on our understanding of retroviral pathogenesis. Previously, considerable effort has been expended to optimize NGS methods to explore the genome-wide distribution of proviral integration sites and the clonal architecture of clinically important retroviruses like human T-cell leukemia virus type-1 (HTLV-1). Once sequencing data are generated, the application of rigorous bioinformatics analysis is central to the biological interpretation of the data. To better exploit the potential information available through these methods, we developed an optimized bioinformatics pipeline to analyze NGS clonality datasets. We found that short-read aligners, specifically designed to manage NGS datasets, provide increased speed, significantly reducing processing time and decreasing the computational burden. This is achieved while also accounting for sequencing base quality. We demonstrate the utility of an additional trimming step in the workflow, which adjusts for the number of reads supporting each insertion site. In addition, we developed a recall procedure to reduce bias associated with proviral integration within low complexity regions of the genome, providing a more accurate estimation of clone abundance. Finally, we recommend the application of a "clean-and-recover" step to clonality datasets generated from large cohorts and longitudinal studies. In summary, we report an optimized bioinformatics workflow for NGS clonality analysis and describe a new set of steps to guide the computational process. We demonstrate that the application of this protocol to the analysis of HTLV-1 and bovine leukemia virus (BLV) clonality datasets improves the quality of data processing and provides a more accurate definition of the clonal landscape in infected individuals. 
The optimized workflow and analysis recommendations can be implemented in the majority of bioinformatics pipelines developed to analyze LAM-PCR-based NGS clonality datasets.
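
As a rough illustration of the read-support adjustment mentioned in the abstract, the sketch below drops insertion sites backed by too few reads and then expresses each remaining site as a fraction of the retained reads. The abstract does not specify the actual procedure, so everything here is an assumption made for illustration: the function name `clone_abundance`, the `min_reads` threshold of 3, the site key layout, and the use of raw read counts as an abundance proxy.

```python
def clone_abundance(site_reads, min_reads=3):
    """Filter insertion sites by read support, then compute relative abundance.

    site_reads: dict mapping an insertion-site key to its supporting read count.
    min_reads: hypothetical minimum read support (not a value from the paper).
    """
    # Keep only sites supported by at least min_reads reads.
    kept = {site: n for site, n in site_reads.items() if n >= min_reads}
    total = sum(kept.values())
    if total == 0:
        return {}
    # Relative abundance = share of the retained reads at each site.
    return {site: n / total for site, n in kept.items()}


# Hypothetical read counts keyed by (chromosome, position, strand).
counts = {
    ("chr1", 1000, "+"): 90,
    ("chr2", 500, "-"): 10,
    ("chr3", 42, "+"): 1,  # below the assumed threshold, so it is dropped
}
abundance = clone_abundance(counts)
```

With these made-up counts, the single-read site is discarded and the two remaining sites share the 100 retained reads. A real pipeline would likely apply further corrections, such as the recall procedure for low complexity regions described above, which this sketch does not attempt.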
