AmpSeqR: an R package for amplicon deep sequencing data analysis.

Affiliations

Population Health and Immunity Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, 3052, Australia.

Department of Medical Biology, The University of Melbourne, Melbourne, VIC, 3052, Australia.

Publication information

F1000Res. 2023 Mar 23;12:327. doi: 10.12688/f1000research.129581.1. eCollection 2023.

DOI:10.12688/f1000research.129581.1

PMID:39584015

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11584457/

Abstract

Amplicon sequencing (AmpSeq) is a methodology that targets specific genomic regions of interest for polymerase chain reaction (PCR) amplification so that they can be sequenced to a high depth of coverage. Amplicons are typically chosen to be highly polymorphic, usually with several highly informative, high-frequency single nucleotide polymorphisms (SNPs) segregating in an amplicon of 100-200 base pairs (bp). This allows high-sensitivity detection and quantification of the frequency of each sequence within each sample, making it suitable for applications such as low-frequency somatic mosaicism detection or minor clone detection in mixed samples. AmpSeq is increasingly applied in both biological and medical studies, in areas such as cancer, infectious diseases, and brain mosaicism. Current bioinformatics pipelines for AmpSeq data processing lack downstream analysis, have difficulty distinguishing true sequences from PCR or sequencing errors and artifacts, and often require bioinformatic expertise. We present a new R package, AmpSeqR, designed for processing deep short-read amplicon sequencing data, with a focus on infectious diseases. The pipeline integrates several existing R packages with newly developed functions to filter reads, remove noise, and improve the accuracy of the detected sequence data, permitting detection of very low-frequency clones in mixed samples. The package provides functions for data pre-processing, amplicon sequence variant (ASV) estimation, data post-processing, and data visualization, and automatically generates a comprehensive R Markdown report containing all essential results, facilitating inclusion in reports and publications. AmpSeqR is publicly available at https://github.com/bahlolab/AmpSeqR.
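
To make the idea of quantifying the frequency of each sequence within a sample concrete, the following is a minimal Python sketch of the general principle only. It is not AmpSeqR's implementation or API (the package itself is written in R); the function name `haplotype_frequencies`, the thresholds `min_count` and `min_freq`, and the toy reads are all hypothetical choices made for illustration. The sketch counts identical reads, discards sequences that are too rare to be trusted, and reports the frequency of each retained sequence among the retained reads.

```python
from collections import Counter


def haplotype_frequencies(reads, min_count=2, min_freq=0.01):
    """Return {sequence: frequency} for sequences passing both thresholds.

    Illustrative only: `min_count` and `min_freq` are assumed cut-offs,
    not values taken from the AmpSeqR paper.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return {}

    # Keep a sequence only if it is supported by enough reads, both in
    # absolute terms and as a fraction of all reads in the sample.
    kept = {
        seq: n
        for seq, n in counts.items()
        if n >= min_count and n / total >= min_freq
    }

    kept_total = sum(kept.values())
    if kept_total == 0:
        return {}

    # Frequencies are renormalised over the retained reads.
    return {seq: n / kept_total for seq, n in kept.items()}


# Toy sample: one dominant clone, one minor clone, and two rare
# sequences that fall below the assumed thresholds.
sample_reads = ["ACGT"] * 950 + ["ACTT"] * 45 + ["AGGT"] * 4 + ["TCGT"] * 1
frequencies = haplotype_frequencies(sample_reads)
for seq, freq in sorted(frequencies.items(), key=lambda kv: -kv[1]):
    print(f"{seq}\t{freq:.4f}")
```

In this toy sample the minor clone `ACTT` is retained at a frequency of 45/995, while the two rarest sequences are treated as noise. A real pipeline would model PCR and sequencing errors rather than rely on fixed cut-offs alone.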
