Yan Hui, Dennis Sandris Nielsen, Lukasz Krych
Department of Preventive Medicine, School of Public Health and Nursing, Hangzhou Normal University, Hangzhou, China.
Department of Food Science, Faculty of Science, University of Copenhagen, Frederiksberg C, Denmark.
Gut Microbes. 2025 Dec;17(1):2516703. doi: 10.1080/19490976.2025.2516703. Epub 2025 Jun 11.
Long-read amplicon profiling by read classification limits phylogenetic analysis of the amplicons, while community analysis of multicopy genes that relies on unique molecular identifier (UMI) correction often demands deep sequencing. To address both limitations, we present a long amplicon consensus analysis (LACA) workflow that employs multiple clustering approaches based on sequence dissimilarity. In both simulated 16S rRNA and real 16-23S rRNA amplicon datasets, LACA kept the average error rate of corrected sequences below 1% for Oxford Nanopore Technologies (ONT) R9.4.1 and R10.3 data, below 0.2% for ONT R10.4.1 data, and below 0.1% for high-accuracy ONT Duplex and Pacific Biosciences (PacBio) circular consensus sequencing (CCS) data. In high-accuracy PacBio CCS data, the clustering-based correction matched UMI correction, and in noisy ONT R10.3 and R9.4.1 data it outperformed 4× UMI correction. Notably, LACA preserved phylogenetic fidelity in long operational taxonomic units and enhanced microbiome-wide phenotype characterization for synthetic mock communities and human vaginal samples.
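The error-rate bounds quoted in the abstract can be made concrete with a small sketch. It assumes that the error rate of a corrected sequence is measured as the edit distance to a known reference divided by the reference length, which is one common convention; the abstract does not state how LACA computes it. The names `edit_distance`, `error_rate`, `mean_error_rate`, `within_bound`, and `ERROR_RATE_BOUNDS` are hypothetical and are not part of the LACA workflow.

```python
from typing import Iterable, Tuple

# Upper bounds on the average error rate, taken from the abstract.
ERROR_RATE_BOUNDS = {
    "ONT R9.4.1": 0.01,
    "ONT R10.3": 0.01,
    "ONT R10.4.1": 0.002,
    "ONT Duplex": 0.001,
    "PacBio CCS": 0.001,
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(corrected: str, reference: str) -> float:
    """Edits per reference base (an assumed definition, see lead-in)."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(corrected, reference) / len(reference)


def mean_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average error rate over (corrected, reference) sequence pairs."""
    rates = [error_rate(c, r) for c, r in pairs]
    if not rates:
        raise ValueError("at least one sequence pair is required")
    return sum(rates) / len(rates)


def within_bound(platform: str, pairs: Iterable[Tuple[str, str]]) -> bool:
    """True if the mean error rate is below the abstract's bound."""
    return mean_error_rate(pairs) < ERROR_RATE_BOUNDS[platform]
```

For example, `within_bound("PacBio CCS", pairs)` would check a list of (corrected, reference) pairs against the 0.1% bound. A quadratic-time edit distance is adequate for illustration; full-length 16-23S amplicons would in practice be compared with a dedicated aligner.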