Dubois Benjamin, Delitte Mathieu, Lengrand Salomé, Bragard Claude, Legrève Anne, Debode Frédéric
Bioengineering Unit, Life Sciences Department, Walloon Agricultural Research Centre, Gembloux, Belgium.
Earth and Life Institute - Applied Microbiology, Plant Health, UCLouvain, Louvain-la-Neuve, Belgium.
Front Bioinform. 2024 Dec 20;4:1483255. doi: 10.3389/fbinf.2024.1483255. eCollection 2024.
The study of sample taxonomic composition has evolved from direct observation and labor-intensive morphological studies to a range of DNA sequencing methodologies. Most of these studies rely on metabarcoding, which involves amplifying a small, taxonomically informative portion of the genome and then sequencing it at high throughput. Recent advances in sequencing technology from Oxford Nanopore Technologies have revolutionized the field by offering portability, affordable cost and long-read sequencing, thereby significantly increasing taxonomic resolution. However, Nanopore sequencing data exhibit a particular profile, with a higher error rate than Illumina sequencing. Existing bioinformatics pipelines for analyzing such data are scarce and often insufficient, so specialized tools are needed to process long-read sequences accurately.
We present PRONAME (PROcessing NAnopore MEtabarcoding data), an open-source, user-friendly pipeline optimized for processing raw Nanopore sequencing data. PRONAME includes precompiled databases for complete 16S sequences (Silva138 and Greengenes2) and a newly developed, curated database dedicated to bacterial 16S-ITS-23S operon sequences. Users can also provide a custom database, thereby enabling the analysis of metabarcoding data from any domain of life. The pipeline significantly improves sequence accuracy by implementing innovative error-correction strategies and taking advantage of the new sequencing chemistry to produce high-quality duplex reads. Evaluations using a mock community showed that PRONAME delivers consensus sequences with at least 99.5% accuracy under standard settings (and up to 99.7%), making it a robust tool for genomic analysis of complex multi-species communities.
PRONAME meets the challenges of long-read Nanopore data processing, offering greater accuracy and versatility than existing pipelines. By integrating Nanopore-specific quality filtering, clustering and error correction, PRONAME produces high-precision consensus sequences. This brings the accuracy of Nanopore sequencing close to that of Illumina sequencing, while taking advantage of the benefits of long-read technologies.
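To give a sense of scale for the accuracy figures quoted above, the following is a minimal Python sketch of how a consensus sequence could be scored against a known mock-community reference. The abstract does not state how accuracy was measured, so the metric used here (one minus the edit distance divided by the length of the longer sequence) is an assumption, and the function names `edit_distance` and `percent_identity` are hypothetical and not part of PRONAME.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions and deletions needed to turn sequence a into b."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def percent_identity(consensus: str, reference: str) -> float:
    """Assumed accuracy metric: 100 * (1 - edit distance / longer length)."""
    longest = max(len(consensus), len(reference))
    if longest == 0:
        raise ValueError("both sequences are empty")
    return 100.0 * (1 - edit_distance(consensus, reference) / longest)


# Toy example: a 200-base reference and a consensus with one substitution.
reference = "ACGT" * 50
consensus = reference[:100] + "C" + reference[101:]  # position 100 was "A"
print(f"{percent_identity(consensus, reference):.2f}")  # prints 99.50
```

Under this assumed metric, 99.5% accuracy corresponds to one error per 200 bases, or roughly 7 to 8 errors across a full-length 16S sequence of about 1,500 bases.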