Yang Chao, Zhang Zhenmiao, Huang Yufen, Xie Xuefeng, Liao Herui, Xiao Jin, Veldsman Werner Pieter, Yin Kejing, Fang Xiaodong, Zhang Lu
Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR 999077, Hong Kong.
BGI Research, Shenzhen 518083, China.
Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae028.
Linked-read sequencing technologies generate short reads with high base quality that also carry inferred information about long-range DNA connectivity. These advantages are well known and have been demonstrated in many human genomic and metagenomic studies. However, existing linked-read analysis pipelines (e.g., Long Ranger) were developed primarily to process sequencing data from the human genome and are not suited to metagenomic sequencing data. Moreover, linked-read analysis pipelines are typically limited to a single sequencing platform.
To address these limitations, we present the Linked-Read ToolKit (LRTK), a unified and versatile toolkit for platform-agnostic processing of linked-read sequencing data from both human genomes and metagenomes. LRTK provides functions for linked-read simulation, barcode sequencing error correction, barcode-aware read alignment and metagenome assembly, reconstruction of long DNA fragments, taxonomic classification and quantification, and barcode-assisted genomic variant calling and phasing. LRTK can process multiple samples automatically and gives users the option to generate reproducible reports during the processing of raw sequencing data and at multiple checkpoints throughout downstream analysis. We applied LRTK to linked reads from simulated, mock community, and real datasets for both human genomes and metagenomes. We showcased LRTK's ability to generate comparative performance results from preceding benchmark studies and to report these results as publication-ready plots in HTML documents.
LRTK provides comprehensive and flexible modules, along with an easy-to-use Python-based workflow, for processing linked-read sequencing datasets. It thereby fills the gap left by existing linked-read analysis tools, which are tied to a specific platform and a specific type of genome.
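To make one of the listed functions concrete, the following is a minimal sketch of what barcode sequencing error correction can look like in general. It is not LRTK's implementation, which the abstract does not describe. It assumes a whitelist of valid barcodes and rescues an observed barcode only when exactly one whitelist entry lies within a single substitution of it. The names `correct_barcode` and `whitelist` are hypothetical and chosen for illustration.

```python
from typing import Optional, Set

BASES = "ACGT"


def correct_barcode(observed: str, whitelist: Set[str]) -> Optional[str]:
    """Return a whitelisted barcode for `observed`, or None if it cannot be rescued.

    Assumed rule (illustrative, not taken from LRTK): accept an exact match,
    otherwise accept a barcode only if exactly one whitelist entry differs
    from it by a single substitution.
    """
    if observed in whitelist:
        return observed

    candidates = set()
    for i, original in enumerate(observed):
        for base in BASES:
            if base == original:
                continue
            # Substitute one position and check the result against the whitelist.
            neighbor = observed[:i] + base + observed[i + 1:]
            if neighbor in whitelist:
                candidates.add(neighbor)

    # An ambiguous barcode (several possible corrections) is left uncorrected.
    if len(candidates) == 1:
        return next(iter(candidates))
    return None
```

Under these assumptions, a barcode with one miscalled base is mapped back to its whitelisted form, while a barcode that is equally close to two whitelist entries is discarded rather than guessed.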