Mencius Jun, Chen Wenjun, Zheng Youqi, An Tingyi, Yu Yongguo, Sun Kun, Feng Huijuan, Feng Zhixing
Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China.
Department of Clinical Genetics, Xinhua Hospital affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China.
Nat Commun. 2025 May 2;16(1):4102. doi: 10.1038/s41467-025-59378-x.
As nanopore sequencing has been widely adopted, data accumulation has surged, resulting in over 700,000 public datasets. While these data hold immense potential for advancing genomic research, their utility is compromised by the absence of flowcell type and basecaller configuration in about 85% of the data and associated publications. These parameters are essential for many analysis algorithms, and their misapplication can lead to significant drops in performance. To address this issue, we present LongBow, designed to infer flowcell type and basecaller configuration directly from the base quality value patterns of FASTQ files. LongBow has been tested on 66 in-house basecalled FAST5/POD5 datasets and 1989 public FASTQ datasets, achieving accuracies of 95.33% and 91.45%, respectively. We demonstrate its utility by reanalyzing nanopore sequencing data from the COVID-19 Genomics UK (COG-UK) project. The results show that LongBow is essential for reproducing reported genomic variants and, through a LongBow-based analysis pipeline, we discovered substantially more functionally important variants while improving accuracy in lineage assignment. Overall, LongBow is poised to play a critical role in maximizing the utility of public nanopore sequencing data, while significantly enhancing the reproducibility of related research.
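To make "base quality value patterns" concrete, the sketch below tabulates the distribution of Phred quality scores in a FASTQ file, which is the kind of signal the abstract describes as distinguishing flowcell types and basecaller configurations. It is an illustration only and not LongBow's implementation: the function name `phred_histogram` is hypothetical, and the code assumes unwrapped four-line FASTQ records with Phred+33 encoding.

```python
import io
from collections import Counter


def phred_histogram(fastq_handle, offset=33):
    """Return the normalized distribution of Phred quality scores.

    Hypothetical helper for illustration; assumes four-line FASTQ records
    (no line wrapping) and Phred+33 encoded quality strings.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 3:  # every fourth line holds the quality string
            counts.update(ord(ch) - offset for ch in line.strip())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Map each quality score to the fraction of bases that carry it.
    return {q: n / total for q, n in sorted(counts.items())}


# Toy record: qualities '!', '!', 'I', '5' decode to 0, 0, 40, 20.
example = io.StringIO("@read1\nACGT\n+\n!!I5\n")
print(phred_histogram(example))  # {0: 0.5, 20: 0.25, 40: 0.25}
```

A real classifier would need far more than a raw histogram (for example, features compared against reference profiles for known configurations); the abstract does not specify those details, so none are assumed here.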