Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts.

Author information

Wang Ziyuan, Liu Ziyang, Fang Yinshan, Zhang Hao Helen, Sun Xiaoxiao, Hao Ning, Que Jianwen, Ding Hongxu

Affiliations

Department of Pharmacy Practice and Science, University of Arizona, Tucson, AZ, USA.

Statistics and Data Science GIDP, University of Arizona, Tucson, AZ, USA.

Publication information

Nat Commun. 2025 Jan 15;16(1):679. doi: 10.1038/s41467-025-55974-z.

DOI:10.1038/s41467-025-55974-z

PMID:39814719

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11735843/

Abstract

Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a substantial challenge in nanopore sequencing bioinformatics. It has been extensively demonstrated that state-of-the-art basecallers are less compatible with modification-induced sequencing signals. A precise basecalling, on the other hand, serves as the prerequisite for virtually all the downstream analyses. Here, we report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications. With synthesized oligos as the model system, we precisely basecall various out-of-sample RNA modifications. From the representation learning perspective, we attribute this generalizability to basecaller representation space expanded by diverse training modifications. Taken together, we conclude increasing the training data diversity as a paradigm for building modification-tolerant nanopore sequencing basecallers.
