A Genomic Language Model for Chimera Artifact Detection in Nanopore Direct RNA Sequencing.

Authors

Li Yangyang, Wang Ting-You, Guo Qingxiang, Ren Yanan, Lu Xiaotong, Cao Qi, Yang Rendong

Affiliations

Department of Urology, Northwestern University Feinberg School of Medicine, 303 E Superior St, Chicago, 60611, IL, USA.

Robert H. Lurie Comprehensive Cancer Center, Northwestern University Feinberg School of Medicine, 675 N St Clair St, Chicago, 60611, IL, USA.

Publication information

bioRxiv. 2024 Oct 26:2024.10.23.619929. doi: 10.1101/2024.10.23.619929.

DOI:10.1101/2024.10.23.619929

PMID:39484530

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11526916/

Abstract

Chimera artifacts in nanopore direct RNA sequencing (dRNA-seq) can significantly distort transcriptome analyses, yet their detection and removal remain challenging due to limitations in existing basecalling models. We present DeepChopper, a genomic language model that precisely identifies and removes adapter sequences from base-called dRNA-seq long reads at single-base resolution, operating independently of raw signal or alignment information to effectively eliminate chimeric read artifacts. By removing these artifacts, DeepChopper substantially improves the accuracy of critical downstream analyses, such as transcript annotation and gene fusion detection, thereby enhancing the reliability and utility of nanopore dRNA-seq for transcriptomics research.
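To make the idea of single-base adapter removal concrete, the following is a minimal sketch of the chopping step only. It is not DeepChopper's implementation: it assumes that some model has already produced one label per base (1 for adapter, 0 otherwise), and the function names `adapter_spans` and `chop_read` and the `min_len` parameter are hypothetical names introduced here for illustration.

```python
from typing import List, Tuple


def adapter_spans(labels: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals of consecutive adapter labels (1).

    Runs shorter than `min_len` are ignored (hypothetical noise filter).
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a new adapter run begins here
        elif label != 1 and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    # Close a run that extends to the end of the read.
    if start is not None and len(labels) - start >= min_len:
        spans.append((start, len(labels)))
    return spans


def chop_read(seq: str, labels: List[int], min_len: int = 1) -> List[str]:
    """Split a base-called read at predicted adapter spans, dropping the adapter bases."""
    if len(seq) != len(labels):
        raise ValueError("seq and labels must have the same length")
    pieces: List[str] = []
    prev_end = 0
    for start, end in adapter_spans(labels, min_len):
        if start > prev_end:
            pieces.append(seq[prev_end:start])
        prev_end = end
    if prev_end < len(seq):
        pieces.append(seq[prev_end:])
    return pieces


# Toy chimeric read: two fragments joined by an internal 4-base "adapter".
read = "AAAACCGGTTTT"
per_base_labels = [0] * 4 + [1] * 4 + [0] * 4
print(chop_read(read, per_base_labels))  # ['AAAA', 'TTTT']
```

In this sketch an internal adapter yields two separate fragments, which is how a chimeric read would be resolved, while an adapter at the end of a read is simply trimmed.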
