Detecting a wide range of epitranscriptomic modifications using a nanopore-sequencing-based computational approach with 1D score-clustering.

Author information

Vujaklija Ivan, Biđin Siniša, Volarić Marin, Bakić Sara, Li Zhe, Foo Roger, Liu Jianjun, Šikić Mile

Affiliations

Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia.

Laboratory of non-coding DNA, Division of Molecular Biology, Ruđer Bošković Institute, Bijenička cesta 54, 10000 Zagreb, Croatia.

Publication information

Nucleic Acids Res. 2025 Jan 7;53(1). doi: 10.1093/nar/gkae1168.

DOI:10.1093/nar/gkae1168

PMID:39658045

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11724293/

Abstract

To date, over 40 epigenetic and 300 epitranscriptomic modifications have been identified. However, current short-read sequencing-based experimental methods can detect <10% of these modifications. Integrating long-read sequencing technologies with advanced computational approaches, including statistical analysis and machine learning, offers a promising new frontier to address this challenge. While supervised machine learning methods have achieved some success, their usefulness is restricted to a limited number of well-characterized modifications. Here, we introduce Modena, an innovative unsupervised learning approach utilizing long-read nanopore sequencing capable of detecting a broad range of modifications. Modena outperformed other methods in five out of six benchmark datasets, in some cases by a wide margin, while being equally competitive with the second best method on one dataset. Uniquely, Modena also demonstrates consistent accuracy on a DNA dataset, distinguishing it from other approaches. A key feature of Modena is its use of 'dynamic thresholding', an approach based on 1D score-clustering. This methodology differs substantially from the traditional statistics-based 'hard-thresholds.' We show that this approach is not limited to Modena but has broader applicability. Specifically, when combined with two existing algorithms, 'dynamic thresholding' significantly enhances their performance, resulting in up to a threefold improvement in F1-scores.
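The abstract contrasts 'dynamic thresholding' with fixed 'hard-thresholds' but does not say which clustering algorithm is used. As an illustration only, the sketch below assumes a two-cluster 1D k-means (Lloyd's algorithm) over per-position scores and places the cutoff midway between the two cluster centres. The function names, the choice of k = 2, and the example scores are hypothetical and are not taken from Modena.

```python
from typing import List, Tuple


def two_means_1d(scores: List[float], max_iter: int = 100) -> Tuple[float, float]:
    """Cluster 1D scores into a low and a high group (k-means with k=2).

    Assumption: two clusters, initialised at the minimum and maximum score.
    Returns the (low_centre, high_centre) pair.
    """
    lo, hi = min(scores), max(scores)
    if lo == hi:
        raise ValueError("all scores are identical; no split exists")
    for _ in range(max_iter):
        boundary = (lo + hi) / 2
        low = [s for s in scores if s <= boundary]
        high = [s for s in scores if s > boundary]
        # Both groups stay non-empty because lo < hi always holds here.
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return lo, hi


def dynamic_threshold(scores: List[float]) -> float:
    """Data-driven cutoff: midpoint between the two cluster centres."""
    lo, hi = two_means_1d(scores)
    return (lo + hi) / 2


def call_modified(scores: List[float], threshold: float) -> List[bool]:
    """Flag positions whose score exceeds the threshold."""
    return [s > threshold for s in scores]


if __name__ == "__main__":
    # Hypothetical per-position scores: three low, three high.
    example_scores = [0.10, 0.20, 0.15, 0.90, 1.00, 0.95]
    cutoff = dynamic_threshold(example_scores)
    print(cutoff, call_modified(example_scores, cutoff))
```

The point of the sketch is that the cutoff is derived from the score distribution of each dataset, whereas a hard threshold applies the same fixed value (for example, a significance level) regardless of how the scores are distributed.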
