Aryana-bs: context-aware alignment of bisulfite-sequencing reads.

Authors

Nikaein Hassan, Sharifi-Zarchi Ali, Afzal Afsoon, Ezzati Saeedeh, Rasti Farzane, Chitsaz Hamidreza, Kunde-Ramamoorthy Govindarajan

Affiliations

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

School of Computer Science, Carnegie Mellon University, Pittsburgh, USA.

Publication information

BMC Bioinformatics. 2025 Jul 21;26(1):188. doi: 10.1186/s12859-025-06182-5.

Abstract

BACKGROUND

DNA methylation is essential in various biological processes, including imprinting, development, and inflammation, and in numerous disorders, such as cancer. Bisulfite sequencing (BS) serves as the gold standard for measuring DNA methylation at single-base resolution: it converts unmethylated cytosines to uracils, which are read as thymines after amplification, while leaving methylated cytosines intact. However, this C-to-T conversion presents a well-known challenge for conventional short-read aligners, which treat the converted bases as mismatches. Many aligners that require seed sequences fail when frequent C-to-T conversions occur over short distances, resulting in reduced alignment accuracy. To address this challenge, two alignment methods have been well established: three-letter alignment and wildcard alignment. Three-letter alignment suffers from significant data loss because it converts all cytosines to thymines in both reads and reference, which obscures meaningful information. Wildcard alignment, on the other hand, is biased: it does not treat reads from unmethylated and methylated regions equally, which leads to artifacts and inaccuracies in methylation level estimation.

This work introduces ARYANA-BS, a novel BS aligner that diverges from conventional DNA aligners by directly integrating BS-specific base alterations within its alignment engine. Leveraging known DNA methylation patterns across different genomic contexts, ARYANA-BS constructs five indexes from the reference genome, aligns each read to all indexes, and selects the alignment with the minimum penalty. To further refine alignment accuracy, an optional Expectation-Maximization (EM) step integrates methylation probability information into the choice of the optimal index for each read. This approach aims to enhance BS read alignment accuracy by accommodating the complexities of DNA methylation patterns across diverse genomic contexts.
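The idea of aligning a read against several converted references and keeping the minimum-penalty hit can be illustrated with a toy sketch. This is not the ARYANA-BS implementation: the three indexes below (`unconverted`, `fully_converted`, `non_cpg_converted`) are hypothetical stand-ins for the five indexes mentioned in the abstract, the penalty is a plain mismatch count, only the forward strand is modeled, and all function names are assumptions made for illustration.

```python
import re


def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment of the forward strand: every C whose
    position is not in `methylated` is read as T."""
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def three_letter(seq):
    """Three-letter reduction: collapse every C to T."""
    return seq.replace("C", "T")


def convert_non_cpg(seq):
    """Hypothetical context-aware reference: convert C to T unless the C
    is followed by G (i.e. keep CpG cytosines)."""
    return re.sub(r"C(?!G)", "T", seq)


def mismatches(read, ref, pos):
    """Toy penalty: number of mismatching bases at offset `pos`."""
    window = ref[pos:pos + len(read)]
    return sum(a != b for a, b in zip(read, window))


def best_alignment(read, indexes):
    """Scan every index and return (penalty, index_name, position)
    for the alignment with the smallest penalty."""
    best = None
    for name, ref in indexes.items():
        for pos in range(len(ref) - len(read) + 1):
            candidate = (mismatches(read, ref, pos), name, pos)
            if best is None or candidate < best:
                best = candidate
    return best


# Toy genome with methylated CpG cytosines at positions 2 and 8
# and an unmethylated non-CpG cytosine at position 5.
genome = "GACGTCATCGGA"
converted = bisulfite_convert(genome, methylated={2, 8})
read = converted[2:10]

indexes = {
    "unconverted": genome,
    "fully_converted": three_letter(genome),
    "non_cpg_converted": convert_non_cpg(genome),
}

print(best_alignment(read, indexes))  # (0, 'non_cpg_converted', 2)
```

In this toy case the read matches the context-aware reference exactly, costs one mismatch against the unconverted genome, and costs two against the fully converted one. A three-letter aligner would instead convert the read as well (`three_letter(read)`), which restores an exact match but erases the two methylated cytosines from the sequence used for alignment.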

RESULTS

Experimental evaluations on both simulated and real data reveal that ARYANA-BS achieves state-of-the-art accuracy, maintaining competitive speed and memory efficiency.

CONCLUSIONS

ARYANA-BS significantly improves alignment accuracy for bisulfite sequencing data by effectively integrating DNA methylation-specific alterations and genomic context. It outperforms existing methods, such as BSMAP, bwa-meth, Bismark, BSBolt, and abismal, particularly in robustness against genomic biases and alignment of longer, higher-error reads, demonstrating suitability for cancer research and cell-free DNA studies. While the Expectation-Maximization (EM) algorithm provides only modest initial improvements, it establishes a valuable framework for future refinement and potential enhancements in sensitive applications.
