Unsupervised Contrastive Peak Caller for ATAC-seq.

Author information

Vu Ha T H, Zhang Yudi, Tuteja Geetu, Dorman Karin

Affiliations

Bioinformatics and Computational Biology Program, Iowa State University, Ames IA 50011, USA.

Department of Genetics, Development and Cell Biology, Iowa State University, Ames IA 50011, USA.

Publication information

bioRxiv. 2023 Jan 8:2023.01.07.523108. doi: 10.1101/2023.01.07.523108.

Abstract

The assay for transposase-accessible chromatin with sequencing (ATAC-seq) is a common assay for identifying accessible chromatin regions. It uses a Tn5 transposase that accesses and cuts DNA and ligates adapters to the resulting fragments for subsequent amplification and sequencing. The sequenced regions are quantified and tested for enrichment in a process referred to as "peak calling". Most unsupervised peak calling methods are based on simple statistical models and suffer from elevated false positive rates. Newly developed supervised deep learning methods can be successful, but they rely on high-quality labeled training data, which can be difficult to obtain. Moreover, although biological replicates are recognized to be important, there are no established approaches for using replicates in deep learning tools. The approaches available for traditional methods either cannot be applied to ATAC-seq, where control samples may be unavailable, or are post hoc and do not capitalize on potentially complex but reproducible signal in the read enrichment data. Here, we propose a novel peak caller that uses unsupervised contrastive learning to extract shared signals from multiple replicates. Raw coverage data are encoded into low-dimensional embeddings, which are optimized to minimize a contrastive loss over biological replicates. These embeddings are passed to another contrastive loss for learning and predicting peaks, and are decoded to denoised data under an autoencoder loss. We compared our Replicative Contrastive Learner (RCL) method with other existing methods on ATAC-seq data, using ChromHMM genome annotations and transcription factor ChIP-seq as noisy truth. RCL consistently achieved the best performance.
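
The idea of a contrastive loss over biological replicates can be illustrated with a minimal sketch. The abstract does not specify the loss used by RCL, so the code below is not the authors' implementation. It assumes a generic InfoNCE-style loss in which embeddings of the same genomic region from two replicates form a positive pair and embeddings of different regions act as negatives. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(z_a, z_b, temperature):
    """Pairwise cosine similarities between two embedding sets, divided by temperature."""
    a = [l2_normalize(v) for v in z_a]
    b = [l2_normalize(v) for v in z_b]
    return [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]


def info_nce(logits):
    """Mean cross-entropy in which column i is the positive for row i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the maximum for numerical stability
        log_denominator = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_denominator - row[i]
    return total / len(logits)


def replicate_contrastive_loss(z_a, z_b, temperature=0.5):
    """Symmetric InfoNCE loss between region embeddings from two replicates.

    z_a[i] and z_b[i] are assumed to embed the same genomic region
    in replicate A and replicate B, respectively.
    """
    logits = cosine_logits(z_a, z_b, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (info_nce(logits) + info_nce(transposed))


def jitter(rows, scale=0.01):
    """Add small Gaussian noise to mimic replicate-to-replicate variation."""
    return [[x + random.gauss(0.0, scale) for x in row] for row in rows]


# Toy data: 8 regions with 4-dimensional embeddings shared by both replicates.
random.seed(0)
shared = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
rep_a = jitter(shared)
rep_b = jitter(shared)

aligned_loss = replicate_contrastive_loss(rep_a, rep_b)
# Reversing replicate B mismatches every region with its partner.
shuffled_loss = replicate_contrastive_loss(rep_a, rep_b[::-1])
print(f"aligned: {aligned_loss:.3f}  shuffled: {shuffled_loss:.3f}")
```

In this toy setting the loss should be lower when the two replicates agree region by region than when the pairing is scrambled, which is the property a replicate-aware encoder would be trained to exploit.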

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/84fe/9881890/9397d63c90a3/nihpp-2023.01.07.523108v1-f0001.jpg
