Ren Jiayi, Liu Yuqian, Zhu Xiaoyan, Wang Xuwen, Li Yifei, Liu Yuxin, Hu Wenqing, Zhang Xuanping, Wang Jiayin
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, China.
Front Genet. 2023 Jun 1;14:1184744. doi: 10.3389/fgene.2023.1184744. eCollection 2023.
Open chromatin regions (OCRs) are genomic regions associated with basic cellular physiological activities, and chromatin accessibility is reported to affect gene expression and function. A basic computational problem is to estimate OCRs efficiently, which could facilitate both genomic and epigenetic studies. Currently, ATAC-seq and cfDNA-seq (plasma cell-free DNA sequencing) are two popular strategies for detecting OCRs. Because cfDNA-seq can obtain more biomarkers in one round of sequencing, it is considered more effective and convenient. However, because chromatin accessibility varies dynamically, it is difficult to obtain cfDNA-seq training data consisting purely of OCRs or purely of non-OCRs, which introduces label noise for both feature-based and learning-based approaches. In this paper, we propose a learning-based OCR estimation approach with a noise-tolerant design. The proposed approach, named OCRFinder, combines an ensemble learning framework with a semi-supervised strategy to avoid overfitting to noisy labels, that is, false positives among the labeled OCRs and non-OCRs. In the experiments, OCRFinder achieved higher accuracy and sensitivity than alternative noise control strategies and state-of-the-art approaches. OCRFinder also performed well in ATAC-seq and DNase-seq comparison experiments.
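The general idea of pairing an ensemble with a semi-supervised step to tolerate noisy labels can be illustrated with a toy sketch. This is not the OCRFinder implementation: the abstract does not specify its models, features, or thresholds. Everything below is an assumption made for illustration, including the function names (fit_stump, ensemble_votes, relabel_noisy), the single hypothetical per-region feature, the leave-one-fold-out threshold classifiers used as ensemble members, and the 0.8 agreement cutoff. Samples whose given label is confidently contradicted by the ensemble are treated as unlabeled and receive the ensemble's pseudo-label.

```python
from typing import List, Tuple


def fit_stump(xs: List[float], ys: List[int]) -> float:
    """Return the threshold t with the fewest training errors
    for the rule 'predict 1 (open) if x > t'."""
    pts = sorted(set(xs))
    candidates = (
        [pts[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
        + [pts[-1] + 1.0]
    )

    def errors(t: float) -> int:
        return sum((x > t) != bool(y) for x, y in zip(xs, ys))

    return min(candidates, key=errors)


def ensemble_votes(xs: List[float], ys: List[int], n_members: int = 5) -> List[float]:
    """Fraction of ensemble members voting 1 for each sample.
    Member k is trained on all samples except those with index % n_members == k."""
    thresholds = []
    for k in range(n_members):
        keep = [i for i in range(len(xs)) if i % n_members != k]
        thresholds.append(fit_stump([xs[i] for i in keep], [ys[i] for i in keep]))
    return [sum(x > t for t in thresholds) / n_members for x in xs]


def relabel_noisy(
    xs: List[float], ys: List[int], n_members: int = 5, confidence: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Flag samples whose label the ensemble confidently contradicts and
    replace that label with the ensemble's pseudo-label."""
    votes = ensemble_votes(xs, ys, n_members)
    cleaned, suspects = list(ys), []
    for i, (y, v) in enumerate(zip(ys, votes)):
        pred = 1 if v >= 0.5 else 0
        agreement = v if pred == 1 else 1.0 - v
        if pred != y and agreement >= confidence:
            suspects.append(i)
            cleaned[i] = pred  # pseudo-label from the ensemble
    return cleaned, suspects


# Hypothetical 1-D feature: negative values for closed regions, positive for open.
features = [float(v) for v in range(-12, -2)] + [float(v) for v in range(3, 13)]
true_labels = [0] * 10 + [1] * 10
noisy_labels = list(true_labels)
noisy_labels[4] = 1   # a closed region mislabeled as open
noisy_labels[15] = 0  # an open region mislabeled as closed

cleaned_labels, suspect_indices = relabel_noisy(features, noisy_labels)
```

On this toy data, the two deliberately flipped labels (indices 4 and 15) are the only samples flagged, and the cleaned labels match the true labels. Real cfDNA-seq features are far less separable, so the cutoff and the choice of ensemble members would need tuning on actual data.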