Lai Xin, Liu Min, Liu Yuqian, Zhu Xiaoyan, Wang Jiayin
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.
Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, China.
Front Genet. 2024 Dec 4;15:1400228. doi: 10.3389/fgene.2024.1400228. eCollection 2024.
Open chromatin regions (OCRs) play a crucial role in transcriptional regulation and gene expression. In recent years, there has been growing interest in using plasma cell-free DNA (cfDNA) sequencing data to detect OCRs. By analyzing the characteristics of cfDNA fragments and their sequencing coverage, researchers can differentiate OCRs from non-OCRs. However, noise and variability in cfDNA-seq data pose challenges for the training data used in noise-tolerance learning-based OCR estimation, because the training data contain numerous noisy labels that may reduce the accuracy of the results. Current OCR detection methods rely on statistical features derived from typical open and closed chromatin regions to determine whether a region is an OCR or a non-OCR. However, some atypical regions exhibit statistical features that fall between the two categories, making it difficult to classify them definitively as either open or closed chromatin regions (CCRs). These regions should be considered partially open chromatin regions (pOCRs). In this paper, we present OCRClassifier, a novel framework that combines control charts and machine learning to address the impact of a high proportion of noisy labels in the training set and to classify chromatin open states accurately into three classes. Our method comprises two control charts. We first design a robust Hotelling T² control chart and create new run rules to accurately identify reliable OCRs and CCRs within the initial training set. Then, we use only the resulting pure training set of OCRs and CCRs to create and train a sensitized T² control chart, which is designed to differentiate among the three chromatin states: open, partially open, and closed. Experimental results demonstrate that under this framework, the model not only exhibits excellent performance in three-class classification but also achieves higher accuracy and sensitivity in binary classification than current state-of-the-art models.
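To make the control-chart idea concrete, the following is a minimal sketch of a plain Hotelling T² chart with two limits that split regions into three zones. It is not the robust or sensitized chart of OCRClassifier, whose estimators and run rules are not given in this abstract. The two features per region, the reference sample, the choice of closed-like regions as the in-control class, the tail probabilities, and all function names are assumptions made only for illustration.

```python
import math

# Hypothetical in-control sample: two features per region (for example a
# coverage score and a fragment-length score). Values are invented.
REFERENCE = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8)]


def mean_and_cov(points):
    """Return the sample mean and the sample covariance (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t2(point, mean, cov):
    """Hotelling T^2 = (x - m)^T S^{-1} (x - m), written out for 2 features."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 symmetric matrix applied to (dx, dy).
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_limit_2d(alpha):
    """Upper-tail chi-square quantile for 2 degrees of freedom.

    With 2 degrees of freedom the quantile has the closed form -2 ln(alpha).
    Using a chi-square limit treats the estimated mean and covariance as known,
    which is an approximation for a small reference sample.
    """
    return -2.0 * math.log(alpha)


def classify(point, mean, cov, warn_alpha=0.05, action_alpha=0.001):
    """Map a region to one of three zones using a warning and an action limit.

    Assumption: the reference sample represents closed-like regions, so small
    T^2 means "closed" and large T^2 means "open". T^2 ignores the direction
    of the deviation, which a real method would need to account for.
    """
    stat = t2(point, mean, cov)
    if stat <= chi2_limit_2d(warn_alpha):
        return "closed"
    if stat <= chi2_limit_2d(action_alpha):
        return "partially open"
    return "open"


if __name__ == "__main__":
    mean, cov = mean_and_cov(REFERENCE)
    for region in [(1.0, 1.0), (1.4, 1.3), (2.0, 2.0)]:
        print(region, round(t2(region, mean, cov), 3), classify(region, mean, cov))
```

The two-limit design mirrors the warning and action limits of classical control charts: points between the limits are neither clearly in control nor clearly out of control, which is one simple way to represent an intermediate class.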