OCRClassifier: integrating statistical control chart into machine learning framework for better detecting open chromatin regions.

Author information

Lai Xin, Liu Min, Liu Yuqian, Zhu Xiaoyan, Wang Jiayin

Affiliations

School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China.

Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an, China.

Publication information

Front Genet. 2024 Dec 4;15:1400228. doi: 10.3389/fgene.2024.1400228. eCollection 2024.

Abstract

Open chromatin regions (OCRs) play a crucial role in transcriptional regulation and gene expression. In recent years, there has been growing interest in using plasma cell-free DNA (cfDNA) sequencing data to detect OCRs. By analyzing the characteristics of cfDNA fragments and their sequencing coverage, researchers can differentiate OCRs from non-OCRs. However, noise and variability in cfDNA-seq data pose challenges for the training data used in noise-tolerant learning-based OCR estimation approaches, because the training data contain numerous noisy labels that may reduce the accuracy of the results. Current OCR detection methods rely on statistical features derived from typical open and closed chromatin regions to determine whether a region is an OCR or a non-OCR. However, some atypical regions exhibit statistical features that fall between the two categories, making it difficult to classify them definitively as either open or closed chromatin regions (CCRs). These regions should be considered partially open chromatin regions (pOCRs). In this paper, we present OCRClassifier, a novel framework that combines control charts and machine learning to address the impact of a high proportion of noisy labels in the training set and to accurately classify chromatin open states into three classes. Our method comprises two control charts. We first design a robust Hotelling T² control chart and create new run rules to accurately identify reliable OCRs and CCRs within the initial training set. Then, we use only the resulting pure training set of OCRs and CCRs to create and train a sensitized T² control chart, which is specifically designed to differentiate between the three chromatin states: open, partially open, and closed. Experimental results demonstrate that, under this framework, the model not only exhibits excellent performance in three-class classification but also achieves higher accuracy and sensitivity in binary classification than current state-of-the-art models.
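To make the control-chart idea in the abstract more concrete, the following is a minimal sketch of a plain (non-robust) Hotelling T² statistic with two limits used as a three-way decision rule. It is not the authors' implementation: the function names (`fit_reference`, `hotelling_t2`, `classify`), the use of a closed-region reference sample, the two-limit scheme, and the synthetic features are all assumptions made for illustration. The abstract does not specify the robust estimators, the run rules, or how the sensitized chart sets its limits.

```python
import numpy as np


def fit_reference(X):
    """Estimate the mean vector and inverse covariance of a reference sample.

    X has shape (n_regions, n_features). Assumption: the reference sample
    consists of regions believed to be closed chromatin.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance (ddof=1)
    return mean, np.linalg.inv(cov)


def hotelling_t2(X, mean, cov_inv):
    """Return T^2 = (x - mean)^T S^{-1} (x - mean) for every row x of X."""
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)


def classify(t2, warn_limit, action_limit):
    """Map T^2 values to three hypothetical states using two limits.

    Below warn_limit: "closed"; between the limits: "partially_open";
    above action_limit: "open". The limits are illustrative assumptions.
    """
    labels = np.full(t2.shape, "closed", dtype=object)
    labels[t2 > warn_limit] = "partially_open"
    labels[t2 > action_limit] = "open"
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic two-feature data standing in for fragment/coverage features.
    reference = rng.normal(size=(500, 2))
    mean, cov_inv = fit_reference(reference)
    queries = np.array([[0.1, -0.2], [2.5, 2.0], [6.0, 6.0]])
    t2 = hotelling_t2(queries, mean, cov_inv)
    print(classify(t2, warn_limit=6.0, action_limit=14.0))
```

In this sketch the limits are passed in by hand. In practice they would be derived from a reference distribution or from quantiles of T² on trusted training regions, and a robust variant would replace the sample mean and covariance with outlier-resistant estimates.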

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8deb/11652186/abf27d8c199e/fgene-15-1400228-g001.jpg
