Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, Charlotte, NC, 28223, USA.
BMC Biol. 2022 Oct 5;20(1):221. doi: 10.1186/s12915-022-01426-9.
Predicting cis-regulatory modules (CRMs) in a genome and predicting their functional states in the various cell/tissue types of the organism are two related and challenging computational tasks. Most current methods attempt to achieve both simultaneously, using data for multiple epigenetic marks in a cell/tissue type. Though conceptually attractive, these methods suffer from high false discovery rates and limited applicability. To fill these gaps, we propose a two-step strategy: first predict a map of CRMs in the genome, then predict the functional states of all the CRMs in various cell/tissue types of the organism. We recently developed an algorithm for the first step that predicts CRMs in a genome more accurately and more completely than existing methods by integrating numerous transcription factor ChIP-seq datasets from the organism. Here, we present machine-learning methods for the second step.
We show that the functional states of all the CRMs in the genome in a given cell/tissue type can be accurately predicted by a variety of machine-learning classifiers using data for only 1 to 4 epigenetic marks. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on one human cell/tissue type can accurately predict the functional states of CRMs in other cell/tissue types of humans as well as of mice, and vice versa. This suggests that the epigenetic code defining the functional states of CRMs in various cell/tissue types is shared, at least between humans and mice. Moreover, we found that tens to hundreds of thousands of CRMs were active in a given human or mouse cell/tissue type. Up to 99.98% of them were reutilized in other cell/tissue types, while as few as 0.02% were unique to a single cell/tissue type and might define it.
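The cross-cell-type setup described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the four "mark signals" are hypothetical features, and plain logistic regression stands in for the unspecified "variety of machine-learning classifiers". The sketch trains on CRMs from one simulated cell type and evaluates on a second simulated cell type whose signals are slightly shifted.

```python
import math
import random


def make_cell_type(n, rng, shift=0.0):
    """Generate synthetic CRMs: 4 hypothetical mark signals plus a label
    (1 = active, 0 = inactive). `shift` mimics a cell-type-specific offset."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = (1.5 if label else -1.5) + shift
        features = [rng.gauss(center, 1.0) for _ in range(4)]
        data.append((features, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(data, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_feat = len(data[0][0])
    w = [0.0] * n_feat
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    """Fraction of CRMs whose predicted state (p >= 0.5) matches the label."""
    correct = 0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        correct += int((p >= 0.5) == (y == 1))
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the run is reproducible
train_cells = make_cell_type(400, rng)             # "cell type A"
held_out_same = make_cell_type(400, rng)           # more CRMs from A
other_cells = make_cell_type(400, rng, shift=0.2)  # "cell type B"

w, b = train_logistic(train_cells)
acc_same = accuracy(w, b, held_out_same)
acc_cross = accuracy(w, b, other_cells)
print(f"same cell type: {acc_same:.3f}, other cell type: {acc_cross:.3f}")
```

Because the synthetic classes are well separated by construction, both accuracies come out high. The point is only the evaluation design (fit on one cell type, score on another), not the numbers themselves.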
Our two-step approach can accurately predict the functional states of all the CRMs in the genome in any cell/tissue type using data for only 1 to 4 epigenetic marks. It is also more cost-effective than existing methods, which typically require data for more epigenetic marks. Our results suggest that common epigenetic rules define the functional states of CRMs across cell/tissue types in humans and mice.