Mathematical Modeling of Avidity Distribution and Estimating General Binding Properties of Transcription Factors from Genome-Wide Binding Profiles.

Author information

Kuznetsov Vladimir A

Affiliations

Bioinformatics Institute, Agency of Science, Technology and Research, 30 Biopolis Street, #07-01 Matrix, Singapore, 138671, Singapore.

School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798, Singapore.

Publication information

Methods Mol Biol. 2017;1613:193-276. doi: 10.1007/978-1-4939-7027-8_9.

DOI:10.1007/978-1-4939-7027-8_9

PMID:28849563

Abstract

The shape of the experimental frequency distributions (EFDs) of diverse molecular interaction events quantifying genome-wide binding is often skewed toward rare but abundant quantities. Such distributions deviate systematically from the standard power-law functions proposed by scale-free network models, suggesting that more explanatory and predictive probabilistic models are needed. The goal of this analytical survey is to identify mechanism-based, data-driven statistical distributions that allow the binding properties of transcription factors to be estimated and predicted from genome-wide binding profiles. Here, we review and develop an analytical framework for modeling, analysis, and prediction of transcription factor (TF) DNA binding properties detected at the genome scale. We introduce a mixture probabilistic model of the binding avidity function that includes nonspecific and specific binding events, and we propose a method for decomposing specific and nonspecific TF-DNA binding events. We show that the Kolmogorov-Waring (KW) probability function (PF), which models the steady-state TF binding-dissociation stochastic process, fits the EFD well for diverse TF-DNA binding datasets. Furthermore, this distribution predicts the total number of TF-DNA binding sites (BSs) and estimates specificity, sensitivity, and other basic statistical features of DNA-TF binding when the experimental datasets are noise-rich and essentially incomplete. The KW distribution fits TF-DNA binding activity equally well for different TFs, including ERE, CREB, STAT1, Nanog, and Oct4. Our analysis reveals that the KW distribution and its generalized form provide a family of power-law-like distributions expressed in terms of hypergeometric series functions, including standard and generalized Pareto and Waring distributions, which offer flexible and common skewed forms of the transcription factor binding site (TFBS) avidity distribution function.
We suggest that the skewed binding events may result from a wide range of evolutionary processes that create weak-avidity TFBSs through random mutations, whereas high-avidity binding sites (i.e., evolutionarily conserved canonical E-boxes) arise only rarely. These, however, may be positively selected in microevolution.
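
To make the phrase "power-law-like" concrete, the following minimal sketch evaluates the textbook one-step Waring distribution, one member of the family the abstract names. It is not the chapter's KW probability function, whose exact parameterization is not given in this record. The function names `waring_pmf` and `waring_sf` and the parameter values are assumptions chosen for illustration only.

```python
import math


def waring_pmf(k, a, b):
    """Textbook Waring pmf: P(X = k) = (a - b) * (b)_k / (a)_{k+1}, k = 0, 1, 2, ...

    (x)_k is the rising factorial Gamma(x + k) / Gamma(x); requires a > b > 0.
    NOTE: illustrative stand-in, not the chapter's Kolmogorov-Waring function.
    """
    if k < 0 or not (a > b > 0):
        raise ValueError("need k >= 0 and a > b > 0")
    # Work in log space so large k does not overflow the gamma function.
    log_p = (math.log(a - b)
             + math.lgamma(b + k) - math.lgamma(b)
             + math.lgamma(a) - math.lgamma(a + k + 1))
    return math.exp(log_p)


def waring_sf(k, a, b):
    """Upper tail P(X >= k) = (b)_k / (a)_k for the same distribution."""
    if k < 0 or not (a > b > 0):
        raise ValueError("need k >= 0 and a > b > 0")
    return math.exp(math.lgamma(b + k) - math.lgamma(b)
                    + math.lgamma(a) - math.lgamma(a + k))


if __name__ == "__main__":
    a, b = 3.5, 1.0  # hypothetical parameters, not fitted to any dataset
    for k in (0, 1, 10, 100, 1000):
        print(k, waring_pmf(k, a, b))
```

For large k the probabilities decay roughly as k^-(a - b + 1), so the tail looks like a power law, while the head of the distribution bends away from a straight line on a log-log plot. That qualitative shape is what "power-law-like" refers to here.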
