Department of Biophysics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD, 21218, USA.
10x Genomics, 6230 Stoneridge Mall Road, Pleasanton, CA, 94588-3260, USA.
BMC Bioinformatics. 2023 Mar 6;24(1):84. doi: 10.1186/s12859-022-05047-5.
A cell exhibits a variety of responses to internal and external cues. These responses are possible, in part, because every cell contains an elaborate gene regulatory network (GRN). Over the past 20 years, many groups have worked on reconstructing the topological structure of GRNs from large-scale gene expression data using a variety of inference algorithms. Insights into the participating players in GRNs may ultimately lead to therapeutic benefits. Mutual information (MI) is a widely used metric within this inference/reconstruction pipeline because it can detect any correlation (linear or non-linear) between any number of variables (n dimensions). However, MI estimated from continuous data (for example, normalized fluorescence intensity measurements of gene expression levels) is sensitive to data size, correlation strength and the underlying distributions, and often requires laborious and, at times, ad hoc optimization.
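For reference, the quantity being estimated can be written out explicitly. The block below gives the standard textbook definition of MI for two continuous variables, together with the closed form for a bivariate Gaussian with correlation coefficient \rho, which is what makes Gaussian data a convenient benchmark for estimators. The symbols used here are assumptions for illustration and are not taken from the paper.

```latex
I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy,
\qquad
I_{\mathrm{Gauss}}(X;Y) = -\tfrac{1}{2}\,\log\!\left(1-\rho^{2}\right)
```

With the natural logarithm the result is in nats. For example, \rho = 0.6 gives I = -\tfrac{1}{2}\log(0.64) \approx 0.223 nats.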
In this work, we first show that estimating the MI of bi- and tri-variate Gaussian distributions with a k-nearest neighbor (kNN) estimator significantly reduces error compared with commonly used methods based on fixed binning. Second, we demonstrate that using the kNN-based Kraskov-Stögbauer-Grassberger (KSG) MI estimator significantly improves GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR). Finally, through extensive in silico benchmarking, we show that a new inference algorithm inspired by CLR, CMIA (Conditional Mutual Information Augmentation), outperforms commonly used methods when combined with the KSG-MI estimator.
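The comparison between kNN and fixed-binning estimation can be illustrated with a minimal sketch. It assumes the first KSG estimator (max-norm distances, digamma correction) and a plain plug-in histogram estimator. The function names `ksg_mi` and `binned_mi`, and the choices k = 3, 10 bins, correlation 0.6 and 2000 samples, are hypothetical values picked for illustration, not the settings used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma


def ksg_mi(x, y, k=3):
    """KSG (estimator 1) mutual information in nats for two 1-D samples."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    n = len(x)
    joint = np.hstack([x, y])
    # Distance to the k-th neighbour in the joint space (max norm).
    # k + 1 because the query returns each point as its own nearest neighbour.
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1]
    # Count marginal neighbours strictly closer than eps: shrink the radius by
    # one floating-point step, then subtract 1 to exclude the point itself.
    radius = np.nextafter(eps, 0)
    nx = cKDTree(x).query_ball_point(x, radius, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, radius, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))


def binned_mi(x, y, bins=10):
    """Plug-in MI in nats from a fixed-width 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)   # shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # shape (1, bins)
    nz = pxy > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


# Bivariate Gaussian with known correlation, so the exact MI is available.
rho, n_samples = 0.6, 2000
rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_samples)
x, y = xy[:, 0], xy[:, 1]

print("analytic:", -0.5 * np.log(1.0 - rho**2))
print("KSG     :", ksg_mi(x, y, k=3))
print("binned  :", binned_mi(x, y, bins=10))
```

The binned estimate depends on the number of bins and the sample size, which is the kind of tuning the abstract calls laborious. The KSG estimate has a single parameter, k.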
Using three canonical datasets containing 15 synthetic networks, the newly developed method for GRN reconstruction, which combines CMIA and the KSG-MI estimator, achieves an improvement of 20-35% in precision-recall measures over the current gold standard in the field. This new method will enable researchers to discover new gene interactions or better choose gene candidates for experimental validation.
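For orientation, the CLR scoring step that CMIA is described as being inspired by can be sketched as follows. This is the commonly cited CLR formulation (row-wise z-scores of the MI matrix, clipped at zero and combined per gene pair), not the CMIA algorithm itself, whose details the abstract does not give. The function name `clr_scores`, the exclusion of the diagonal from each gene's background, and the toy MI matrix are assumptions for illustration.

```python
import numpy as np


def clr_scores(mi):
    """CLR-style scores from a symmetric gene-by-gene MI matrix."""
    mi = np.asarray(mi, dtype=float)
    # Background distribution for each gene: its MI with every other gene
    # (the diagonal is excluded here, which is an assumption of this sketch).
    off_diag = ~np.eye(mi.shape[0], dtype=bool)
    background = np.where(off_diag, mi, np.nan)
    mu = np.nanmean(background, axis=1)
    sigma = np.nanstd(background, axis=1)
    sigma = np.where(sigma > 0, sigma, 1.0)  # avoid division by zero
    # z[i, j]: how unusual MI(i, j) is against gene i's background, clipped at 0.
    z = np.maximum(0.0, (mi - mu[:, None]) / sigma[:, None])
    # Combine the two genes' views of the same pair.
    scores = np.sqrt(z**2 + z.T**2)
    np.fill_diagonal(scores, 0.0)
    return scores


# Toy MI matrix for 4 genes: genes 0 and 1 share much more information
# than any other pair.
toy_mi = np.array([
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.2, 0.1, 0.0],
])
scores = clr_scores(toy_mi)
print(np.round(scores, 3))
```

In a full pipeline, the MI matrix would come from an estimator such as KSG applied to every gene pair, and the ranked scores would be compared against a known network to obtain precision-recall curves.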