Radhakrishnan Adityanarayanan, Jain Yajit, Uhler Caroline, Lander Eric S
Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142.
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139.
Proc Natl Acad Sci U S A. 2025 Aug 26;122(34):e2509860122. doi: 10.1073/pnas.2509860122. Epub 2025 Aug 20.
Large-scale scientific datasets today contain tens of thousands of random variables across millions of samples (for example, the RNA expression levels of 20,000 protein-coding genes across 30 million single cells). Being able to quantify dependencies between these variables would help us discover novel relationships between variables of interest. Simple measures of dependence, such as Pearson correlation, are fast to compute but limited in that they are designed to detect linear relationships between variables. More complex measures can detect any kind of dependence, but they do not readily scale to many modern datasets of interest. We introduce the InterDependence Score (IDS), a scalable measure of dependence that captures linear and various nonlinear dependencies between random variables. Our IDS algorithm is motivated by two ingredients: a dependence measure defined in infinite-dimensional Hilbert spaces, which can capture any type of dependence, and a fast (linear-time) algorithm that neural networks natively implement to compute dependencies between random variables. We apply IDS to identify 1) relevant variables for predictive modeling tasks, 2) sets of words forming topics in millions of documents, and 3) sets of genes related to "gene-expression programs" in tens of millions of cells. We provide an efficient implementation that computes IDS between billions of pairs of variables across millions of samples in several hours on a single GPU. Given its speed and effectiveness in identifying nonlinear dependencies, we envision that IDS will be a valuable tool for uncovering insights from scientific data.
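The abstract's central contrast — Pearson correlation missing nonlinear dependence that a nonlinear-feature-based score can detect — can be illustrated with a minimal sketch. This is not the IDS algorithm from the paper; it is a generic random-Fourier-feature dependence heuristic, shown only to make the linear-vs-nonlinear distinction concrete. The function name `feature_dependence` and all parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = x**2 + 0.1 * rng.normal(size=n)  # purely nonlinear (quadratic) dependence

# Pearson correlation is near zero here: x and x**2 are uncorrelated
# for a symmetric distribution, even though y is a function of x.
pearson = np.corrcoef(x, y)[0, 1]

def feature_dependence(x, y, n_features=64, seed=0):
    """Illustrative score (NOT the paper's IDS): maximum absolute Pearson
    correlation between y and random cosine features of x.  Nonlinear
    features let a correlation-based score pick up nonlinear structure."""
    r = np.random.default_rng(seed)
    w = r.normal(size=n_features)                 # random frequencies
    b = r.uniform(0, 2 * np.pi, size=n_features)  # random phase shifts
    phi = np.cos(np.outer(x, w) + b)              # (n, n_features) feature matrix
    corrs = [np.corrcoef(phi[:, j], y)[0, 1] for j in range(n_features)]
    return np.max(np.abs(corrs))

score = feature_dependence(x, y)
```

On this synthetic example, `pearson` is close to zero while `score` is substantially larger, mirroring the abstract's point that a linear measure misses dependence a nonlinear one detects. Computing such feature maps is a single matrix operation per variable, which hints at why feature-based dependence scores parallelize well on GPUs.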