Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

Author information

Radhakrishnan Adityanarayanan, Jain Yajit, Uhler Caroline, Lander Eric S

Affiliations

Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142.

Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139.

Publication information

Proc Natl Acad Sci U S A. 2025 Aug 26;122(34):e2509860122. doi: 10.1073/pnas.2509860122. Epub 2025 Aug 20.

Abstract

Large-scale scientific datasets today contain tens of thousands of random variables across millions of samples (for example, the RNA expression levels of 20,000 protein-coding genes across 30 million single cells). Being able to quantify dependencies between these variables would help us discover novel relationships between variables of interest. Simple measures of dependence, such as Pearson correlation, are fast to compute, but limited in that they are designed to detect linear relationships between variables. Complex measures are known with the ability to detect any kind of dependence, but they do not readily scale to many modern datasets of interest. We introduce the InterDependence Score (IDS), a scalable measure of dependence that captures linear and various nonlinear dependencies between random variables. Our IDS algorithm is motivated by a dependence measure defined in infinite-dimensional Hilbert spaces, capable of capturing any type of dependence, and a fast (linear time) algorithm that neural networks natively implement to compute dependencies between random variables. We apply IDS to identify 1) relevant variables for predictive modeling tasks, 2) sets of words forming topics from millions of documents, and 3) sets of genes related to "gene-expression programs" in tens of millions of cells. We provide an efficient implementation that computes IDS between billions of pairs of variables across millions of samples in several hours on a single GPU. Given its speed and effectiveness in identifying nonlinear dependencies, we envision IDS will be a valuable tool for uncovering insights from scientific data.
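The limitation of Pearson correlation noted in the abstract can be seen on a toy example. The sketch below is not the IDS algorithm, whose details the abstract does not give. It only shows that Pearson correlation is close to zero for y = x^2 when x is symmetric around zero, while a score built from nonlinear features of each variable still registers the dependence. The helper `feature_dependence`, its random cosine features, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def feature_dependence(x, y, n_features=32, seed=0):
    """Hypothetical illustrative score, NOT the IDS from the paper.

    Maps x and y through random cosine features and returns the largest
    absolute Pearson correlation between any x-feature and any y-feature.
    """
    rng = np.random.default_rng(seed)
    # Standardize so the random frequencies act on a comparable scale.
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    wx = rng.normal(size=n_features)
    wy = rng.normal(size=n_features)
    bx = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    by = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    fx = np.cos(np.outer(xs, wx) + bx)  # shape (n_samples, n_features)
    fy = np.cos(np.outer(ys, wy) + by)
    # Center and normalize each feature column so that fx.T @ fy
    # is the matrix of pairwise Pearson correlations.
    fx = fx - fx.mean(axis=0)
    fy = fy - fy.mean(axis=0)
    fx = fx / np.linalg.norm(fx, axis=0)
    fy = fy / np.linalg.norm(fy, axis=0)
    return float(np.abs(fx.T @ fy).max())


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(-1.0, 1.0, size=5000)
    y = x ** 2                              # depends on x, but not linearly
    z = rng.uniform(-1.0, 1.0, size=5000)   # independent of x
    print("Pearson(x, y):           ", pearson(x, y))
    print("feature score (x, y):    ", feature_dependence(x, y))
    print("feature score (x, z):    ", feature_dependence(x, z))
```

Taking the maximum over many feature pairs gives a small positive value even for independent variables, so a score of this kind is only meaningful relative to an independent baseline such as the (x, z) pair above.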

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/afbf/12403096/ab4c160990ca/pnas.2509860122fig01.jpg