Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

Authors

Radhakrishnan Adityanarayanan, Jain Yajit, Uhler Caroline, Lander Eric S

Affiliations

Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142.

Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139.

Publication information

Proc Natl Acad Sci U S A. 2025 Aug 26;122(34):e2509860122. doi: 10.1073/pnas.2509860122. Epub 2025 Aug 20.

DOI: 10.1073/pnas.2509860122

PMID: 40833404

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12403096/

Abstract

Large-scale scientific datasets today contain tens of thousands of random variables across millions of samples (for example, the RNA expression levels of 20,000 protein-coding genes across 30 million single cells). Being able to quantify dependencies between these variables would help us discover novel relationships between variables of interest. Simple measures of dependence, such as Pearson correlation, are fast to compute, but limited in that they are designed to detect linear relationships between variables. Complex measures are known with the ability to detect any kind of dependence, but they do not readily scale to many modern datasets of interest. We introduce the InterDependence Score (IDS), a scalable measure of dependence that captures linear and various nonlinear dependencies between random variables. Our IDS algorithm is motivated by a dependence measure defined in infinite-dimensional Hilbert spaces, capable of capturing any type of dependence, and a fast (linear time) algorithm that neural networks natively implement to compute dependencies between random variables. We apply IDS to identify 1) relevant variables for predictive modeling tasks, 2) sets of words forming topics from millions of documents, and 3) sets of genes related to "gene-expression programs" in tens of millions of cells. We provide an efficient implementation that computes IDS between billions of pairs of variables across millions of samples in several hours on a single GPU. Given its speed and effectiveness in identifying nonlinear dependencies, we envision IDS will be a valuable tool for uncovering insights from scientific data.
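The abstract's point that Pearson correlation targets linear relationships can be illustrated with a small sketch. The code below is not the IDS algorithm, whose details the abstract does not give. It swaps in a standard kernel-based measure, the biased HSIC estimator with a Gaussian kernel, as one example of a dependence measure defined in a Hilbert space. The function names, the bandwidth `sigma=1.0`, the sample size, and the synthetic data are all assumptions chosen for illustration. Note that this estimator builds n-by-n matrices, so it scales quadratically in the number of samples, which is the kind of cost the abstract says limits such measures on large datasets.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def centered_gaussian_gram(v, sigma=1.0):
    """Doubly centered Gaussian-kernel Gram matrix of a standardized 1-D array."""
    v = (v - v.mean()) / v.std()
    sq_dists = (v[:, None] - v[None, :]) ** 2
    gram = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Subtract row and column means, add back the grand mean (H K H).
    return (gram
            - gram.mean(axis=0, keepdims=True)
            - gram.mean(axis=1, keepdims=True)
            + gram.mean())


def hsic(x, y, sigma=1.0):
    """Biased HSIC estimate: trace(K H L H) / n^2, costing O(n^2) time and memory."""
    k_c = centered_gaussian_gram(x, sigma)
    l_c = centered_gaussian_gram(y, sigma)
    return float((k_c * l_c).mean())


# Synthetic data (an assumption for illustration, not from the paper).
rng = np.random.default_rng(0)
n = 1500
x = rng.uniform(-1.0, 1.0, size=n)
y_dep = x ** 2 + 0.05 * rng.normal(size=n)  # depends on x, but not linearly
y_ind = rng.uniform(-1.0, 1.0, size=n)      # independent of x

r_dep = pearson(x, y_dep)
hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)

print(f"Pearson(x, x^2 + noise) = {r_dep:.3f}")
print(f"HSIC(x, x^2 + noise)    = {hsic_dep:.5f}")
print(f"HSIC(x, independent y)  = {hsic_ind:.5f}")
```

In this setup the Pearson coefficient stays close to zero even though `y_dep` is almost a deterministic function of `x`, while the kernel measure separates the dependent pair from the independent one.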

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/afbf/12403096/ab4c160990ca/pnas.2509860122fig01.jpg
