Dissecting gene expression heterogeneity: generalized Pearson correlation squares and the K-lines clustering algorithm.

Authors

Li Jingyi Jessica, Zhou Heather J, Bickel Peter J, Tong Xin

Affiliations

Department of Statistics, University of California, Los Angeles.

Department of Statistics, University of California, Berkeley.

Publication information

J Am Stat Assoc. 2024;119(548):2450-2463. doi: 10.1080/01621459.2024.2342639. Epub 2024 May 24.

DOI:10.1080/01621459.2024.2342639

PMID:39697782

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11651632/

Abstract

Motivated by the pressing needs for dissecting heterogeneous relationships in gene expression data, here we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find K clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way. To infer the population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and show the power advantage of the generalized Pearson correlation squares in capturing mixtures of linear dependences. Gene expression data analyses demonstrate the effectiveness of the generalized Pearson correlation squares and the K-lines clustering algorithm in dissecting complex but interpretable relationships. The estimation and inference procedures are implemented in the R package gR2 (https://github.com/lijy03/gR2).
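The idea in the abstract can be illustrated with a minimal Python sketch. It rests on two assumptions that the abstract does not spell out: that the sample-level statistic is the proportion-weighted sum of within-cluster squared Pearson correlations, and that the clustering step alternates between fitting a total-least-squares line to each cluster and reassigning each point to the line with the smallest perpendicular distance (which treats the two variables symmetrically). The function names (`pearson_r2`, `fit_line`, `k_lines`, `generalized_r2`) are hypothetical; this is not the API of the gR2 package, and the sketch omits the data-adaptive choice of K and the inference procedures.

```python
import math
import random


def _moments(points):
    """Centroid and centered sums of squares/cross-products of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return mx, my, sxx, syy, sxy


def pearson_r2(points):
    """Ordinary squared Pearson correlation (0.0 if either variable is constant)."""
    _, _, sxx, syy, sxy = _moments(points)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def fit_line(points):
    """Total-least-squares line: passes through the centroid, along the major axis."""
    mx, my, sxx, syy, sxy = _moments(points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major principal axis
    return mx, my, theta


def perp_dist(point, line):
    """Perpendicular distance from a point to a fitted line."""
    x, y = point
    mx, my, theta = line
    return abs(-math.sin(theta) * (x - mx) + math.cos(theta) * (y - my))


def k_lines(points, k, n_iter=50, seed=0):
    """Assumed K-lines iteration: refit lines, reassign points, stop when stable."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    for _ in range(n_iter):
        lines = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if len(members) < 2:
                members = rng.sample(points, 2)  # re-seed a degenerate cluster
            lines.append(fit_line(members))
        new_labels = [
            min(range(k), key=lambda j: perp_dist(p, lines[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def generalized_r2(points, labels, k):
    """Assumed sample statistic: sum over clusters of (cluster share) * r^2."""
    n = len(points)
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if len(members) < 2:
            continue  # a cluster with fewer than two points contributes nothing
        total += len(members) / n * pearson_r2(members)
    return total


if __name__ == "__main__":
    # Two exact lines, y = x and y = -x, observed at x = 1, ..., 10.
    pts = [(float(x), float(x)) for x in range(1, 11)]
    pts += [(float(x), -float(x)) for x in range(1, 11)]
    known = [0] * 10 + [1] * 10  # line memberships given by an index variable
    print("pooled r^2:", pearson_r2(pts))
    print("generalized r^2 (known memberships):", generalized_r2(pts, known, 2))
    print("generalized r^2 (k_lines):", generalized_r2(pts, k_lines(pts, 2), 2))
```

In this toy example the pooled squared correlation is 0, because the two slopes cancel, while the weighted statistic with known memberships is 1. The clustering step starts from a random assignment and can stop at a local optimum, so in practice one would likely run it from several starting points.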

