What can go wrong when observations are not independently and identically distributed: A cautionary note on calculating correlations on combined data sets from different experiments or conditions.

Author information

Saccenti Edoardo

Affiliation

Laboratory of Systems and Synthetic Biology, Wageningen University and Research, Wageningen, Netherlands.

Publication information

Front Syst Biol. 2023 Jan 30;3:1042156. doi: 10.3389/fsysb.2023.1042156. eCollection 2023.

DOI:10.3389/fsysb.2023.1042156

PMID:40809477

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12341968/

Abstract

In the scientific literature, correlation coefficients are often reported after samples from different experiments or conditions, technical replicates, or time series have been merged to increase the sample size. This practice violates two basic assumptions underlying the use of the correlation coefficient: sampling from a single population and independence of the observations (independence of errors). Because correlations are used to measure and infer associations between biological entities, this has tremendous implications for the reliability of scientific results: violating these assumptions leads to wrong and biased results. In this technical note, I review some basic properties of Pearson's correlation coefficient and illustrate some exemplary problems with simulated and experimental data, taking a didactic approach supported by graphical examples.
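The pooling problem described in the abstract can be illustrated with a minimal simulation. The sketch below is not taken from the paper: the function name `pooled_vs_within`, the group size, the mean shift, and the seed are all assumptions chosen for illustration. It draws two conditions in which x and y are independent, shifts the means of the second condition, and compares the Pearson correlation within each condition with the correlation computed on the merged data.

```python
import numpy as np


def pooled_vs_within(n_per_group=200, shift=5.0, seed=0):
    """Compare within-condition and pooled Pearson correlations.

    Hypothetical illustration: x and y are drawn independently in each
    condition, so the true within-condition correlation is zero. The
    second condition has both means shifted by `shift`.
    """
    rng = np.random.default_rng(seed)

    # Condition A: x and y independent, both centred at 0.
    x_a = rng.normal(0.0, 1.0, n_per_group)
    y_a = rng.normal(0.0, 1.0, n_per_group)

    # Condition B: x and y independent, both centred at `shift`.
    x_b = rng.normal(shift, 1.0, n_per_group)
    y_b = rng.normal(shift, 1.0, n_per_group)

    # Pearson correlation within each condition.
    r_a = np.corrcoef(x_a, y_a)[0, 1]
    r_b = np.corrcoef(x_b, y_b)[0, 1]

    # Pearson correlation after merging the two conditions.
    x_pooled = np.concatenate([x_a, x_b])
    y_pooled = np.concatenate([y_a, y_b])
    r_pooled = np.corrcoef(x_pooled, y_pooled)[0, 1]

    return {"r_a": float(r_a), "r_b": float(r_b), "r_pooled": float(r_pooled)}


if __name__ == "__main__":
    for name, value in pooled_vs_within().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions (equal group sizes, unit within-condition variance), the population value of the pooled correlation is (shift/2)^2 / (1 + (shift/2)^2), about 0.86 for a shift of 5, even though x and y are unrelated within either condition. The merged data are a mixture of two populations, so the pooled coefficient mostly reflects the difference between condition means rather than any association between the variables.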
