Analyzing biomarker discovery: Estimating the reproducibility of biomarker sets.

Affiliations

Department of Computing Science, University of Alberta, Edmonton, Canada.

Department of Pure Math, University of Waterloo, Waterloo, ON, Canada.

Publication information

PLoS One. 2022 Jul 28;17(7):e0252697. doi: 10.1371/journal.pone.0252697. eCollection 2022.

DOI:10.1371/journal.pone.0252697

PMID:35901020

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9333302/

Abstract

Many researchers try to understand a biological condition by identifying biomarkers. This is typically done using univariate hypothesis testing over a labeled dataset, declaring a feature to be a biomarker if there is a significant statistical difference between its values for the subjects with different outcomes. However, such sets of proposed biomarkers are often not reproducible - subsequent studies often fail to identify the same sets. Indeed, there is often only a very small overlap between the biomarkers proposed in pairs of related studies that explore the same phenotypes over the same distribution of subjects. This paper first defines the Reproducibility Score for a labeled dataset as a measure (taking values between 0 and 1) of the reproducibility of the results produced by a specified fixed biomarker discovery process for a given distribution of subjects. We then provide ways to reliably estimate this score by defining algorithms that produce an over-bound and an under-bound for this score for a given dataset and biomarker discovery process, for the case of univariate hypothesis testing on dichotomous groups. We confirm that these approximations are meaningful by providing empirical results on a large number of datasets and show that these predictions match known reproducibility results. To encourage others to apply this technique to analyze their biomarker sets, we have also created a publicly available website, https://biomarker.shinyapps.io/BiomarkerReprod/, that produces these Reproducibility Score approximations for any given dataset (with continuous or discrete features and binary class labels).
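To make the discovery process and the notion of overlap concrete, here is a minimal sketch, not the authors' algorithm. It applies a per-feature two-sample test to two datasets simulated from the same distribution and measures how much the resulting biomarker sets agree. Several choices are assumptions made for illustration: the abstract does not name an overlap measure, so the Jaccard index is used here; the p-value uses a normal approximation to the Welch statistic to keep the code dependency-free; and the function names (`welch_p_value`, `discover_biomarkers`, `jaccard`, `sample_dataset`) and all parameter values are hypothetical. The sketch does not compute the paper's over-bound or under-bound.

```python
import random
from statistics import NormalDist, mean, variance


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses the Welch statistic with a normal approximation (an assumption
    of this sketch; a real analysis would use the t distribution).
    """
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    if se == 0.0:
        return 1.0 if mean(a) == mean(b) else 0.0
    z = abs(mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


def discover_biomarkers(X, y, alpha=0.05):
    """Return indices of features that differ significantly between classes 0 and 1."""
    selected = set()
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if welch_p_value(pos, neg) < alpha:
            selected.add(j)
    return selected


def jaccard(a, b):
    """Overlap of two biomarker sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets agree
    return len(a & b) / len(a | b)


def sample_dataset(rng, n_per_class=30, n_features=50, n_informative=5, shift=1.0):
    """Simulate a labeled dataset; only the first n_informative features differ by class."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


# Two "studies" drawn independently from the same simulated distribution.
rng = random.Random(0)
study_a = discover_biomarkers(*sample_dataset(rng))
study_b = discover_biomarkers(*sample_dataset(rng))
print("study A:", sorted(study_a))
print("study B:", sorted(study_b))
print("Jaccard overlap:", jaccard(study_a, study_b))
```

Repeating the last step over many simulated pairs and averaging the overlap would give an empirical sense of how reproducible this discovery process is for a given sample size and effect size. With real data only one dataset is available, which is why the paper estimates bounds instead.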
