Null-free False Discovery Rate Control Using Decoy Permutations

Author information

He Kun, Li Meng-Jie, Fu Yan, Gong Fu-Zhou, Sun Xiao-Ming

Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190 China.

University of Chinese Academy of Sciences, Beijing, 100049 China.

Publication information

Acta Math Appl Sin. 2022;38(2):235-253. doi: 10.1007/s10255-022-1077-5. Epub 2022 Apr 9.

DOI:10.1007/s10255-022-1077-5

PMID:35431377

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8994022/

Abstract

Traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have inherent drawbacks. For example, the theoretical null might fail because of improper assumptions about the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach builds simply on the ordering of tests by some statistic or score, whose null distribution need not be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the number of false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. The approach is also evaluated on two real datasets: an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
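The competition idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' published procedure: the abstract does not specify the estimator, the tie rule, or the score, so the function names `abs_mean_diff` and `competition_fdr_select`, the single permutation per test, the rule that ties count as decoy wins, and the conservative `(decoys + 1) / targets` estimate (borrowed from generic competition-based FDR methods) are all assumptions made for illustration.

```python
import random
from statistics import mean


def abs_mean_diff(cases, controls):
    """Hypothetical symmetric score: swapping the two groups leaves it unchanged."""
    return abs(mean(cases) - mean(controls))


def competition_fdr_select(cases, controls, alpha=0.1, score=abs_mean_diff, seed=0):
    """Select tests by target/decoy competition (illustrative sketch only).

    cases[i] and controls[i] hold the sample values for test i.
    Returns the sorted indices of tests reported as discoveries.
    """
    rng = random.Random(seed)
    labelled = []  # one (winning score, is_target, test index) entry per test
    for i, (x, y) in enumerate(zip(cases, controls)):
        target = score(x, y)
        # Decoy: the same score computed after randomly permuting group labels.
        pooled = list(x) + list(y)
        rng.shuffle(pooled)
        decoy = score(pooled[:len(x)], pooled[len(x):])
        # Assumption: a tie counts as a decoy win, which is the conservative choice.
        is_target = target > decoy
        labelled.append((max(target, decoy), is_target, i))

    # Rank all tests by their winning score, best first.
    labelled.sort(key=lambda entry: entry[0], reverse=True)

    n_target = n_decoy = 0
    cutoff = 0
    for k, (_, is_target, _) in enumerate(labelled, start=1):
        if is_target:
            n_target += 1
        else:
            n_decoy += 1
        # Decoy wins above the cutoff stand in for false target discoveries.
        if n_target > 0 and (n_decoy + 1) / n_target <= alpha:
            cutoff = k  # keep the deepest cutoff that satisfies the bound

    return sorted(i for _, is_target, i in labelled[:cutoff] if is_target)
```

Because only the ranking of scores and the target/decoy labels enter the estimate, no null distribution of the score is ever evaluated. One consequence of the `+ 1` term in this sketch is that at least `1 / alpha` target wins are needed before anything can be reported.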
