OASIS: An interpretable, finite-sample valid alternative to Pearson's $X^2$ for scientific discovery.

Affiliations

Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142.

Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115.

Publication information

Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2304671121. doi: 10.1073/pnas.2304671121. Epub 2024 Apr 2.

DOI:10.1073/pnas.2304671121

PMID:38564640

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11009617/

Abstract

Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's $X^2$ consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's $X^2$ in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
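To make the idea of a statistic that is "linear in the normalized data matrix" with a closed-form p-value bound more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a row-weight vector `f` with entries in [0, 1] and a column-weight vector `c`, both fixed before looking at the table, and it conditions on the column totals. Under the null that every column draws its observations from the same row distribution, the statistic is a zero-mean weighted sum of bounded independent terms, so Hoeffding's inequality gives the bound used below. The function name, the normalization, and the exact form of the bound are assumptions of this sketch; the paper's definitions, and its procedure for choosing `f` and `c`, may differ.

```python
import math


def linear_stat_pvalue_bound(table, f, c):
    """Sketch of a linear contingency-table statistic with a Hoeffding bound.

    table: list of I rows, each a list of J nonnegative counts.
    f:     I row weights in [0, 1], fixed independently of the table (assumption).
    c:     J column weights, fixed independently of the table (assumption).
    Returns (statistic, upper bound on the two-sided p-value).
    """
    n_rows, n_cols = len(table), len(table[0])
    if len(f) != n_rows or len(c) != n_cols:
        raise ValueError("f needs one weight per row, c one weight per column")
    if any(not 0.0 <= w <= 1.0 for w in f):
        raise ValueError("row weights must lie in [0, 1] for the bound to hold")

    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    if min(col_totals) <= 0:
        raise ValueError("every column needs at least one observation")
    total = sum(col_totals)

    # Pooled mean of f over all observations in the table.
    mu = sum(f[i] * table[i][j]
             for i in range(n_rows) for j in range(n_cols)) / total

    # S = sum_j c_j * sqrt(n_j) * (mean of f in column j - pooled mean).
    stat = 0.0
    for j in range(n_cols):
        mu_j = sum(f[i] * table[i][j] for i in range(n_rows)) / col_totals[j]
        stat += c[j] * math.sqrt(col_totals[j]) * (mu_j - mu)

    # Sum of squared per-observation coefficients: ||c||^2 - <c, sqrt(n)>^2 / M.
    c_norm_sq = sum(w * w for w in c)
    a = sum(c[j] * math.sqrt(col_totals[j]) for j in range(n_cols))
    scale = c_norm_sq - a * a / total
    if scale <= 1e-12:
        # c is proportional to sqrt(n): the statistic is identically zero.
        return stat, 1.0

    bound = min(1.0, 2.0 * math.exp(-2.0 * stat * stat / scale))
    return stat, bound


# Two samples (columns) with clearly different row compositions.
stat, bound = linear_stat_pvalue_bound([[30, 5], [5, 30]], f=[1, 0], c=[1, -1])
print(stat, bound)
```

Because the bound comes from a concentration inequality rather than an asymptotic approximation, it holds for any table size, at the cost of being conservative. It is only valid when `f` and `c` do not depend on the counts being tested, for example when they are chosen on a held-out portion of the data.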
