OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery.

Authors

Baharav Tavor Z, Tse David, Salzman Julia

Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

Department of Biomedical Data Science, Stanford University, Stanford, CA 94305.

Publication information

bioRxiv. 2023 Nov 3:2023.03.16.533008. doi: 10.1101/2023.03.16.533008.

DOI:10.1101/2023.03.16.533008

PMID:37961606

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10634974/

Abstract

Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none is simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference (1), we develop OASIS (Optimized Adaptive Statistic for Inferring Structure), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. The same method, based on OASIS significance calls, detects SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which cannot be achieved with current approaches. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data such as single-cell RNA sequencing: under accepted noise models, OASIS still provides good control of the false discovery rate, while Pearson's test consistently rejects the null. Additionally, we show on synthetic data that OASIS is more powerful than Pearson's test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
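The abstract does not state the test statistic explicitly, so the following is only an illustrative sketch of how a statistic that is linear in a normalized count matrix can yield a closed-form p-value bound via Hoeffding's inequality. The specific normalization (centering each column by the pooled row frequencies and scaling by the square root of the column total), the function name `oasis_style_bound`, and the requirement that the row weights `f` lie in [0, 1] are assumptions made for this example, not details taken from the text above. The bound is valid only if `f` and `c` are fixed without looking at the table being tested.

```python
import numpy as np

def oasis_style_bound(X, f, c):
    """Linear statistic S = f^T X_tilde c and a Hoeffding-type p-value bound.

    X : (I, J) array of counts (rows = categories, columns = samples).
    f : (I,) row weights, assumed to lie in [0, 1].
    c : (J,) column weights (any real values).
    Assumption: f and c were chosen independently of X.
    """
    X = np.asarray(X, dtype=float)
    f = np.asarray(f, dtype=float)
    c = np.asarray(c, dtype=float)

    n = X.sum(axis=0)                 # column totals n_j
    if np.any(n <= 0):
        raise ValueError("every column must contain at least one count")
    if f.min() < 0 or f.max() > 1:
        raise ValueError("f must take values in [0, 1]")
    M = n.sum()                       # total number of observations

    p_hat = X.sum(axis=1) / M         # pooled row frequencies under the null
    # Center each column by its expected counts, then scale column j by sqrt(n_j).
    X_tilde = (X - np.outer(p_hat, n)) / np.sqrt(n)
    S = f @ X_tilde @ c               # statistic, linear in X_tilde

    # S is a zero-mean weighted sum of independent terms bounded in [0, 1];
    # the squared weights add up to ||c||^2 - <c, sqrt(n)>^2 / M.
    var_proxy = c @ c - (c @ np.sqrt(n)) ** 2 / M
    if var_proxy <= 1e-12:            # c proportional to sqrt(n): S is identically 0
        return float(S), 1.0
    bound = min(1.0, 2.0 * np.exp(-2.0 * S ** 2 / var_proxy))
    return float(S), float(bound)

# Two samples with completely different row distributions.
S, p_bound = oasis_style_bound([[10, 0], [0, 10]], f=[1, 0], c=[1, -1])
print(S, p_bound)
```

Under these assumptions the bound needs no resampling and no asymptotic approximation, and the chosen `f` and `c` indicate which rows and columns drive a rejection. How such vectors are chosen in practice is not described in the abstract and is left open here.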
