The systematic comparison between Gaussian mirror and Model-X knockoff models.

Affiliations

Department of Health Statistics, School of Public Health, Shanxi Medical University, No 56 Xinjian South Road, Yingze District, Taiyuan, Shanxi Province, China.

Department of Statistics, University of Auckland, 38 Princes Street, Auckland Central, Auckland, New Zealand, 1010.

Publication information

Sci Rep. 2023 Apr 4;13(1):5478. doi: 10.1038/s41598-023-32605-5.

DOI:10.1038/s41598-023-32605-5

PMID:37015993

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10073103/

Abstract

While high-dimensional biological data provide unprecedented resources for identifying biomarkers, consensus is still lacking on how best to analyze them. The recently developed Gaussian mirror (GM) and Model-X (MX) knockoff-based methods have closely related model assumptions, which makes them appealing for the detection of new biomarkers. However, there are no guidelines for their practical use. In this research, we systematically compared the performance of MX-based and GM methods, evaluating the impact of the distribution of the explanatory variables, their relatedness, and the signal-to-noise ratio. Among the MX-based methods, MX with knockoffs generated using the second-order approximation (MX-SO) performs best. MX-SO and GM have similar power and computational speed in most simulations, but GM is more robust in controlling the false discovery rate (FDR). In particular, MX-SO controls the FDR well only when correlations among explanatory variables are weak and the sample size is at least moderate. In contrast, GM attains the desired FDR as long as the explanatory variables are not highly correlated. We further used GM and MX-based methods to detect biomarkers associated with an Alzheimer's disease-related PET-imaging trait and with Parkinson's disease-related T-tau in cerebrospinal fluid. We found that MX-based and GM methods are both powerful for the analysis of big biological data. Although the genes selected by the different MX-based methods are more similar to one another than to those selected by the GM method, both MX-based and GM methods identify well-known disease-associated genes for each disease. While MX-based methods can have slightly higher power than the GM method, they are less robust, especially for data with small sample sizes, unknown distributions, and high correlations.
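To make the notions of FDR control and power in the abstract concrete, the following is a minimal sketch of the data-driven selection threshold commonly used with knockoff statistics and mirror statistics. Both kinds of statistic are built to be roughly symmetric about zero for null variables, so the count of large negative values estimates the number of false positives among the large positive ones. The function names, the `offset` parameter, and the toy statistics are assumptions made for illustration; this is not the code used in the study.

```python
import numpy as np

def select_features(stats, q=0.1, offset=1):
    """Select features using a knockoff-style data-driven threshold.

    stats  : feature statistics (e.g. knockoff W_j or mirror M_j), assumed
             symmetric about zero for null features.
    q      : target false discovery rate.
    offset : 1 gives the conservative "knockoff+" threshold, 0 the more
             liberal variant (hypothetical parameter name).
    """
    stats = np.asarray(stats, dtype=float)
    # Candidate thresholds are the nonzero absolute statistics, smallest first.
    candidates = np.sort(np.abs(stats[stats != 0]))
    for t in candidates:
        # Negative tail count stands in for the unknown number of false positives.
        fdp_hat = (offset + np.sum(stats <= -t)) / max(1, np.sum(stats >= t))
        if fdp_hat <= q:
            return np.flatnonzero(stats >= t)
    return np.array([], dtype=int)  # no threshold meets the target

def fdp_and_power(selected, true_signals):
    """Empirical false discovery proportion and power for one simulated data set."""
    selected = {int(j) for j in selected}
    true_signals = set(true_signals)
    fdp = len(selected - true_signals) / max(1, len(selected))
    power = len(selected & true_signals) / max(1, len(true_signals))
    return fdp, power

# Toy example: the first 10 features are treated as true signals (made-up numbers).
toy_stats = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, -1.0, 0.5, -0.2]
chosen = select_features(toy_stats, q=0.2)
print(chosen, fdp_and_power(chosen, range(10)))
```

In a simulation study, averaging the false discovery proportion over many replicates gives the empirical FDR, which can then be compared with the target level `q`.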
