Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, 301 University Blvd, Galveston, Texas 77555, United States.
J Proteome Res. 2024 Jun 7;23(6):2298-2305. doi: 10.1021/acs.jproteome.3c00842. Epub 2024 May 29.
Multiple hypothesis testing is an integral component of data analysis for large-scale technologies such as proteomics, transcriptomics, and metabolomics, for which the false discovery rate (FDR) and positive FDR (pFDR) have been accepted as error estimation and control measures. The pFDR is the expectation of the false discovery proportion (FDP), conditional on at least one rejection, where the FDP is the ratio of the number of rejected true null hypotheses to the number of all rejected hypotheses. In practice, this expectation of a ratio is approximated by the ratio of expectations; however, the conditions under which the former can be replaced by the latter have not been investigated. This work derives exact integral expressions for the expectation (pFDR) and variance of the FDP. The widely used approximation (ratio of expectations) is shown to be a particular case of the integral formula for the pFDR, recovered in the limit of a large sample size. A recurrence formula is provided to compute the pFDR for a predefined number of null hypotheses. The variance of the FDP is approximated for a practical application in peptide identification using forward and reversed protein sequences. Simulations demonstrate that the integral expression is more accurate than the approximate formula when the number of hypotheses is small. For large sample sizes, the pFDRs obtained from the integral expression and from the approximation do not differ substantially. Applications to proteomics data sets are included.
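The gap between the expectation of a ratio and the ratio of expectations can be illustrated with a small numerical sketch. It is not the integral expression or the recurrence formula from the paper. It is a brute-force enumeration under assumptions introduced here for illustration only: the number of false rejections V and the number of true rejections S are independent binomial counts, with a rejection probability `t` for each of `m0` true nulls and a hypothetical `power` for each of `m1` true alternatives. The function names are likewise hypothetical.

```python
from math import comb


def binom_pmf_table(n, p):
    """Return [P(X = k) for k in 0..n] for X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def pfdr_by_enumeration(m0, m1, t, power):
    """E[V / (V + S) | V + S > 0], assuming independent
    V ~ Binomial(m0, t) and S ~ Binomial(m1, power)."""
    pv = binom_pmf_table(m0, t)
    ps = binom_pmf_table(m1, power)
    total = 0.0
    for v in range(1, m0 + 1):  # v = 0 contributes nothing to the sum
        for s in range(m1 + 1):
            total += pv[v] * ps[s] * v / (v + s)
    prob_no_rejection = pv[0] * ps[0]
    return total / (1.0 - prob_no_rejection)


def ratio_of_expectations(m0, m1, t, power):
    """E[V] / E[V + S], the commonly used approximation."""
    return m0 * t / (m0 * t + m1 * power)


if __name__ == "__main__":
    # Illustrative settings only; not taken from the paper.
    for m in (5, 200):
        exact = pfdr_by_enumeration(m, m, t=0.05, power=0.5)
        approx = ratio_of_expectations(m, m, t=0.05, power=0.5)
        print(f"m0 = m1 = {m}: enumeration {exact:.5f}, approximation {approx:.5f}")
```

Under these assumed settings the two quantities differ noticeably when only a handful of hypotheses are tested and move close together as the number of hypotheses grows, which is consistent with the large-sample behaviour described in the abstract.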