Ochoa Alejandro, Storey John D, Llinás Manuel, Singh Mona
Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America.
Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America.
PLoS Comput Biol. 2015 Nov 17;11(11):e1004509. doi: 10.1371/journal.pcbi.1004509. eCollection 2015 Nov.
E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. For the remaining families, which have as-expected noise, lFDR thresholds outperform q-values, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
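As a rough illustration of what "stratified q-values" means in practice, the sketch below computes q-values separately within each stratum (for example, each domain family) instead of over the pooled set of tests. It is not the algorithm developed in the paper: it assumes a generic Benjamini-Hochberg-style q-value estimate with the null proportion fixed at 1, and the names `bh_qvalues`, `stratified_qvalues`, `famA` and `famB`, as well as the p-values, are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg-style q-values for one list of p-values.

    Assumption: the null proportion pi0 is taken to be 1, which gives
    conservative estimates. This is a generic estimator, not the paper's.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # are non-decreasing in p (the running minimum enforces this).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Compute q-values within each stratum, returned in input order."""
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    result = [None] * len(pvalues)
    for indices in groups.values():
        stratum_q = bh_qvalues([pvalues[i] for i in indices])
        for i, q in zip(indices, stratum_q):
            result[i] = q
    return result


# Hypothetical example: six tests split across two domain families.
pvalues = [0.001, 0.01, 0.04, 0.5, 0.02, 0.03]
strata = ["famA", "famA", "famA", "famA", "famB", "famB"]
per_family_q = stratified_qvalues(pvalues, strata)
pooled_q = bh_qvalues(pvalues)
```

Applying one q-value threshold to `per_family_q` rather than to `pooled_q` gives each family its own effective p-value cutoff, which is the sense in which the tests are stratified.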