Optimal strategies for sequential validation of significant features from high-dimensional genomic data.

Affiliations

Department of Statistics, TU Dortmund University, Germany.

Publication information

J Toxicol Environ Health A. 2012;75(8-10):447-60. doi: 10.1080/15287394.2012.674912.

Abstract

High-dimensional genomic studies play a key role in identifying critical features that are significantly associated with a phenotypic outcome. The two most important examples are the detection of (1) differentially expressed genes from genome-wide gene expression studies and (2) single-nucleotide polymorphisms (SNPs) from genome-wide association studies. Such experiments often have high noise levels, and the validity of statistical conclusions suffers from small sample sizes relative to the large number of features. The corresponding multiple testing problem calls for optimal strategies for controlling the numbers of false discoveries and false nondiscoveries. In addition, a frequent validation problem is that features identified as important in one study are often less so in another study. Adjusting for multiple testing in both studies separately increases the risk of missing the crucial features even further. These problems can be addressed by sequential validation strategies, where only the features found significant in one study enter as candidates in the next study. The quality of different studies, for example in terms of noise levels, may vary considerably. Simulation studies demonstrate that the optimal order for this stepwise procedure is to sort the experimental studies by quality in descending order. The impact of the multiple testing adjustment method (Bonferroni-Holm, FDR) was also analyzed. Finally, the sequential validation strategy was applied to three large breast cancer studies with gene expression measurements, confirming the crucial impact of the order of the validation steps in a real-world application.
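The sequential validation idea described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`holm_reject`, `bh_reject`, `sequential_validation`), the per-study dictionaries of p-values, the toy numbers, and the choice of the Benjamini-Hochberg procedure as the FDR method are all assumptions made here for illustration. The abstract names only "Bonferroni-Holm" and "FDR".

```python
from typing import Callable, Dict, List, Sequence


def holm_reject(pvals: Sequence[float], alpha: float) -> List[bool]:
    """Bonferroni-Holm step-down procedure (family-wise error rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break  # step-down: stop at the first non-rejection
        reject[i] = True
    return reject


def bh_reject(pvals: Sequence[float], alpha: float) -> List[bool]:
    """Benjamini-Hochberg step-up procedure (assumed here as the FDR method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = -1
    for rank, i in enumerate(order):
        if pvals[i] <= alpha * (rank + 1) / m:
            cutoff = rank  # step-up: remember the largest rank that passes
    reject = [False] * m
    for rank in range(cutoff + 1):
        reject[order[rank]] = True
    return reject


def sequential_validation(
    studies: List[Dict[str, float]],
    reject_fn: Callable[[Sequence[float], float], List[bool]],
    alpha: float = 0.05,
) -> List[str]:
    """Pass features through the studies in the given order.

    Only features found significant in one study enter the next one, so the
    multiplicity adjustment in later studies covers fewer hypotheses.
    """
    candidates = sorted(studies[0])
    for study in studies:
        candidates = [f for f in candidates if f in study]
        if not candidates:
            break
        pvals = [study[f] for f in candidates]
        keep = reject_fn(pvals, alpha)
        candidates = [f for f, k in zip(candidates, keep) if k]
    return candidates


# Hypothetical p-values for four features in two studies (made-up numbers).
study_a = {"g1": 0.001, "g2": 0.004, "g3": 0.03, "g4": 0.6}
study_b = {"g1": 0.01, "g2": 0.2, "g3": 0.02, "g4": 0.5}

holm_hits = sequential_validation([study_a, study_b], holm_reject)
bh_hits = sequential_validation([study_a, study_b], bh_reject)
print(holm_hits, bh_hits)
```

The order of the list passed to `sequential_validation` is the order of the validation steps, so reordering the studies is how one would explore the effect of study order that the abstract reports.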

Similar articles

Optimal strategies for sequential validation of significant features from high-dimensional genomic data.

J Toxicol Environ Health A. 2012;75(8-10):447-60. doi: 10.1080/15287394.2012.674912.

Imputing missing genotypes with weighted k nearest neighbors.

J Toxicol Environ Health A. 2012;75(8-10):438-46. doi: 10.1080/15287394.2012.674910.

Impaired performance of FDR-based strategies in whole-genome association studies when SNPs are excluded prior to the analysis.

Genet Epidemiol. 2009 Jan;33(1):45-53. doi: 10.1002/gepi.20355.

Empirical Bayes screening of many p-values with applications to microarray studies.

Bioinformatics. 2005 May 1;21(9):1987-94. doi: 10.1093/bioinformatics/bti301. Epub 2005 Feb 2.

Meta-analysis in genome-wide association datasets: strategies and application in Parkinson disease.

PLoS One. 2007 Feb 7;2(2):e196. doi: 10.1371/journal.pone.0000196.

Hidden Markov models for controlling false discovery rate in genome-wide association analysis.

Methods Mol Biol. 2012;802:337-44. doi: 10.1007/978-1-61779-400-1_22.

Cell adhesion molecules contribute to Alzheimer's disease: multiple pathway analyses of two genome-wide association studies.

J Neurochem. 2012 Jan;120(1):190-8. doi: 10.1111/j.1471-4159.2011.07547.x. Epub 2011 Nov 17.

Group additive regression models for genomic data analysis.

Biostatistics. 2008 Jan;9(1):100-13. doi: 10.1093/biostatistics/kxm015. Epub 2007 May 18.

INTERSNP: genome-wide interaction analysis guided by a priori information.

Bioinformatics. 2009 Dec 15;25(24):3275-81. doi: 10.1093/bioinformatics/btp596. Epub 2009 Oct 16.

Using biological knowledge to discover higher order interactions in genetic association studies.

Genet Epidemiol. 2010 Dec;34(8):863-78. doi: 10.1002/gepi.20542.

Cited by

Elevated mRNA Levels of AURKA, CDC20 and TPX2 are associated with poor prognosis of smoking related lung adenocarcinoma using bioinformatics analysis.

Int J Med Sci. 2018 Nov 5;15(14):1676-1685. doi: 10.7150/ijms.28728. eCollection 2018.

Epsin Family Member 3 and Ribosome-Related Genes Are Associated with Late Metastasis in Estrogen Receptor-Positive Breast Cancer and Long-Term Survival in Non-Small Cell Lung Cancer Using a Genome-Wide Identification and Validation Strategy.

PLoS One. 2016 Dec 7;11(12):e0167585. doi: 10.1371/journal.pone.0167585. eCollection 2016.

Prognostic and predictive values of CDK1 and MAD2L1 in lung adenocarcinoma.

Oncotarget. 2016 Dec 20;7(51):85235-85243. doi: 10.18632/oncotarget.13252.

Transcriptomics in developmental toxicity testing.

EXCLI J. 2013 Dec 12;12:1027-9. eCollection 2013.

Highlight report: Validation of prognostic genes in lung cancer.

EXCLI J. 2014 May 6;13:457-60. eCollection 2014.
