Cheng Cheng
Department of Biostatistics, St. Jude Children's Research Hospital, 332 N. Lauderdale Street, Memphis, TN 38105-2794.
Comput Stat Data Anal. 2009 Jan 15;53(3):788-800. doi: 10.1016/j.csda.2008.07.004.
Although validation of classification and prediction models has been a long-standing topic in statistics and computer learning, the concept of statistical validation in genome-wide screening studies has been vague. Internal validation generally refers to validation procedures based solely on the study dataset. A popular approach to internal validation of identified genomic features has been split-dataset validation. In contrast to this approach, internal validation in genome-wide association screening studies is precisely defined here through the concepts of association profile and profile significance. A general procedure and two specific profile significance measures are developed and compared with the split-dataset validation approach in a simulation study. The simulation results clearly demonstrate the strengths and limitations of the profile significance approach to internal validation, especially its enormous gain in sensitivity (power) and stability over split-dataset validation. The proposed methodology is illustrated by an example of genome-wide SNP association analysis in genetic epidemiology.
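For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of split-dataset validation, not the paper's implementation. It assumes a quantitative phenotype, additive 0/1/2 genotype coding, a marginal correlation test with a Fisher z normal approximation, and a Bonferroni correction in the validation half. The function names `assoc_pvalues` and `split_validate` and both thresholds are hypothetical, chosen for illustration only. The profile significance measures are not defined in the abstract, so they are not sketched here.

```python
import math
import numpy as np


def assoc_pvalues(X, y):
    """Two-sided p-values for the marginal correlation of each column of X with y.

    Uses the Fisher z normal approximation (an assumption of this sketch).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Monomorphic columns have zero variance; treat their correlation as 0.
    r = np.divide(num, denom, out=np.zeros_like(num, dtype=float), where=denom > 0)
    r = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r) * math.sqrt(n - 3)
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])


def split_validate(X, y, rng, alpha_discovery=0.01, alpha_validation=0.05):
    """Screen features in one random half, then re-test the hits in the other half."""
    n = X.shape[0]
    perm = rng.permutation(n)
    disc, val = perm[: n // 2], perm[n // 2:]

    # Discovery half: keep features passing a lenient screening threshold.
    p_disc = assoc_pvalues(X[disc], y[disc])
    candidates = np.flatnonzero(p_disc < alpha_discovery)

    # Validation half: Bonferroni-correct over the candidates only.
    p_val = assoc_pvalues(X[val][:, candidates], y[val])
    threshold = alpha_validation / max(len(candidates), 1)
    validated = candidates[p_val < threshold]
    return candidates, validated


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_snps = 400, 200
    X = rng.binomial(2, 0.3, size=(n_samples, n_snps))   # simulated genotypes
    y = 1.0 * X[:, 0] + rng.normal(size=n_samples)       # only SNP 0 is causal
    candidates, validated = split_validate(X, y, rng)
    print("candidates:", candidates)
    print("validated: ", validated)
```

Because each half contains only half of the samples, both the screening and the confirmation tests lose power, and the outcome depends on the particular random split. These are the sensitivity and stability costs that the abstract contrasts with the profile significance approach.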