Biostatistics and Biomathematics Program, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M2-B500, Seattle, WA 98109, USA.
Cancer Epidemiol Biomarkers Prev. 2010 Mar;19(3):655-65. doi: 10.1158/1055-9965.EPI-09-0510. Epub 2010 Feb 16.
Advances in biotechnology have raised expectations that biomarkers, including genetic profiles, will yield information to accurately predict outcomes for individuals. However, results to date have been disappointing. In addition, statistical methods to quantify the predictive information in markers have not been standardized.
We discuss statistical techniques for summarizing predictive information that relate to decision making, including risk distribution curves and measures derived from them. Attributes of these measures are contrasted with alternatives such as receiver operating characteristic curves, R², percent reclassification, and net reclassification index. Data are generated from simple models of risk conferred by genetic profiles for individuals in a population. Statistical techniques are illustrated, and the risk prediction capacities of different risk models are quantified.
Risk distribution curves are the most informative and the most relevant to clinical practice. They show the proportions of subjects classified into clinically relevant risk categories. In a population in which 10% have the outcome event and subjects are categorized as high risk if their risk exceeds 20%, we identified some settings where more than half of those destined to have an event were classified as high risk by the risk model. Reaching this level required either 150 genes each with an odds ratio of 1.5 or 250 genes each with an odds ratio of 1.25 when the minor allele frequencies were 10%. We show that conclusions based on receiver operating characteristic curves may not be the same as conclusions based on risk distribution curves.
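The kind of calculation behind these figures can be sketched as follows. This is a minimal illustration, not the authors' simulation code, and it rests on assumptions the abstract does not state: genes are independent and in Hardy-Weinberg equilibrium, each risk allele multiplies the odds of the event by the same odds ratio (a per-allele logistic model), and the intercept is chosen so that the average risk equals the prevalence. Under these assumptions the total risk-allele count is binomial, so the risk distribution can be computed exactly. The function names (`allele_count_pmf`, `calibrate_intercept`, `high_risk_proportions`) are hypothetical.

```python
from math import comb, exp, log


def expit(x):
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + exp(-x))


def allele_count_pmf(n_genes, maf):
    """P(K = k) for the total risk-allele count K ~ Binomial(2 * n_genes, maf).

    Assumes independent genes in Hardy-Weinberg equilibrium (an assumption
    of this sketch, not something stated in the abstract).
    """
    n = 2 * n_genes
    return [comb(n, k) * maf**k * (1.0 - maf) ** (n - k) for k in range(n + 1)]


def mean_risk(pmf, intercept, log_or):
    """Population-average risk under the assumed per-allele logistic model."""
    return sum(p * expit(intercept + k * log_or) for k, p in enumerate(pmf))


def calibrate_intercept(pmf, log_or, prevalence):
    """Bisection for the intercept at which the mean risk equals the prevalence."""
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_risk(pmf, mid, log_or) < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def high_risk_proportions(n_genes, odds_ratio, maf=0.10, prevalence=0.10,
                          threshold=0.20):
    """Return (proportion of cases, proportion of controls) with risk > threshold."""
    pmf = allele_count_pmf(n_genes, maf)
    log_or = log(odds_ratio)
    intercept = calibrate_intercept(pmf, log_or, prevalence)
    risks = [expit(intercept + k * log_or) for k in range(len(pmf))]
    # Weight each allele count by P(K = k) and by the chance of being a case
    # (risk) or a control (1 - risk), then normalise by the group size.
    cases = sum(p * r for p, r in zip(pmf, risks) if r > threshold) / prevalence
    controls = sum(p * (1.0 - r) for p, r in zip(pmf, risks)
                   if r > threshold) / (1.0 - prevalence)
    return cases, controls


if __name__ == "__main__":
    for n_genes, odds_ratio in [(150, 1.5), (250, 1.25)]:
        cases, controls = high_risk_proportions(n_genes, odds_ratio)
        print(f"{n_genes} genes, OR {odds_ratio}: "
              f"{cases:.2f} of cases and {controls:.2f} of controls are high risk")
```

The two returned proportions are points on the case and control risk distribution curves at the 20% threshold. Because the genetic model assumed here may differ from the one used in the paper, the printed values should not be expected to reproduce the published results exactly.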
Many highly predictive genes will be required to identify substantial numbers of subjects at high risk.