Pepe M S, Thompson M L
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, PO Box 19024, Seattle, WA 98109-1024, USA.
Biostatistics. 2000 Jun;1(2):123-40. doi: 10.1093/biostatistics/1.2.123.
When multiple diagnostic tests are performed on an individual, or multiple disease markers are available, it may be possible to combine the information to diagnose disease. We consider how to choose linear combinations of markers in order to optimize diagnostic accuracy. The accuracy index to be maximized is the area or partial area under the receiver operating characteristic (ROC) curve. We propose a distribution-free rank-based approach for optimizing the area under the ROC curve and compare it with logistic regression and with classic linear discriminant analysis (LDA). It has been shown that LDA optimizes the area under the ROC curve when test results have a multivariate normal distribution in both the diseased and non-diseased populations. Simulation studies suggest that the proposed non-parametric method is efficient when data are multivariate normal. The distribution-free method is generalized to a smooth distribution-free approach to: (i) accommodate some reasonable smoothness assumptions; (ii) incorporate covariate effects; and (iii) yield optimized partial areas under the ROC curve. The last feature is particularly important since it allows one to focus on the region of the ROC curve that is of most relevance to clinical practice. Neither logistic regression nor LDA necessarily maximizes partial areas. The approaches are illustrated on two cancer datasets, one involving serum antigen markers for pancreatic cancer and the other involving longitudinal prostate specific antigen data.
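The rank-based idea can be illustrated with a minimal sketch for two markers. It is not the authors' implementation: the function names, the grid search, and the restriction to two markers are assumptions made for illustration. The empirical area under the ROC curve is computed as the Mann-Whitney statistic, and because that statistic is unchanged by positive rescaling of the score, the sketch only searches combinations of the form Y1 + a*Y2 and a*Y1 + Y2 with a in [-1, 1].

```python
import numpy as np


def empirical_auc(scores_diseased, scores_healthy):
    """Mann-Whitney estimate of the AUC: P(D > H) + 0.5 * P(D == H)."""
    d = np.asarray(scores_diseased, dtype=float)[:, None]
    h = np.asarray(scores_healthy, dtype=float)[None, :]
    # Compare every diseased score with every non-diseased score.
    return float(np.mean((d > h) + 0.5 * (d == h)))


def best_linear_combination(diseased, healthy, n_grid=201):
    """Grid search for the two-marker weights with the largest empirical AUC.

    diseased, healthy: arrays of shape (n, 2), one row per subject.
    Returns ((w1, w2), auc). The grid resolution n_grid is an
    illustrative choice, not a recommendation from the source.
    """
    diseased = np.asarray(diseased, dtype=float)
    healthy = np.asarray(healthy, dtype=float)
    best_weights, best_auc = (1.0, 0.0), -1.0
    for a in np.linspace(-1.0, 1.0, n_grid):
        # (1, a) covers Y1 + a*Y2; (a, 1) covers a*Y1 + Y2.
        for weights in ((1.0, a), (a, 1.0)):
            w = np.array(weights)
            auc = empirical_auc(diseased @ w, healthy @ w)
            if auc > best_auc:
                best_weights, best_auc = (float(w[0]), float(w[1])), auc
    return best_weights, best_auc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated markers, purely for demonstration.
    healthy = rng.normal(0.0, 1.0, size=(100, 2))
    diseased = rng.normal([1.0, 0.5], 1.0, size=(100, 2))
    weights, auc = best_linear_combination(diseased, healthy)
    print("weights:", weights, "empirical AUC:", round(auc, 3))
```

The empirical AUC is a step function of the weights, so gradient-based optimizers do not apply directly; a grid search is one simple workaround for two markers, and the smooth distribution-free approach mentioned in the abstract addresses the same difficulty in a different way.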