Ye Jingjing, Liu Hao, Kirmiz Crystal, Lebrilla Carlito B, Rocke David M
Department of Statistics, University of California, Davis, Davis, CA, 95616, USA.
BMC Bioinformatics. 2007 Dec 12;8:477. doi: 10.1186/1471-2105-8-477.
Advances in biotechnology have increased the demand for novel molecular and statistical methods for disease diagnosis and prognosis. High-resolution mass spectrometry (MS) is one such technology with strong potential to improve health outcomes. Previous studies have identified proteomic biomarkers that can distinguish healthy individuals from cancer patients using MS data. This paper presents an MS study that uses glycomics, the study of glycans and glycoproteins, to identify ovarian cancer. The glycans on proteins may differ between a cancer cell and a normal cell, and these differences may be detectable in the blood. High-resolution MS was applied to measure the relative abundances of multiple potential glycan biomarkers in human serum. With the objective of maximizing the empirical area under the ROC curve (AUC), an analysis method was considered that combines potential glycan biomarkers for the diagnosis of cancer.
Maximizing the empirical AUC of glycomics MS data is a high-dimensional optimization problem. The technical difficulty is that the empirical AUC is not a continuous function of the coefficients: it is an empirical 0-1 loss function of a linear predictor with a large number of covariates. An approach was investigated that regularizes the area under the ROC curve while replacing the 0-1 loss function with a smooth surrogate function. The constrained threshold gradient descent regularization algorithm was applied, with the regularization parameters chosen by cross-validation and the confidence intervals of the regression parameters estimated by the bootstrap. The method is called the TGDR-AUC algorithm. Its properties were studied in a numerical simulation that reflects both the positive values of mass spectrometry data and the correlations between measurements within a person. The simulation supported the asymptotic property that the estimated AUC approaches the true AUC. Finally, mass spectrometry data on serum glycans for ovarian cancer diagnosis were analyzed. The optimal combination found by the TGDR-AUC algorithm yields plausible results, and the detected biomarkers are supported by biological evidence.
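The idea of replacing the 0-1 loss with a smooth surrogate and then taking thresholded gradient steps can be illustrated with a minimal sketch. This is not the authors' implementation: the sigmoid surrogate, the bandwidth `sigma`, the threshold `tau`, the step size, and all function names are assumptions chosen for illustration, and the constraint used in the paper, the cross-validation, and the bootstrap are omitted.

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of (case, control) pairs ranked correctly (ties count 1/2)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def smoothed_auc_and_grad(beta, x_pos, x_neg, sigma):
    """Sigmoid-smoothed AUC and its gradient with respect to beta (assumed surrogate)."""
    # Pairwise covariate differences, shape (n_pos, n_neg, p).
    d = x_pos[:, None, :] - x_neg[None, :, :]
    z = (d @ beta) / sigma
    # Numerically stable sigmoid: 1 / (1 + exp(-z)).
    s = 0.5 * (1.0 + np.tanh(0.5 * z))
    # Derivative of sigmoid(z) with respect to the linear score difference.
    w = s * (1.0 - s) / sigma
    grad = np.einsum("ij,ijp->p", w, d) / w.size
    return float(s.mean()), grad

def tgdr_auc_sketch(x_pos, x_neg, tau=0.8, step=0.05, n_iter=200, sigma=0.5):
    """Threshold gradient ascent on the smoothed AUC (illustrative, unconstrained)."""
    beta = np.zeros(x_pos.shape[1])
    for _ in range(n_iter):
        _, g = smoothed_auc_and_grad(beta, x_pos, x_neg, sigma)
        g_max = np.max(np.abs(g))
        if g_max == 0.0:
            break
        # Update only coordinates whose gradient is within a factor tau of the largest.
        mask = np.abs(g) >= tau * g_max
        beta += step * mask * g
    return beta
```

With `tau` close to 1 only the steepest coordinates move, which tends to give sparse coefficient vectors, while `tau = 0` reduces to ordinary gradient ascent. In this sketch `tau` and `n_iter` play the role of the regularization parameters that the abstract says are chosen by cross-validation.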
The TGDR-AUC algorithm relaxes the normality and independence assumptions made in previous studies. In addition to being flexible and easy to interpret, the algorithm performs well in combining potential biomarkers and is computationally feasible. The TGDR-AUC approach is therefore a plausible algorithm for classifying disease status on the basis of multiple biomarkers.