Why have so few proteomic biomarkers "survived" validation? (Sample size and independent validation considerations).

Author information

Hernández Belinda, Parnell Andrew, Pennington Stephen R

Affiliations

Complex and Adaptive Systems Laboratory, School of Mathematical Sciences (Statistics), University College Dublin, Dublin, Ireland; School of Medicine and Medical Science, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland.

Publication information

Proteomics. 2014 Jul;14(13-14):1587-92. doi: 10.1002/pmic.201300377. Epub 2014 May 16.

Abstract

Proteomic biomarker discovery has led to the identification of numerous potential candidates for disease diagnosis, prognosis, and prediction of response to therapy. However, very few of these candidate biomarkers reach clinical validation and go on to be routinely used in clinical practice. One particular issue with biomarker discovery is that proteins identified as significantly changing in the initial discovery experiment do not validate when subsequently tested on separate patient sample cohorts. Here, we seek to highlight some of the statistical challenges surrounding the analysis of LC-MS proteomic data for biomarker candidate discovery. We show that common statistical algorithms run on data with low sample sizes can overfit and yield misleading misclassification rates and AUC values. A common solution to this problem is to prefilter variables (via, e.g., ANOVA and/or correction methods such as Bonferroni or false discovery rate) to give a smaller dataset and reduce the size of the apparent statistical challenge. However, we show that this exacerbates the problem, yielding even higher performance metrics while reducing the predictive accuracy of the biomarker panel. To illustrate some of these limitations, we have run simulation analyses with known biomarkers. For our chosen algorithm (random forests), we show that the above problems are substantially reduced if a sufficient number of samples are analyzed and the data are not prefiltered. Our view is that LC-MS proteomic biomarker discovery data should be analyzed without prefiltering and that increasing the sample size in biomarker discovery experiments should be a very high priority.
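The effect of prefiltering on apparent performance can be illustrated with a small simulation. The sketch below is not the authors' analysis: it swaps in a nearest-centroid classifier for random forests, uses pure-noise data with no true biomarkers, and all function names and parameter values (10 samples per class, 500 features, 10 retained features) are assumptions chosen for illustration. It shows one well-known mechanism, namely that selecting features on all samples before cross-validation can make a panel look far more accurate than it is.

```python
import random

def t_score(a, b):
    """Absolute Welch t-statistic between two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

def top_features(X, y, rows, k):
    """Indices of the k features with the largest |t|, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        scores.append((t_score(a, b), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def predict(X, y, train, test, feats):
    """Nearest-centroid prediction for the held-out sample `test`."""
    dist = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in feats]
        dist[label] = sum((X[test][j] - c) ** 2 for j, c in zip(feats, centroid))
    return min(dist, key=dist.get)

def loo_accuracy(X, y, k, prefilter_outside_cv):
    """Leave-one-out accuracy; features are chosen either once on all
    samples (biased) or separately inside each fold (honest)."""
    rows = list(range(len(X)))
    global_feats = top_features(X, y, rows, k) if prefilter_outside_cv else None
    correct = 0
    for test in rows:
        train = [i for i in rows if i != test]
        feats = global_feats if prefilter_outside_cv else top_features(X, y, train, k)
        correct += predict(X, y, train, test, feats) == y[test]
    return correct / len(X)

def simulate(n_per_class=10, n_features=500, k=10, repeats=5, seed=0):
    """Average both accuracy estimates over several pure-noise datasets."""
    rng = random.Random(seed)
    biased, honest = [], []
    for _ in range(repeats):
        y = [0] * n_per_class + [1] * n_per_class
        # Every feature is N(0, 1) noise, so true accuracy is about 50%.
        X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in y]
        biased.append(loo_accuracy(X, y, k, True))
        honest.append(loo_accuracy(X, y, k, False))
    return sum(biased) / repeats, sum(honest) / repeats

biased_acc, honest_acc = simulate()
print(f"prefilter on all samples:   {biased_acc:.2f}")
print(f"prefilter inside each fold: {honest_acc:.2f}")
```

Because the data contain no signal, any accuracy well above chance in the first estimate comes from the filtering step having already seen the held-out samples. Raising `n_per_class` in `simulate` is a simple way to explore how the gap between the two estimates changes with sample size.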
