Anesthesiology and Intensive Care Medicine, University Hospital Greifswald, Ferdinand-Sauerbruch-Straße, D-17475 Greifswald, Germany.
Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK.
Bioinformatics. 2015 Oct 1;31(19):3156-62. doi: 10.1093/bioinformatics/btv334. Epub 2015 May 28.
Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and the identification of peaks relevant for class separation remain challenging.
Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. The approach applies data-adaptive thresholding to protein expression values and then ranks the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Tibshirani et al. (2004) and can be applied in both the two-group and the multi-group setting. It is computationally inexpensive and, in the analysis of a large-scale drug discovery test dataset, achieves prediction accuracy equivalent to that of a random forest. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we identified biologically relevant and statistically predictive marker peaks that were not recognized in the original study.
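The two steps described above (thresholding each feature, then ranking the resulting binary features by a relative entropy score) can be illustrated with a minimal Python sketch. It is not the binda implementation. All function names, the toy data, the pseudo-count smoothing, and the specific score (a class-weighted Kullback-Leibler divergence between each class's Bernoulli frequency and the pooled frequency) are assumptions chosen for illustration. The score actually used by the package may differ.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)) in nats; needs 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def smoothed_freq(ones, n, pseudo=0.5):
    """Frequency of 1s with a pseudo-count (an assumption) to keep it off 0 and 1."""
    return (ones + pseudo) / (n + 2 * pseudo)

def entropy_score(bits, labels, pseudo=0.5):
    """Class-weighted KL divergence of per-class frequencies from the pooled one."""
    n = len(bits)
    pooled = smoothed_freq(sum(bits), n, pseudo)
    score = 0.0
    for cls in sorted(set(labels)):
        members = [b for b, lab in zip(bits, labels) if lab == cls]
        p_cls = smoothed_freq(sum(members), len(members), pseudo)
        score += len(members) / n * bernoulli_kl(p_cls, pooled)
    return score

def dichotomize(values, threshold):
    """Map continuous values to 1 (at or above threshold) or 0 (below)."""
    return [int(v >= threshold) for v in values]

def best_threshold(values, labels):
    """Data-adaptive step: pick the observed value that maximizes the score."""
    candidates = sorted(set(values))
    return max(candidates,
               key=lambda t: entropy_score(dichotomize(values, t), labels))

def rank_features(columns, labels):
    """Return (name, threshold, score) tuples, highest score first."""
    ranked = []
    for name, values in columns.items():
        t = best_threshold(values, labels)
        ranked.append((name, t, entropy_score(dichotomize(values, t), labels)))
    return sorted(ranked, key=lambda r: r[2], reverse=True)

# Hypothetical toy data: two peaks measured in three control and three cancer samples.
labels = ["control", "control", "control", "cancer", "cancer", "cancer"]
columns = {
    "peak_A": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],  # separates the groups
    "peak_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # does not separate them
}

for name, threshold, score in rank_features(columns, labels):
    print(f"{name}: threshold={threshold}, score={score:.3f}")
```

In this toy example the peak whose dichotomized values differ between groups receives the higher score. Because the score sums over classes, the same sketch extends to more than two groups without modification.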
The methodology for binary discriminant analysis is implemented in the R package binda, which is freely available under the GNU General Public License (version 3 or later) from CRAN at http://cran.r-project.org/web/packages/binda/. R scripts reproducing all described analyses are available from the web page http://strimmerlab.org/software/binda/.