Department of Biological and Environmental Sciences, University of Sannio, Via Port'Arsa 11, Benevento, Italy.
BMC Bioinformatics. 2009 Oct 15;10 Suppl 12(Suppl 12):S9. doi: 10.1186/1471-2105-10-S12-S9.
Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and for detecting discriminatory signals, are high-dimensional data. A large number of local maxima (peaks) must be analyzed as part of computational pipelines aimed at building efficient predictive and screening protocols. Given the dimensionality of these data and the sample sizes involved, the risk of over-fitting and selection bias is pervasive. The development of bioinformatics methods based on unsupervised feature extraction can therefore lead to general tools applicable to several fields of predictive proteomics.
We propose a method for feature selection and extraction, grounded in the theory of multi-scale spaces, for high-resolution spectra derived from the analysis of serum. Support vector machines are then used for classification. We use a database of 216 sample spectra, divided into 115 cancer and 91 control samples. The overall accuracy, averaged over a large cross-validation study, is 98.18%. The area under the ROC curve of the best selected model is 0.9962.
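The abstract does not spell out the multi-scale procedure, so the following is only a minimal sketch of one common reading of the idea: smooth the spectrum with Gaussians of increasing width and keep the local maxima that can be tracked across every scale. No class labels are used, which is what makes the step unsupervised. The function names, the scales `(1.0, 2.0, 4.0)`, the tracking `tolerance`, the 5% relative height threshold, and the synthetic spectrum are all illustrative assumptions, not values taken from the paper or its MATLAB scripts.

```python
import math

def gaussian_kernel(sigma):
    """Normalized Gaussian weights, truncated at 3 sigma (assumed cutoff)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma):
    """Convolve the signal with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(w * signal[min(max(i + k - radius, 0), last)]
            for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]

def local_maxima(signal, min_height=None):
    """Indices of interior local maxima, optionally above min_height."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] >= signal[i + 1]
        and (min_height is None or signal[i] >= min_height)
    ]

def persistent_peaks(signal, sigmas=(1.0, 2.0, 4.0), tolerance=2,
                     rel_height=0.05):
    """Keep finest-scale maxima that can be followed through all coarser scales."""
    levels = []
    for sigma in sigmas:
        smoothed = smooth(signal, sigma)
        levels.append(local_maxima(smoothed, rel_height * max(smoothed)))
    kept = []
    for start in levels[0]:
        position = start
        for level in levels[1:]:
            near = [q for q in level if abs(q - position) <= tolerance]
            if not near:
                break
            position = min(near, key=lambda q: abs(q - position))
        else:  # no break: the peak survived every scale
            kept.append(start)
    return kept

# Synthetic spectrum (illustrative only): two real peaks plus a small ripple.
spectrum = [
    10.0 * math.exp(-((i - 50) ** 2) / 50.0)
    + 6.0 * math.exp(-((i - 150) ** 2) / 50.0)
    + 0.1 * math.sin(1.7 * i)
    for i in range(200)
]
peaks = persistent_peaks(spectrum)
print("raw local maxima:", len(local_maxima(spectrum)))
print("persistent peaks:", peaks)
```

In a pipeline of the kind described, the intensities at the retained peak positions would form the feature vector passed to the support vector machine; the classification and cross-validation stages are not shown here.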
We improve on previously known results for this problem on the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the code developed, in the form of MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm.