Department of Chemistry, Tongji University, Shanghai, 200092, China.
BMC Bioinformatics. 2010 Feb 27;11:109. doi: 10.1186/1471-2105-11-109.
Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry have shown promise in the detection of early-stage cancers. However, dimensionality reduction and classification of such data remain considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and test it on published high-resolution SELDI-TOF data for ovarian cancer.
We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, the SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and used as representative variables, so the high dimensionality of the data is rapidly reduced. To improve efficiency and classification performance, the reduced data are then used in kernel partial least squares (PLS) models. The method achieved an average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and correlation coefficient of 0.9869 over 100 five-fold cross-validations. Furthermore, only one control was misclassified in leave-one-out cross-validation.
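The moment-based reduction step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `moment_features`, the use of equal-width intervals, the population (biased) moment estimators and the non-excess definition of kurtosis are all assumptions made for illustration, and the refinement, t-test and kernel PLS steps are not shown.

```python
import numpy as np

def moment_features(spectra, n_intervals):
    """Reduce each spectrum to four moments per interval (hypothetical helper).

    spectra: array of shape (n_samples, n_points), one spectrum per row.
    Returns an array of shape (n_samples, 4 * n_intervals), ordered as
    mean, variance, skewness, kurtosis for each interval in turn.
    """
    spectra = np.asarray(spectra, dtype=float)
    # Assumption: intervals are contiguous and of (nearly) equal width.
    blocks = np.array_split(spectra, n_intervals, axis=1)
    features = []
    for block in blocks:
        mean = block.mean(axis=1)
        centered = block - mean[:, None]
        var = (centered ** 2).mean(axis=1)   # population variance
        std = np.sqrt(var)
        # Avoid division by zero: a constant interval gets skewness and kurtosis 0.
        safe_std = np.where(std > 0, std, 1.0)
        z = centered / safe_std[:, None]
        skew = (z ** 3).mean(axis=1)         # third standardized moment
        kurt = (z ** 4).mean(axis=1)         # fourth standardized moment (not excess)
        features.extend([mean, var, skew, kurt])
    return np.column_stack(features)

# Toy example: one spectrum with 8 intensity values, split into 2 intervals.
toy = np.array([[1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 2.0, 2.0]])
reduced = moment_features(toy, n_intervals=2)
```

With this layout, a spectrum of any length is represented by `4 * n_intervals` variables, which is the sense in which the dimensionality is reduced before classification.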
The proposed method is suitable for analyzing high-throughput proteomics data.