Han Henry
Department of Mathematics and Bioinformatics, Eastern Michigan University, Ypsilanti, MI 48197, USA.
BMC Syst Biol. 2011;5 Suppl 2(Suppl 2):S5. doi: 10.1186/1752-0509-5-S2-S5. Epub 2011 Dec 14.
Although mass spectrometry-based proteomics shows exciting promise for diagnosing complex diseases, it remains a research field rather than a clinical routine because of limitations in its diagnostic accuracy and data reproducibility. Compared with the effort devoted to improving data reproducibility, relatively little work has addressed high-performance proteomic pattern classification.
In this study, we present a novel machine learning approach to achieve clinical-level disease diagnosis from mass spectral data. Following our local and global feature selection framework, we propose multi-resolution independent component analysis, a novel feature selection algorithm that tackles the high dimensionality of mass spectra. We also develop high-performance classifiers by embedding multi-resolution independent component analysis in linear discriminant analysis and support vector machines.
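The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a multi-resolution step of this kind might look like, under stated assumptions: a Haar wavelet transform splits a spectrum into a coarse approximation (global features) and per-level detail coefficients (local features), and the finest detail levels are zeroed before reconstruction. The choice of the Haar wavelet, the zeroing rule, and the function names `haar_decompose`, `haar_reconstruct`, and `suppress_fine_details` are illustrative assumptions, not the authors' implementation. The independent component analysis and classifier stages are omitted.

```python
import math


def haar_decompose(signal, levels):
    """Orthonormal Haar transform.

    Returns (approximation, details), where details[0] holds the finest-scale
    coefficients and details[-1] the coarsest.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2 != 0:
            raise ValueError("signal length must be divisible by 2**levels")
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose, rebuilding from the coarsest level upward."""
    current = list(approx)
    for detail in reversed(details):
        restored = []
        for s, d in zip(current, detail):
            restored.append((s + d) / math.sqrt(2))
            restored.append((s - d) / math.sqrt(2))
        current = restored
    return current


def suppress_fine_details(signal, levels, cutoff):
    """Zero the `cutoff` finest detail levels (an assumed rule), then reconstruct."""
    approx, details = haar_decompose(signal, levels)
    filtered = [
        [0.0] * len(d) if level < cutoff else d
        for level, d in enumerate(details)
    ]
    return haar_reconstruct(approx, filtered)


if __name__ == "__main__":
    # Toy intensities standing in for one mass spectrum (hypothetical values).
    spectrum = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
    print(suppress_fine_details(spectrum, levels=2, cutoff=1))
```

In a full pipeline along the lines the abstract describes, each spectrum would be filtered this way, the filtered spectra would be passed to an independent component analysis routine, and the resulting components would feed a linear discriminant analysis or support vector machine classifier.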
Our support vector machines based on multi-resolution independent component analysis not only achieve clinical-level classification accuracy but also overcome the weakness of traditional peak-selection-based biomarker discovery. In addition to rigorous theoretical analysis, we demonstrate our method's superiority by comparing it with nine state-of-the-art classification and regression algorithms on six heterogeneous mass spectral profiles.
Our work not only suggests an alternative, machine-learning-driven direction for moving mass spectral proteomic technologies into clinical routine by treating an input profile as a 'profile-biomarker', but also has positive implications for large-scale 'omics' data mining. Related source code and data sets can be found at: https://sites.google.com/site/heyaumbioinformatics/home/proteomics.