Conrad Tim O F, Genzel Martin, Cvetkovic Nada, Wulkow Niklas, Leichtle Alexander, Vybiral Jan, Kutyniok Gitta, Schütte Christof
Department of Mathematics, Freie Universität Berlin, Arnimallee 6, Berlin, Germany.
Zuse Institute Berlin, Takustr. 7, Berlin, Germany.
BMC Bioinformatics. 2017 Mar 9;18(1):160. doi: 10.1186/s12859-017-1565-4.
High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets. In a clinical setting one is often interested in how mass spectra differ between patients of different classes, for example spectra from healthy patients vs. spectra from patients having a particular disease. Machine learning algorithms are needed to (a) identify these discriminating features and (b) classify unknown spectra based on this feature set. Since the acquired data is usually noisy, the algorithms should be robust against noise and outliers, while the identified feature set should be as small as possible.
We present a new algorithm, Sparse Proteomics Analysis (SPA), based on the theory of compressed sensing that allows us to identify a minimal discriminating set of features from mass spectrometry data-sets. We show (1) how our method performs on artificial and real-world data-sets, (2) that its performance is competitive with standard (and widely used) algorithms for analyzing proteomics data, and (3) that it is robust against random and systematic noise. We further demonstrate the applicability of our algorithm to two previously published clinical data-sets.
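As an illustration of what selecting a small discriminating feature set can look like, the following is a minimal sketch and not the SPA implementation from the paper. It assumes a simple correlate-and-threshold scheme in the spirit of 1-bit compressed sensing: standardize each feature, average the label-weighted samples, and keep the s largest entries. The function names (`select_sparse_features`, `classify`), the sparsity level `s`, and the synthetic data are all hypothetical.

```python
import numpy as np

def select_sparse_features(X, y, s):
    """Pick the s features whose standardized values correlate most with y.

    X: (n_samples, n_features) matrix of intensities (hypothetical input).
    y: labels in {-1, +1}.
    Returns the sorted support, the weight vector, and the scaling used.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero for flat features
    Z = (X - mu) / sigma             # standardize every feature
    w = (y[:, None] * Z).mean(axis=0)  # label-weighted average of the samples
    support = np.sort(np.argsort(-np.abs(w))[:s])  # hard threshold: keep s largest
    return support, w, mu, sigma

def classify(X, support, w, mu, sigma):
    """Label new samples by the sign of a linear score on the selected features."""
    Z = (X - mu) / sigma
    return np.sign(Z[:, support] @ w[support])

# Synthetic demo: 500 noisy features, only three of which carry class signal.
rng = np.random.default_rng(0)
n, d = 200, 500
informative = [10, 200, 400]
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, informative] += 1.5 * y[:, None]  # shift informative features by class

support, w, mu, sigma = select_sparse_features(X, y, s=3)
accuracy = np.mean(classify(X, support, w, mu, sigma) == y)
print("selected features:", support.tolist())
print("training accuracy:", accuracy)
```

In this toy setting the selected indices coincide with the planted ones because the class shift is large relative to the noise. Real spectra would need preprocessing and a principled choice of `s`, which this sketch does not address.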