Tissue Repair and Regeneration Program, Institute of Health and Biomedical Innovation, Queensland University of Technology, Kelvin Grove, Queensland, Australia.
PLoS One. 2011;6(9):e24973. doi: 10.1371/journal.pone.0024973. Epub 2011 Sep 28.
The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for extracting information from multiple biological variables is the set of so-called "omics" disciplines. Such variability is uncovered by multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistically based approaches. Typically, proteomic studies produce hundreds or thousands of variables, p, for each of n observations, depending on the analytical platform or method used to generate the data. Many classification methods cannot be applied directly when n≪p and therefore require pre-treatment to reduce the dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model that allows meaningful interpretation of results in terms of the features used for classification. This problem might be solved using a statistical model-based approach, in which the importance of each individual protein is explicit and the proteins are combined into a readily interpretable classification rule without relying on a black box approach. Here we apply the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.
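To make the idea of dimension reduction followed by an interpretable classification rule concrete, the following is a minimal sketch of a one-component PLS classifier on simulated data with n≪p. It is an illustration only and not the authors' pipeline: the data are randomly generated, the function names (`simulate`, `fit_pls1`, `predict`) and all parameter values are hypothetical, and a single latent component with a zero threshold stands in for the fuller PLS, PCA and SVM workflows compared in the study.

```python
import random


def simulate(n_per_class=10, p=200, n_informative=5, shift=2.0, seed=0):
    """Toy data: n = 2 * n_per_class samples, p variables, labels -1 / +1.

    Only the first n_informative variables differ between the two classes.
    All values here are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (-1.0, 1.0):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            for j in range(n_informative):
                row[j] += label * shift
            X.append(row)
            y.append(label)
    return X, y


def fit_pls1(X, y):
    """Fit a one-component PLS1 model and return (x_means, y_mean, w, q)."""
    n = len(X)
    x_means = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance between X and the label.
    w = [sum(xi * yi for xi, yi in zip(col, yc)) for col in zip(*Xc)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Latent score (one number per sample), then regress the label on it.
    t = [sum(xi * wi for xi, wi in zip(row, w)) for row in Xc]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_means, y_mean, w, q


def predict(model, X):
    """Classify each row by the sign of the fitted PLS regression value."""
    x_means, y_mean, w, q = model
    labels = []
    for row in X:
        score = sum((v - m) * wi for v, m, wi in zip(row, x_means, w))
        labels.append(1.0 if y_mean + q * score >= 0.0 else -1.0)
    return labels


if __name__ == "__main__":
    X_train, y_train = simulate(seed=0)
    X_test, y_test = simulate(seed=1)
    model = fit_pls1(X_train, y_train)
    predicted = predict(model, X_test)
    accuracy = sum(a == b for a, b in zip(predicted, y_test)) / len(y_test)
    weights = model[2]
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    print("test accuracy:", accuracy)
    print("variables with largest |weight|:", ranked[:5])
```

The weight vector `w` is what makes the rule readable: each variable's contribution to the latent score is explicit, so the variables with the largest absolute weights can be inspected directly, which is the kind of interpretability the abstract contrasts with black box classifiers.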