Control and Computer Engineering Department, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy.
BMC Bioinformatics. 2011;12(Suppl 13):S3. doi: 10.1186/1471-2105-12-S13-S3. Epub 2011 Nov 30.
The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples by assigning them to one of a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear, and one must be able to reject samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study because they suffer from the curse of dimensionality, which undermines the reliability of both traditional rejection models and more recent approaches such as one-class classifiers.
This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remainder of this paper). The main contribution of the proposed rules is their simplicity, which enables easy integration with available data analysis environments. Because tuning the parameters involved in defining a rejection model is often a complex and delicate task, we exploit an evolutionary strategy to automate this process. This allows the end user to maximize the rejection accuracy with minimal manual intervention.
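The abstract does not spell out the decision rules themselves, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes a rule that rejects a sample when the classifier's highest class probability falls below a threshold, and it tunes that threshold with a basic (1+1) evolution strategy, used here as a stand-in for the evolutionary strategy mentioned above. Python is used for illustration instead of R, and all function names, parameters, and the toy data are hypothetical.

```python
import random


def classify_with_reject(probs, threshold):
    """Return the index of the most probable class, or None to reject.

    Assumed rule (not taken from the paper): reject when the highest
    class probability is below `threshold`.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def rejection_accuracy(samples, threshold):
    """Fraction of samples that are either classified correctly or,
    for unknown-class samples (true label None), correctly rejected."""
    correct = sum(classify_with_reject(p, threshold) == y for p, y in samples)
    return correct / len(samples)


def tune_threshold(samples, start=0.0, generations=200, sigma=0.1, seed=0):
    """Tune the threshold with a (1+1) evolution strategy.

    Each generation mutates the current threshold with Gaussian noise,
    clips it to [0, 1], and keeps the child if it is at least as fit.
    """
    rng = random.Random(seed)
    parent = start
    parent_fit = rejection_accuracy(samples, parent)
    for _ in range(generations):
        child = min(1.0, max(0.0, parent + rng.gauss(0.0, sigma)))
        child_fit = rejection_accuracy(samples, child)
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit


# Hypothetical toy data: (class probabilities, true label).
# A true label of None marks a sample from a class unseen in training.
samples = [
    ([0.90, 0.05, 0.05], 0),
    ([0.10, 0.85, 0.05], 1),
    ([0.05, 0.15, 0.80], 2),
    ([0.40, 0.35, 0.25], None),
    ([0.34, 0.33, 0.33], None),
    ([0.45, 0.45, 0.10], None),
]

threshold, fitness = tune_threshold(samples)
print(f"tuned threshold: {threshold:.3f}, rejection accuracy: {fitness:.3f}")
```

Because a child is kept only when it is at least as fit as its parent, the fitness never decreases from its starting value. In a real setup the fitness would be estimated on held-out data that includes samples from classes withheld during training.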
This paper shows how simple decision rules can support the use of complex machine learning algorithms in real experimental setups. The proposed approach is almost completely automated and is therefore a good candidate for integration into data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be available.