CIFASIS-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina.
BMC Bioinformatics. 2011 Feb 22;12:59. doi: 10.1186/1471-2105-12-59.
Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes increases. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates of the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead either to computationally demanding explorations of a search space with thousands of dimensions or to classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained.
A novel bound is presented on the maximum number of genes that can be handled by the binary classifiers in binary-mediated multiclass classification algorithms for microarray data samples. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary-mediated multiclass classifiers for microarray data samples.
Comprehensive experiments show that the bound is indeed useful for inducing accurate and sparse multiclass classifiers for microarray data samples.