Buza Krisztian
Brain Imaging Center, Research Center for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary.
Comput Methods Programs Biomed. 2016 Apr;127:105-13. doi: 10.1016/j.cmpb.2016.01.016. Epub 2016 Feb 11.
Classification of gene expression data is the common denominator of various biomedical recognition tasks. However, obtaining class labels for large training samples is often difficult or even impossible. Semi-supervised classification techniques are therefore required, because they can also take advantage of unlabeled data.
Gene expression data is high-dimensional, which gives rise to the phenomena collectively known as the curse of dimensionality. One recently explored aspect of it is the presence of hubs, or hubness for short: a few instances appear unusually often among the nearest neighbors of other instances. Hubness-aware classifiers have therefore been developed recently, such as Naive Hubness Bayesian k-Nearest Neighbor (NHBNN). In this paper, we propose a semi-supervised extension of NHBNN which follows the self-training schema. As one of the core components of self-training is the certainty score, we propose a new hubness-aware certainty score.
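To make the two ideas above concrete, the following is a minimal sketch in plain Python and not the method of the paper or the PyHubs API. It assumes Euclidean distance, uses ordinary k-nearest-neighbor voting in place of NHBNN, and uses the fraction of agreeing neighbors as a placeholder certainty score rather than the proposed hubness-aware score. The function names (`k_nearest`, `k_occurrence`, `self_train`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def k_nearest(points, query, k, exclude=None):
    """Return indices of the k points closest to `query` (Euclidean distance).

    `exclude` is an optional index to skip, e.g. the query point itself.
    """
    order = sorted(
        (i for i in range(len(points)) if i != exclude),
        key=lambda i: math.dist(points[i], query),
    )
    return order[:k]


def k_occurrence(points, k):
    """Count how often each point appears among the k nearest neighbors of the others.

    Points with unusually high counts are the hubs.
    """
    counts = [0] * len(points)
    for i, p in enumerate(points):
        for j in k_nearest(points, p, k, exclude=i):
            counts[j] += 1
    return counts


def self_train(labeled, labels, unlabeled, k=3):
    """Generic self-training loop (assumption: plain kNN voting, not NHBNN).

    In each round, the unlabeled instance with the highest certainty is
    given its predicted label and moved to the labeled set.
    """
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    assigned = {}
    while pool:
        best = None  # (certainty, index in pool, predicted label)
        for idx, x in enumerate(pool):
            neighbors = k_nearest(labeled, x, k)
            votes = Counter(labels[j] for j in neighbors)
            label, n = votes.most_common(1)[0]
            # Placeholder certainty: fraction of neighbors agreeing with the vote.
            certainty = n / len(neighbors)
            if best is None or certainty > best[0]:
                best = (certainty, idx, label)
        _, idx, label = best
        x = pool.pop(idx)
        labeled.append(x)
        labels.append(label)
        assigned[x] = label
    return assigned


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
    print(k_occurrence(data, k=1))
    print(self_train([(0.0, 0.0), (5.0, 5.0)], ["A", "B"],
                     [(0.1, 0.0), (0.0, 0.2), (5.1, 5.0)], k=1))
```

Instances are represented as tuples so that they can serve as dictionary keys. A hubness-aware variant would replace the placeholder certainty with a score derived from the k-occurrence counts; how exactly this is done is not specified in the abstract.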
We performed experiments on publicly available gene expression data. These experiments show that the proposed classifier outperforms its competitors. We investigated the impact of each component (classification algorithm, semi-supervised technique, hubness-aware certainty score) separately and showed that each of them is relevant to the performance of the proposed approach.
Our results imply that our approach may increase classification accuracy and reduce computational costs (i.e., runtime). Based on the promising results presented in the paper, we envision that hubness-aware techniques will be used in various other biomedical machine learning tasks. To accelerate this process, we made an implementation of hubness-aware machine learning techniques publicly available in the PyHubs software package (http://www.biointelligence.hu/pyhubs), which is implemented in Python, one of the most popular programming languages of data science.