Yu Jiangsheng, Chen Xue-Wen
School of Electronics Engineering and Computer Science, Peking University, China.
Bioinformatics. 2005 Jun;21 Suppl 1:i487-94. doi: 10.1093/bioinformatics/bti1030.
The classification of high-dimensional data is a persistent challenge in statistical machine learning. We propose a novel method, shallow feature selection, that assigns each feature a probability of being selected based on the structure of the training data itself. Because the method is independent of any particular classifier, the high dimensionality of biological data can be quickly reduced to a size suitable for subsequent processing. Moreover, to improve both the efficiency and the performance of classification, these prior probabilities are further used to specify the distributions of top-level hyperparameters in hierarchical models of Bayesian neural networks (BNNs), as well as the parameters in Gaussian process models.
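The idea of classifier-independent selection probabilities can be illustrated with a small sketch. The abstract does not state how the probabilities are computed, so the score below (the absolute Welch t-statistic between the two classes, normalized to sum to one) is purely an assumed stand-in and not the authors' method; the function names `selection_probabilities` and `top_features` are hypothetical.

```python
import math

def selection_probabilities(X, y):
    """Assign each feature a selection probability from the training data alone.

    ASSUMPTION: the score is |Welch t-statistic| between class 1 and class 0,
    normalized across features. This is an illustrative stand-in only.
    X is a list of rows (lists of floats); y holds labels 0 or 1.
    Each class needs at least two samples.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, t in zip(X, y) if t == 1]
        b = [row[j] for row, t in zip(X, y) if t == 0]
        mean_a = sum(a) / len(a)
        mean_b = sum(b) / len(b)
        var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
        var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
        se = math.sqrt(var_a / len(a) + var_b / len(b))
        scores.append(abs(mean_a - mean_b) / se if se > 0 else 0.0)
    total = sum(scores)
    if total == 0:
        # No feature separates the classes: fall back to a uniform distribution.
        return [1.0 / n_features] * n_features
    return [s / total for s in scores]

def top_features(probs, m):
    """Return the indices of the m features with the highest probability."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return order[:m]
```

Because the probabilities depend only on the training data, the reduced feature set can be passed to any downstream classifier; in the paper's setting they additionally inform the hyperparameter priors, which this sketch does not attempt to model.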
Three BNN approaches were derived and applied to identify ovarian cancer from NCI's high-resolution mass spectrometry data, yielding excellent performance in 1000 independent k-fold cross validations (k = 2,...,10). For instance, an average sensitivity of 98.56% and an average specificity of 98.42% were achieved in the 2-fold cross validations. Furthermore, only one control and one cancer sample were misclassified in the leave-one-out cross validation. Several other popular classifiers were also tested for comparison.
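The evaluation protocol can be sketched as follows: split the samples into k folds, train on k-1 of them, predict the held-out fold, and compute sensitivity and specificity from the pooled predictions. This is a minimal sketch under stated assumptions, not the authors' code. The midpoint-threshold classifier is a hypothetical placeholder for a BNN, labels are assumed to be 1 for cancer and 0 for control, and all function names are illustrative.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

def cross_validate(X, y, k, fit, predict, seed=0):
    """Run one k-fold cross validation and score the pooled predictions."""
    folds = k_fold_indices(len(y), k, random.Random(seed))
    y_pred = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_pred[i] = predict(model, X[i])
    return sensitivity_specificity(y, y_pred)

# Placeholder classifier (NOT a BNN): threshold the first feature at the
# midpoint of the two class means, assuming cancer samples score higher.
# Both classes must be present in the training fold.
def fit_threshold(X_train, y_train):
    pos = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    neg = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0
```

Repeating `cross_validate` with different seeds and averaging the two indices mirrors the "independent k-fold cross validations" described above; setting k equal to the number of samples gives leave-one-out cross validation.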
The programs were implemented in MATLAB, R and Neal's fbm.2004-11-10.