Department of Computer Engineering, Ajou University, Suwon, 16499, South Korea.
Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, TX, 77030, USA.
BMC Med Genomics. 2019 Jul 11;12(Suppl 5):95. doi: 10.1186/s12920-019-0512-9.
Feature selection and feature scoring methods are essential for biomarker detection in bioinformatics. Various feature selection methods have been developed for this purpose, and several studies have employed information-theoretic approaches. However, most of these methods require a long processing time. In addition, information-theoretic methods discretize continuous features, which can lead to a loss of information.
In this paper, we propose a novel supervised feature scoring method named ClearF. The method is suitable for continuous-valued data and follows a principle similar to that of feature selection using mutual information, with the added advantage of a reduced computation time. The score calculation is motivated by the association between the reconstruction error and the information-theoretic measurement. Our method is based on class-wise low-dimensional embedding and the resulting reconstruction error. Given a multi-class dataset, such as a case-control study dataset, low-dimensional embedding is first applied to each class to obtain a compressed representation of that class, and also to the entire dataset. Reconstruction is then performed to calculate the error of each feature, and the final score for each feature is defined in terms of these reconstruction errors. The correlation between the information-theoretic measurement and the proposed method is demonstrated using a simulation. For performance validation, we compared the classification performance of the proposed method with that of various algorithms on benchmark datasets.
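The class-wise embedding and reconstruction steps described above can be sketched roughly as follows. This is a minimal illustration and not the published ClearF implementation. It assumes PCA (computed via SVD) as the low-dimensional embedding, and it assumes the score of a feature is its reconstruction error on the whole dataset minus the sum of its class-wise reconstruction errors, by analogy with mutual information I(X;Y) = H(X) - H(X|Y). The abstract does not give the exact formula, and the function names and the `n_components` parameter are hypothetical.

```python
import numpy as np


def reconstruction_error(X, n_components):
    """Per-feature sum of squared errors after a PCA round trip.

    X is centered, projected onto its top `n_components` principal
    directions, and mapped back to the original feature space.
    """
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # shape: (n_features, n_components)
    X_hat = Xc @ V @ V.T                 # low-rank reconstruction
    return ((Xc - X_hat) ** 2).sum(axis=0)


def classwise_reconstruction_scores(X, y, n_components=1):
    """Assumed score: whole-data error minus summed class-wise errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    total = reconstruction_error(X, n_components)
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        # Embed and reconstruct each class separately.
        within += reconstruction_error(X[y == label], n_components)
    return total - within
```

Under these assumptions, a feature whose values are spread out across the whole dataset but tightly clustered within each class receives a high score, while a feature that looks the same with or without the class labels scores near zero. No discretization of the continuous values is needed at any step.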
The proposed method showed higher accuracy and a shorter execution time than other established methods. Moreover, an experiment conducted on the TCGA breast cancer dataset confirmed that the genes with the highest scores were highly associated with subtypes of breast cancer.