Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
Health Inf Sci Syst. 2014 Oct 16;2:7. doi: 10.1186/2047-2501-2-7. eCollection 2014.
Feature selection techniques have become an apparent need in biomarker discovery with the development of microarray technology. However, the high-dimensional nature of microarray data makes feature selection time-consuming. To overcome this difficulty, filtering data according to background knowledge before applying feature selection techniques has become a hot topic in microarray analysis. Different methods may affect the final results greatly, so it is important to evaluate these pre-filter methods systematically.
In this paper, we compared the performance of statistical-based pre-filter methods, biological-based pre-filter methods, and their combination on microRNA-mRNA parallel expression profiles, using L1 logistic regression as the feature selection technique. Four types of data were built for both microRNA and mRNA expression profiles.
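As a rough illustration of the pipeline described above (a statistical pre-filter followed by L1 logistic regression), the following is a minimal, dependency-free sketch rather than the implementation used in the study. Several pieces are assumptions made for illustration only: the pre-filter ranks features by the absolute Welch t-statistic and keeps a fixed `top_k`, the L1-penalised model is fitted by plain proximal gradient descent, the data are synthetic, and all function names and parameter values (`prefilter`, `l1_logistic`, `lam`, `lr`) are hypothetical.

```python
import math
import random


def welch_t(a, b):
    """Welch's t-statistic for two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (ma - mb) / se if se > 0 else 0.0


def prefilter(X, y, top_k):
    """Assumed statistical pre-filter: keep the top_k features ranked by |t|."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:top_k])


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w towards zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def l1_logistic(X, y, lam=0.05, lr=0.1, n_iter=500):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        b -= lr * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Synthetic data (not from the paper): 40 samples, 200 features,
# of which only features 0 and 1 differ between the two classes.
random.seed(0)
n_samples, n_features = 40, 200
y = [i % 2 for i in range(n_samples)]
X = [[random.gauss(0, 1) + (2.0 * label if j < 2 else 0.0)
      for j in range(n_features)] for label in y]

# Step 1: the pre-filter shrinks the feature space before model fitting.
kept = prefilter(X, y, top_k=20)
X_small = [[row[j] for j in kept] for row in X]

# Step 2: L1 logistic regression selects features with non-zero weights.
w, b = l1_logistic(X_small, y)
selected = [j for j, wj in zip(kept, w) if wj != 0.0]
print("kept after pre-filter:", len(kept), "selected by L1:", selected)
```

Because the fitting loop costs time proportional to the number of features per iteration, reducing 200 features to 20 before fitting is what shortens the computation in this sketch; a biological pre-filter would replace `prefilter` with a lookup against prior knowledge.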
Results showed that pre-filter methods could greatly reduce the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filter procedures were shown to be significant at biological levels such as biological processes and microRNA functions. Analyses of classification performance based on precision showed that pre-filter methods were necessary when the number of raw features was much larger than the number of samples. In all cases, computing time was greatly shortened after pre-filter procedures.
With similar or better classification performance and fewer but biologically significant features, pre-filter-based feature selection should be taken into consideration when researchers need fast results for complex computing problems in bioinformatics.