Radovic Milos, Ghalwash Mohamed, Filipovic Nenad, Obradovic Zoran
Center for Data Analytics and Biomedical Informatics, College of Science and Technology, Temple University, North 12th Street, Philadelphia, 19122, PA, USA.
Bioengineering Research and Development Center - BioIRC, Prvoslava Stojanovica 6, Kragujevac, 34000, Serbia.
BMC Bioinformatics. 2017 Jan 3;18(1):9. doi: 10.1186/s12859-016-1423-9.
Feature selection, which aims to identify the subset of features relevant for predicting a response from a possibly large set of features, is an important preprocessing step in machine learning. In gene expression studies this is not a trivial task for several reasons, including the potentially temporal nature of the data. Most feature selection approaches developed for microarray data cannot handle multivariate temporal data unless the data are first flattened, which discards temporal information. We propose a temporal minimum redundancy - maximum relevance (TMRMR) feature selection approach that handles multivariate temporal data without flattening. In the proposed approach, the relevance of a gene is computed by averaging the F-statistic values calculated at individual time steps, and the redundancy between genes is computed using a dynamic time warping approach.
The proposed method is evaluated on three temporal gene expression datasets from human viral challenge studies. The results show that it outperforms alternatives widely used in gene expression studies. In particular, the proposed method improved accuracy in 34 out of 54 experiments, while the other methods outperformed it in no more than 4 experiments.
We developed a filter-based feature selection method for temporal gene expression data based on the maximum relevance and minimum redundancy criteria. The proposed method incorporates temporal information by combining relevance, calculated as the average F-statistic value across time steps, with redundancy, calculated using a dynamic time warping approach. As our experiments show, incorporating temporal information into the feature selection process leads to the selection of more discriminative features.
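The relevance and redundancy terms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the two terms are combined, so the greedy "relevance minus mean redundancy" criterion, the conversion of a dynamic time warping distance into a similarity via 1 / (1 + d), the averaging of that similarity over samples, and all function names and the `X[sample][gene][time]` layout are assumptions made for illustration only.

```python
from collections import defaultdict


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one variable across the classes in labels."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    between = sum(len(g) * (means[c] - grand_mean) ** 2 for c, g in groups.items())
    within = sum((v - means[c]) ** 2 for c, g in groups.items() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def temporal_relevance(X, y, gene):
    """Relevance of a gene: mean of the F-statistics computed at each time step."""
    n_steps = len(X[0][gene])
    scores = [f_statistic([s[gene][t] for s in X], y) for t in range(n_steps)]
    return sum(scores) / n_steps


def dtw_distance(a, b):
    """Dynamic time warping distance with an absolute-difference local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, v in enumerate(b, start=1):
            curr[j] = abs(x - v) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def redundancy(X, i, j):
    """Assumed redundancy: mean over samples of 1 / (1 + DTW distance)."""
    sims = [1.0 / (1.0 + dtw_distance(s[i], s[j])) for s in X]
    return sum(sims) / len(sims)


def select_features(X, y, n_select):
    """Greedy selection; the difference criterion used here is an assumption."""
    n_genes = len(X[0])
    relevance = [temporal_relevance(X, y, g) for g in range(n_genes)]
    selected = [max(range(n_genes), key=relevance.__getitem__)]
    while len(selected) < min(n_select, n_genes):
        def score(g):
            mean_red = sum(redundancy(X, g, s) for s in selected) / len(selected)
            return relevance[g] - mean_red
        candidates = [g for g in range(n_genes) if g not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Here `X` is a nested list indexed as `X[sample][gene][time]` and `y` holds one class label per sample; `select_features(X, y, 10)` would return the indices of ten genes in selection order. Because the F-statistic and the assumed similarity live on different scales, a practical version would likely need to normalize one of the two terms or use a quotient criterion instead.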