IEEE J Biomed Health Inform. 2019 Nov;23(6):2670-2676. doi: 10.1109/JBHI.2019.2894374. Epub 2019 Jan 21.
Classifying gene expression profile samples plays a significant role in the prediction and diagnosis of diseases. For sample classification, a robust feature selection algorithm is essential to identify the important genes in high-dimensional gene expression data. This paper combines protein-protein interaction information with a graph mining technique to find a suitable subset of features (genes), which is then used for sample classification. Our contribution to feature selection is three-fold. First, all genes are grouped into clusters based on the integrated information of their gene expression values and protein interactions, using a multi-objective optimization based clustering approach. Second, the confidence scores of the protein interactions are incorporated into a popular graph mining algorithm, the Goldberg algorithm, to find the relevant features. These features are topologically and functionally significant genes, referred to as hub genes. Finally, the hub genes are identified by varying the node degree and are then used for the sample classification task. Different machine learning classifiers are applied for this purpose, and classification performance is measured with several metrics: accuracy, sensitivity, specificity, precision, F-measure, and Matthews correlation coefficient. Comparative analysis against two baselines and several existing approaches demonstrates the efficiency of the proposed approach. Furthermore, the robustness of the identified hub-gene modules is supported by biological significance analysis.
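The six performance metrics named in the abstract can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch under that assumption; the function name `classification_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are illustrative and are not taken from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives, and false negatives. Undefined ratios (zero
    denominator) are reported as 0.0, which is an assumption of this sketch.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    sensitivity = safe_div(tp, tp + fn)   # recall, true positive rate
    specificity = safe_div(tn, tn + fp)   # true negative rate
    precision = safe_div(tp, tp + fp)     # positive predictive value
    # F-measure: harmonic mean of precision and sensitivity.
    f_measure = safe_div(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient, in the range [-1, 1].
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.4f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts symmetrically, so it may stay informative when the two sample classes are imbalanced, which is common in gene expression datasets.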