Xie Juanying, Wang Mingzhao, Xu Shengquan, Huang Zhao, Grant Philip W
School of Computer Science, Shaanxi Normal University, Xi'an, China.
College of Life Sciences, Shaanxi Normal University, Xi'an, China.
Front Genet. 2021 May 13;12:684100. doi: 10.3389/fgene.2021.684100. eCollection 2021.
Genomic data are difficult to analyze because they have tens of thousands of dimensions but only a small number of examples, and those examples are unbalanced between classes. To tackle these challenges, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity, referred to as SCFS (Standard deviation and Cosine similarity based Feature Selection). SCFS defines the discernibility of a feature to measure its capability to distinguish between classes, and the independence of a feature to measure its redundancy with respect to other features. All features are represented in a 2-dimensional space with discernibility as the x-axis and independence as the y-axis, so that features in the upper right corner have both comparatively high discernibility and high independence. The importance of a feature is defined as the product of its discernibility and its independence (i.e., the area of the rectangle enclosed by the feature's coordinate lines and the axes). The upper right corner features are by far the most important and comprise the optimal feature subset. Three feature selection algorithms are derived from SCFS, each using a different cosine-similarity-based definition of independence: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these algorithms. Experimental results on 18 cancer genomic datasets demonstrate that the proposed unsupervised feature selection algorithms SCEFS, SCRFS and SCAFS can detect stable biomarkers with strong classification capability, which supports the effectiveness of the proposed idea. The functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.
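The scoring idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the exact formulas, so the three similarity-to-independence mappings (`exp(-|cos|)`, `1/|cos|`, `arccos(|cos|)`), the rule of comparing each feature only with features of higher discernibility, and the names `scfs_scores` and `select_features` are all assumptions made here for illustration.

```python
import numpy as np


def scfs_scores(X, variant="exp"):
    """Score features of X (n_samples x n_features, at least 2 features).

    Returns (discernibility, independence, importance). The formulas are
    assumptions for illustration, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    if n_features < 2:
        raise ValueError("need at least two features")

    # Discernibility: standard deviation of each feature (column).
    dis = X.std(axis=0)

    # Absolute cosine similarity between every pair of feature columns.
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero features
    U = X / norms
    cos = np.clip(np.abs(U.T @ U), 0.0, 1.0)

    # Assumed mappings from similarity to a dissimilarity-like quantity.
    if variant == "exp":            # stands in for SCEFS
        d = np.exp(-cos)
    elif variant == "reciprocal":   # stands in for SCRFS
        d = 1.0 / (cos + 1e-12)
    elif variant == "arccos":       # stands in for SCAFS
        d = np.arccos(cos)
    else:
        raise ValueError(f"unknown variant: {variant}")

    # Assumed independence rule: a feature is independent if it is
    # dissimilar to every feature with higher discernibility; the top
    # feature takes its largest dissimilarity to any other feature.
    order = np.argsort(-dis, kind="stable")
    ind = np.empty(n_features)
    for rank, i in enumerate(order):
        if rank == 0:
            others = np.delete(np.arange(n_features), i)
            ind[i] = d[i, others].max()
        else:
            ind[i] = d[i, order[:rank]].min()

    # Importance: area of the rectangle in the 2-D feature space.
    return dis, ind, dis * ind


def select_features(X, k, variant="exp"):
    """Return the indices of the k features with the largest importance."""
    _, _, score = scfs_scores(X, variant)
    return np.argsort(-score, kind="stable")[:k]
```

With a samples-by-genes expression matrix `X`, a call such as `select_features(X, 50, variant="arccos")` would return 50 candidate feature indices, which could then be passed to a KNN or SVM classifier as described above. Under these assumptions, a feature that nearly duplicates a higher-variance feature receives low independence and therefore low importance, even if its own standard deviation is large.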