The Unsupervised Feature Selection Algorithms Based on Standard Deviation and Cosine Similarity for Genomic Data Analysis.

Author information

Xie Juanying, Wang Mingzhao, Xu Shengquan, Huang Zhao, Grant Philip W

Affiliations

School of Computer Science, Shaanxi Normal University, Xi'an, China.

College of Life Sciences, Shaanxi Normal University, Xi'an, China.

Publication information

Front Genet. 2021 May 13;12:684100. doi: 10.3389/fgene.2021.684100. eCollection 2021.

DOI:10.3389/fgene.2021.684100

PMID:34054930

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8155687/

Abstract

Genomic data are hard to analyze because they have tens of thousands of dimensions but only a small number of examples, and the examples are unbalanced between classes. To tackle these challenges, this paper proposes an unsupervised feature selection technique based on standard deviation and cosine similarity, referred to as SCFS (Standard deviation and Cosine similarity based Feature Selection). It defines the discernibility of a feature to measure its capability to distinguish between classes, and the independence of a feature to measure its redundancy with respect to other features. A 2-dimensional space is constructed with discernibility as the x-axis and independence as the y-axis to represent all features, so that features in the upper right corner have both comparatively high discernibility and comparatively high independence. The importance of a feature is defined as the product of its discernibility and its independence (i.e., the area of the rectangle enclosed by the feature's coordinate lines and the axes). The upper right corner features are by far the most important and comprise the optimal feature subset. Three feature selection algorithms are derived from SCFS, differing in how independence is defined from cosine similarity: SCEFS (Standard deviation and Exponent Cosine similarity based Feature Selection), SCRFS (Standard deviation and Reciprocal Cosine similarity based Feature Selection) and SCAFS (Standard deviation and Anti-Cosine similarity based Feature Selection). KNN and SVM classifiers are built on the optimal feature subsets detected by each of these algorithms. Experimental results on 18 cancer genomic datasets demonstrate that the proposed unsupervised feature selection algorithms SCEFS, SCRFS and SCAFS can detect stable biomarkers with strong classification capability, which supports the effectiveness of the proposed idea. Functional analysis of these biomarkers shows that the occurrence of cancer is closely related to the regulation level of the biomarker genes. This finding will benefit cancer pathology research, drug development, early diagnosis, treatment and prevention.
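The scoring idea in the abstract (importance as the product of discernibility and independence) can be illustrated with a minimal NumPy sketch. The abstract does not state the formulas, so everything concrete here is an assumption: discernibility is taken to be the per-feature standard deviation, independence is taken to be the smallest transformed absolute cosine similarity to any feature with higher discernibility (the largest value for the top feature), and the three `variant` options are only loosely modeled on the names SCEFS, SCRFS and SCAFS. The function names `scfs_scores` and `select_top_k` are hypothetical.

```python
import numpy as np


def scfs_scores(X, variant="exp", eps=1e-12):
    """Score the columns (features) of X; a higher score means more important.

    Assumed definitions, not taken from the abstract:
      discernibility = standard deviation of the feature
      independence   = min transformed |cosine| to features with higher
                       discernibility (max over the others for the top one)
    """
    X = np.asarray(X, dtype=float)
    dis = X.std(axis=0)

    # Absolute cosine similarity between every pair of feature columns.
    norms = np.linalg.norm(X, axis=0)
    unit = X / np.maximum(norms, eps)
    cos = np.clip(np.abs(unit.T @ unit), 0.0, 1.0)

    # Turn similarity into dissimilarity; three hypothetical variants.
    if variant == "exp":
        dissim = np.exp(-cos)
    elif variant == "reciprocal":
        dissim = 1.0 / np.maximum(cos, eps)
    elif variant == "arccos":
        dissim = np.arccos(cos)
    else:
        raise ValueError(f"unknown variant: {variant}")

    n = X.shape[1]
    ind = np.empty(n)
    for i in range(n):
        higher = dis > dis[i]
        if higher.any():
            # Redundancy is judged against more discernible features only.
            ind[i] = dissim[i, higher].min()
        else:
            # The most discernible feature gets the largest dissimilarity.
            others = np.arange(n) != i
            ind[i] = dissim[i, others].max() if others.any() else dissim[i, i]

    return dis * ind, dis, ind


def select_top_k(X, k, variant="exp"):
    """Return the indices of the k highest-scoring features, best first."""
    score, _, _ = scfs_scores(X, variant)
    return np.argsort(score)[::-1][:k]
```

Under these assumptions, a feature that is a scaled copy of a more variable feature gets low independence and drops in the ranking, while an uncorrelated feature with moderate variance stays near the top. The selected columns could then be passed to any classifier, such as the KNN and SVM classifiers mentioned in the abstract.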

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d5e9/8155687/1182dc27cffd/fgene-12-684100-g001.jpg
