School of Computer Science & Technology, Dalian University of Technology, No. 2 Linggong Road, Dalian, Liaoning Province 116024, P. R. China.
J Bioinform Comput Biol. 2024 Feb;22(1):2450002. doi: 10.1142/S0219720024500021. Epub 2024 Mar 25.
Identifying valuable features from complex omics data is of great significance for the study of disease diagnosis. This paper proposes a new feature selection algorithm based on sample network (FS-SN) to mine important information from omics data. A sample network is constructed from the neighbor relationships among samples at the molecular (feature) expression level, and the discriminating ability of a feature is evaluated from the topology of that network. A sample network built on a feature with strong discriminating ability tends to have many edges between samples of the same group and few edges between samples of different groups. FS-SN also removes redundant features according to the gravitational interaction between features. To validate FS-SN, it was compared on ten public datasets with ERGS, mRMR, ReliefF, ATSD-DN, and INDEED, all of which are efficient methods for omics data analysis. Experimental results show that FS-SN outperformed the compared methods in accuracy, sensitivity, and specificity in most cases. Hence, by making use of the topology of the sample network, FS-SN is effective for analyzing omics data: it can identify key features that reflect the occurrence and development of diseases and reveal the underlying biological mechanisms.
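The idea of scoring a feature by the topology of its sample network can be illustrated with a minimal sketch. The abstract does not specify how neighbors are defined or how the topology is turned into a score, so everything concrete here is an assumption made for illustration: samples are linked to their `k` nearest neighbors by absolute difference in expression on a single feature, and the score is simply the fraction of edges that join samples of the same group. The function names `build_sample_network` and `within_group_edge_fraction` are hypothetical, and the redundancy-removal step based on gravitational interaction is not covered.

```python
from typing import Sequence, Set, Tuple


def build_sample_network(values: Sequence[float], k: int) -> Set[Tuple[int, int]]:
    """Build an undirected sample network for one feature.

    Assumption (not from the paper): each sample is linked to its k nearest
    neighbors, measured by absolute difference in expression on this feature.
    """
    n = len(values)
    edges: Set[Tuple[int, int]] = set()
    for i in range(n):
        # Rank the other samples by closeness to sample i; ties break by index.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (abs(values[i] - values[j]), j),
        )
        for j in others[:k]:
            # Store each undirected edge once, as (smaller index, larger index).
            edges.add((min(i, j), max(i, j)))
    return edges


def within_group_edge_fraction(
    values: Sequence[float], labels: Sequence[int], k: int = 2
) -> float:
    """Illustrative score: share of edges that connect same-group samples.

    A feature with strong discriminating ability should yield a network with
    many same-group edges and few different-group edges, so a higher value
    suggests a more discriminating feature.
    """
    edges = build_sample_network(values, k)
    if not edges:
        return 0.0
    same_group = sum(1 for i, j in edges if labels[i] == labels[j])
    return same_group / len(edges)


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    # Toy data: the first feature separates the groups, the second does not.
    separating = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3]
    mixed = [0.1, 5.1, 0.2, 5.2, 0.3, 5.3]
    print(within_group_edge_fraction(separating, labels))
    print(within_group_edge_fraction(mixed, labels))
```

On the toy data, the separating feature scores higher than the mixed one because its network contains only same-group edges. Ranking features by such a score is one plausible reading of the evaluation step; the published method may weight or normalize the topology differently.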