Wang Zeyuan, Gu Hong, Zhao Minghui, Li Dan, Wang Jia
Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China.
Department of Breast Surgery, Second Hospital of Dalian Medical University, Dalian, Liaoning, China.
Front Genet. 2023 Feb 27;14:1135260. doi: 10.3389/fgene.2023.1135260. eCollection 2023.
Many clustering techniques have been proposed to group genes based on gene expression data. Among these methods, semi-supervised clustering techniques aim to improve clustering performance by incorporating supervisory information in the form of pairwise constraints. However, a constraint set derived from a practical unlabeled dataset inevitably contains noisy constraints, which degrade the performance of semi-supervised clustering. Moreover, existing methods do not integrate multiple information sources into multi-source constraints to improve clustering quality. To address these issues, this study proposes a new multi-objective semi-supervised clustering algorithm based on constraints selection and multi-source constraints (MSC-CSMC) for unlabeled gene expression data. The proposed method first uses the gene expression data and the gene ontology (GO), which describes gene annotation information, to form multi-source constraints. The multi-source constraints are then applied to the clustering by improving the constraint violation penalty weight in the semi-supervised clustering objective function. Furthermore, constraint selection and cluster prototypes are encoded together in a multi-objective evolutionary framework using a mixed chromosome encoding strategy, so that pairwise constraints suitable for the clustering task are selected through synergistic optimization, reducing the negative influence of noisy constraints. The proposed MSC-CSMC algorithm is evaluated on five benchmark gene expression datasets, and the results show that it achieves superior performance.
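The interplay between weighted pairwise constraints and a constraint selection mask can be illustrated with a minimal sketch. This is not the MSC-CSMC implementation: the abstract does not give the objective functions, so the function names (`assign`, `evaluate`), the must-link/cannot-link labels `"ML"`/`"CL"`, the per-constraint weights, and the choice of compactness and violation penalty as the two quantities are all assumptions made for illustration. The weight attached to each constraint stands in for a confidence value that might be derived from expression data and GO annotations together.

```python
from math import dist


def assign(points, prototypes):
    """Return, for each point, the index of its nearest prototype."""
    return [
        min(range(len(prototypes)), key=lambda k: dist(p, prototypes[k]))
        for p in points
    ]


def evaluate(points, prototypes, constraints, selected):
    """Score one candidate solution (hypothetical mixed encoding).

    prototypes  : real-valued part of the chromosome (cluster centres)
    selected    : binary part of the chromosome, one 0/1 flag per constraint
    constraints : list of (i, j, kind, weight); kind is "ML" (must-link)
                  or "CL" (cannot-link); weight is an assumed confidence

    Returns (compactness, penalty): the within-cluster squared distance
    and the total weight of selected constraints that are violated.
    """
    labels = assign(points, prototypes)
    compactness = sum(
        dist(p, prototypes[label]) ** 2 for p, label in zip(points, labels)
    )
    penalty = 0.0
    for (i, j, kind, weight), keep in zip(constraints, selected):
        if not keep:
            continue  # a deselected constraint cannot be violated
        same_cluster = labels[i] == labels[j]
        violated = (kind == "ML" and not same_cluster) or (
            kind == "CL" and same_cluster
        )
        if violated:
            penalty += weight
    return compactness, penalty


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    prototypes = [(0.0, 0.5), (5.0, 5.5)]
    constraints = [
        (0, 1, "ML", 1.0),  # consistent with the cluster structure
        (0, 2, "ML", 0.8),  # noisy: links genes from different clusters
        (1, 3, "CL", 0.5),  # consistent with the cluster structure
    ]
    print(evaluate(points, prototypes, constraints, [1, 1, 1]))
    print(evaluate(points, prototypes, constraints, [1, 0, 1]))
```

In this toy setting, switching off the second (noisy) constraint removes its penalty without changing the prototypes, which is the kind of trade-off an evolutionary search over both parts of the encoding could exploit.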