A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics.

Authors

Liao Longlong, Li Kenli, Li Keqin, Yang Canqun, Tian Qi

Affiliations

College of Computer, National University of Defense Technology, Sanyi Road, Changsha, China.

State Key Laboratory of High Performance Computing, Sanyi Road, Changsha, China.

Publication information

BMC Syst Biol. 2018 Nov 22;12(Suppl 6):111. doi: 10.1186/s12918-018-0630-6.

DOI:10.1186/s12918-018-0630-6

PMID:30463619

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6249732/

Abstract

BACKGROUND

Although many bioinformatics datasets are available for clustering, a large share of them are incomplete: some data samples lack attribute values that clustering algorithms require. Many clustering algorithms have been proposed in recent years, but most can only cluster complete datasets. In addition, conventional clustering algorithms cannot achieve a good trade-off between the accuracy and the efficiency of the clustering process, because many essential parameters are set according to the user's experience.

RESULTS

This paper proposes MKDCI, a Multiple Kernel Density Clustering algorithm for Incomplete datasets. MKDCI consists of six steps: recovering the missing attribute values of the input data samples, learning an optimally combined kernel for clustering the input dataset, reducing dimensionality with the optimal kernel built from multiple basis kernels, detecting cluster centroids with the Isolation Forests method, assigning clusters of arbitrary shape, and visualizing the results.
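To make the pipeline above more concrete, the following is a minimal sketch of three of its steps (recovering missing values, combining basis kernels, and density-based cluster assignment). It is not the authors' implementation, and every function name in it is hypothetical. It rests on several loud simplifications: missing values are filled with column means rather than the paper's recovery method, the kernel weights are fixed by hand rather than learned, dimensionality reduction and visualization are skipped, and cluster centers are chosen with a simple density-peaks "top score" rule in place of Isolation Forests.

```python
import math

def impute_column_means(rows):
    """Fill each missing value (None) with the mean of its column's observed values.
    Assumption: a stand-in for the paper's recovery step, not the MKDCI method."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def combined_kernel(rows, gammas, weights):
    """Weighted sum of RBF basis kernels, one per bandwidth in `gammas`.
    Assumption: weights are given by hand; MKDCI learns them instead."""
    def k(a, b):
        sq = sum((x - y) ** 2 for x, y in zip(a, b))
        return sum(w * math.exp(-g * sq) for g, w in zip(gammas, weights))
    return [[k(a, b) for b in rows] for a in rows]

def density_peak_labels(K, n_clusters):
    """Density-peaks style assignment on a kernel matrix K.
    Assumption: centers are the n_clusters points with the largest
    density * separation score, replacing the Isolation Forests step."""
    n = len(K)
    rho = [sum(row) / n for row in K]  # kernel density estimate per sample
    # Distance induced by the kernel: d(i, j)^2 = K_ii + K_jj - 2 K_ij
    dist = [[math.sqrt(max(0.0, K[i][i] + K[j][j] - 2 * K[i][j]))
             for j in range(n)] for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])  # densest first
    delta = [0.0] * n   # distance to the nearest denser sample
    parent = [-1] * n   # index of that nearest denser sample
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        parent[i] = min(order[:rank], key=lambda j: dist[i][j])
        delta[i] = dist[i][parent[i]]
    # Stable sort keeps the densest sample first, so it is always a center.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:  # each sample inherits the label of its denser parent
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

# Two well-separated groups, each with one incomplete sample.
samples = [
    [0.0, 0.1], [0.2, None], [0.1, 0.0], [0.0, 0.2],
    [5.0, 5.1], [5.2, 5.0], [None, 5.2], [5.1, 4.9],
]
filled = impute_column_means(samples)
K = combined_kernel(filled, gammas=[0.1, 1.0], weights=[0.5, 0.5])
labels = density_peak_labels(K, n_clusters=2)
print(labels)
```

Because each non-center sample follows its nearest denser neighbor, the resulting clusters are not forced to be spherical, which is the property the abstract refers to as clusters of arbitrary shape. Note that `n_clusters` is supplied by hand here; avoiding that kind of user-set parameter is exactly what the paper's automatic centroid detection is meant to achieve.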

CONCLUSIONS

Extensive experiments on several well-known bioinformatics clustering datasets demonstrate the effectiveness of the proposed MKDCI algorithm. Compared with existing density clustering algorithms and parameter-free clustering algorithms, MKDCI tends to automatically produce better-quality clusters on incomplete bioinformatics datasets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b458/6249732/c42ce66fb28d/12918_2018_630_Fig1_HTML.jpg
