
Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences.

Publication information

IEEE Trans Neural Netw Learn Syst. 2017 Dec;28(12):2936-2948. doi: 10.1109/TNNLS.2016.2608354. Epub 2016 Sep 27.

DOI: 10.1109/TNNLS.2016.2608354
PMID: 28114078

Abstract

Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, which mainly focused on attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, in this paper, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which provides the new measure with high-quality clustering results by the deterministic initialization and the elimination of noise clusters using an information theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and the experimental results on some real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.
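The model selection procedure mentioned in the abstract (cluster for each candidate number of clusters, score each result with a validity index, keep the best) can be sketched roughly as follows. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: bigram frequency profiles stand in for the "structural features", Euclidean distance compares them, farthest-first selection stands in for the deterministic initialization, and `validity_index` is a simple separation-minus-compactness combination rather than the paper's CVI. The noise-cluster elimination step is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bigram_profile(seq):
    """Normalized bigram frequencies (assumed stand-in for structural features)."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values()) or 1
    return {gram: c / total for gram, c in counts.items()}


def dist(p, q):
    """Euclidean distance between two sparse profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def centroid(profiles):
    """Component-wise mean of a non-empty list of sparse profiles."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}


def farthest_first(profiles, k):
    """Deterministic initialization: first profile, then repeatedly the farthest one."""
    centers = [profiles[0]]
    while len(centers) < k:
        centers.append(max(profiles, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def partition(profiles, k, iters=20):
    """Plain partition-based clustering (k-means style) on the profiles."""
    centers = farthest_first(profiles, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist(p, centers[j])) for p in profiles]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = centroid(members)
    return labels, centers


def validity_index(profiles, labels, centers, weight=1.0):
    """Illustrative linear combination: separation minus weighted compactness."""
    compact = sum(dist(p, centers[lab]) for p, lab in zip(profiles, labels)) / len(profiles)
    separation = min(dist(a, b) for a, b in combinations(centers, 2))
    return separation - weight * compact  # higher is better


def select_k(seqs, k_min=2, k_max=5):
    """Score every candidate k and return the best one with all scores."""
    profiles = [bigram_profile(s) for s in seqs]
    scores = {}
    for k in range(k_min, k_max + 1):
        labels, centers = partition(profiles, k)
        scores[k] = validity_index(profiles, labels, centers)
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    toy = ["ABABABAB", "ABABAB", "BABABABA", "CDCDCDCD", "DCDCDC", "CDCDCD"]
    best, scores = select_k(toy, k_min=2, k_max=4)
    print(best, scores)
```

On the toy input, the two groups share no bigrams, so the candidate with two clusters should score highest under this illustrative index. Real sequence sets would need the paper's own structural measures and noise handling.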


