He Kedan, Massena Djenerly G
Department of Physical Sciences, School of Arts and Sciences, Eastern Connecticut State University, Willimantic, CT, 06226, USA.
J Comput Aided Mol Des. 2023 Jan;37(1):17-37. doi: 10.1007/s10822-022-00488-9. Epub 2022 Nov 21.
One solution to the challenge of choosing an appropriate clustering algorithm is to combine different clusterings into a single consensus clustering result, known as a cluster ensemble (CE). This ensemble learning strategy can provide more robust and stable solutions across different domains and datasets. Unfortunately, not all clusterings in the ensemble contribute to the final data partition. Cluster ensemble selection (CES) aims to select a subset from a large library of clustering solutions to form a smaller cluster ensemble that performs as well as or better than the set of all available clustering solutions. In this paper, we investigate four CES methods for the categorization of structurally distinct organic compounds using high-dimensional IR and Raman spectroscopy data. The single quality selection (SQI) method is used with various quality indices to form a subset of the ensemble by selecting the highest-quality ensemble members. The Bagging method, usually applied in supervised learning, ranks ensemble members by calculating the normalized mutual information (NMI) between ensemble members and consensus solutions generated from a randomly sampled subset of the full ensemble. The hierarchical cluster and select method (HCAS-SQI) uses the diversity matrix of ensemble members to select a diverse set of ensemble members with the highest quality. Furthermore, a combining strategy (HCAS-MQI) can be used to combine subsets selected using multiple quality indices for the refinement of clustering solutions in the ensemble. The IR + Raman hybrid ensemble library is created by merging two complementary "views" of the organic compounds. This inherently more diverse library gives the best full ensemble consensus results.
Overall, the Bagging method is recommended because it provides the most robust results that are better than or comparable to the full ensemble consensus solutions.
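
The Bagging-style ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which consensus function or NMI normalization the paper uses, so the sketch assumes a thresholded co-association matrix with connected components as the consensus function, and arithmetic-mean normalization for NMI. The function names (`nmi`, `consensus`, `bagging_rank`), the parameters (`threshold`, `n_rounds`, `sample_size`, `seed`), and the toy ensemble are all hypothetical.

```python
import math
import random
from collections import Counter


def nmi(a, b):
    """NMI of two label lists (assumed arithmetic-mean normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in ca.values())
    h_b = -sum(c / n * math.log(c / n) for c in cb.values())
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in cab.items())
    denom = (h_a + h_b) / 2
    # Two single-cluster labelings are treated as identical partitions.
    return mi / denom if denom > 0 else 1.0


def consensus(members, threshold=0.5):
    """Assumed consensus function: link two samples when more than
    `threshold` of the members co-cluster them, then return the
    connected components as the consensus partition."""
    n = len(members[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            together = sum(m[i] == m[j] for m in members) / len(members)
            if together > threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]


def bagging_rank(ensemble, n_rounds=20, sample_size=None, seed=0):
    """Rank members by mean NMI against consensus solutions built
    from randomly sampled subsets of the full ensemble."""
    rng = random.Random(seed)
    size = sample_size or max(2, len(ensemble) // 2)
    scores = [0.0] * len(ensemble)
    for _ in range(n_rounds):
        reference = consensus(rng.sample(ensemble, size))
        for k, member in enumerate(ensemble):
            scores[k] += nmi(member, reference) / n_rounds
    order = sorted(range(len(ensemble)), key=lambda k: scores[k], reverse=True)
    return order, scores


# Toy ensemble (hypothetical): four members agree up to label names,
# the last member is a poor clustering.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [2, 2, 2, 5, 5, 5],
    [0, 1, 0, 1, 0, 1],
]
order, scores = bagging_rank(ensemble, n_rounds=10, sample_size=3)
selected = order[:3]  # keep the top-ranked members as the smaller ensemble
print(order, [round(s, 3) for s in scores])
```

In this toy case every sampled subset contains a majority of agreeing members, so the poor member is ranked last and excluded from `selected`. On real spectroscopy data the subset size and the number of retained members would need tuning.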