DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data.

Author information

Department of Data Science and Engineering, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland.

Netguru, Małe Garbary 9, 61-756, Poznań, Poland.

Publication information

BMC Bioinformatics. 2022 Dec 12;23(1):538. doi: 10.1186/s12859-022-05093-z.

Abstract

BACKGROUND

Investigating molecular heterogeneity provides insights into tumour origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible; therefore, automated unsupervised learning approaches are used to discover tissue heterogeneity. However, automated analyses require experience in setting the algorithms' hyperparameters and expert knowledge of the analysed biological processes. Moreover, because of the numerous features measured, feature engineering is needed to obtain valuable results.

RESULTS

We propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for segmenting high-dimensional datasets. The algorithm is compared with alternative solutions (regular k-means, spatial and spectral approaches) combined with different feature engineering techniques (None, PCA, EXIMS, UMAP, Neural Ions). Three quality indices (Dice Index, Rand Index and EXIMS score), focusing on the overall composition of the clustering, coverage of the tumour region and spatial cluster consistency, are used to assess the quality of unsupervised analyses. The algorithms were validated on mass spectrometry imaging (MSI) datasets: 2D human cancer tissue samples and 3D mouse kidney images. The DiviK algorithm performed best among the four clustering algorithms compared (overall quality score 1.24, 0.58 and 162 for d(0, 0, 0), d(1, 1, 1) and the sum of ranks, respectively), with spectral clustering mostly ranking second. Feature engineering techniques impact the overall clustering results less than the algorithms themselves (partial η² effect size: 0.141 versus 0.345; Kendall's concordance index: 0.424 versus 0.138 for d(0, 0, 0)).
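Two of the quality indices named above have standard textbook definitions, which may help when reading the scores. The sketch below is a minimal, dependency-free illustration of those definitions only. The function names `dice_index` and `rand_index` are hypothetical, and the exact way the paper selects the tumour cluster or aggregates the indices into an overall score is not reproduced here.

```python
from itertools import combinations


def dice_index(pred_mask, true_mask):
    """Dice index 2|A n B| / (|A| + |B|) for two equal-length binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree.

    A pair agrees when both labelings put the two items in the same
    cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    print(dice_index([1, 1, 0, 0], [1, 0, 0, 0]))
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The Rand index is invariant to how clusters are numbered, which is why the second toy call compares two labelings that differ only in label names. The pairwise loop is quadratic in the number of items, so for datasets of the size discussed here a contingency-table formulation would be the practical choice.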

CONCLUSIONS

DiviK could be the default choice for exploring MSI data. Thanks to its unique GMM-based local optimisation of the feature space and its deglomerative (divisive) scheme, DiviK results do not depend strongly on the feature engineering technique applied and can reveal the hidden structure in a tissue sample. Additionally, DiviK is highly scalable: it can process, in a single run, large omics datasets with more than 1.5 million instances and a few thousand features. Finally, due to its simplicity, DiviK can easily be generalised into an even more flexible framework. It is therefore useful for other omics data (such as single-cell spatial transcriptomics) and for tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik.
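To make the idea of a top-down (divisive) scheme concrete, the following is a minimal sketch of recursive splitting with plain 2-means. It is not the DiviK implementation: the GMM-based feature-space adaptation is omitted entirely, and the stopping rule (a relative reduction in within-cluster sum of squares, controlled by the hypothetical parameters `min_gain`, `min_size` and `max_depth`) is an assumption made for illustration only.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _sse(points):
    """Sum of squared distances of the points to their mean."""
    centre = _mean(points)
    return sum(_dist2(p, centre) for p in points)


def two_means(points, n_iter=20):
    """Plain k-means with k=2 and a deterministic farthest-point start."""
    centre = _mean(points)
    c0 = max(points, key=lambda p: _dist2(p, centre))
    c1 = max(points, key=lambda p: _dist2(p, c0))
    centroids = [list(c0), list(c1)]
    assign = []
    for _ in range(n_iter):
        new_assign = [
            0 if _dist2(p, centroids[0]) <= _dist2(p, centroids[1]) else 1
            for p in points
        ]
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = _mean(members)
    return assign


def divisive_clustering(points, min_size=4, min_gain=0.5, max_depth=3):
    """Recursively split the data; return one path-like label per point."""
    labels = [""] * len(points)

    def split(indices, path, depth):
        subset = [points[i] for i in indices]
        if depth < max_depth and len(indices) >= min_size:
            assign = two_means(subset)
            left = [i for i, a in zip(indices, assign) if a == 0]
            right = [i for i, a in zip(indices, assign) if a == 1]
            if left and right:
                before = _sse(subset)
                after = (_sse([points[i] for i in left])
                         + _sse([points[i] for i in right]))
                # Hypothetical stopping rule: keep the split only if it
                # removes a large share of the within-cluster variation.
                if before > 0 and (before - after) / before >= min_gain:
                    split(left, path + "0", depth + 1)
                    split(right, path + "1", depth + 1)
                    return
        for i in indices:
            labels[i] = path or "root"

    split(list(range(len(points))), "", 0)
    return labels


if __name__ == "__main__":
    # Two well-separated toy blobs, not data from the paper.
    blobs = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
    print(divisive_clustering(blobs))
```

Each label records the path of splits that led to a point's final cluster, so the hierarchy can be read back from the labels. In a scheme like the one the abstract describes, a feature-selection step would run on each subset before it is split; that step is deliberately left out of this sketch.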


