Terry Fox Laboratory, BC Cancer Research, Vancouver, British Columbia, Canada.
Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada.
Cytometry A. 2023 Nov;103(11):889-901. doi: 10.1002/cyto.a.24776. Epub 2023 Aug 29.
The analysis of large amounts of data is important for the development of machine learning (ML) models. flowSim is the first algorithm designed to visualize, detect, and remove highly redundant information in flow cytometry (FCM) training sets, with the aim of decreasing the computational time needed for training and improving the performance of ML algorithms by reducing overfitting. flowSim performs near-duplicate image detection by combining community detection algorithms with density analysis of the marker expression values. On a dataset of 160 images of bivariate FCM data, flowSim clustering agreed with consensus manual clustering with a mean Adjusted Rand Index of 0.90, demonstrating its efficiency in identifying similar patterns. flowSim selectively discarded near-duplicate files in datasets constructed with known redundancy, and removed 92.6% of the FCM images in a dataset of over 500,000 images drawn from public repositories.
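As an illustration of the agreement measure quoted above, the following is a minimal sketch of the Adjusted Rand Index computed between two clusterings of the same items. It is not the flowSim implementation: the function name `adjusted_rand_index` and the toy label lists are hypothetical, and the abstract does not state how the reported mean was taken.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items.

    Illustrative helper (hypothetical name), not taken from flowSim.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")

    n = len(labels_a)
    # Contingency counts: how many items fall in cluster i of A and cluster j of B.
    contingency = Counter(zip(labels_a, labels_b))

    # Number of item pairs grouped together in both clusterings, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in contingency.values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = pairs_a * pairs_b / total_pairs if total_pairs else 0.0
    maximum = (pairs_a + pairs_b) / 2

    if maximum == expected:
        # Degenerate case (e.g. every item in a single cluster in both).
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


# Toy example: hypothetical manual vs. automated cluster labels for six images.
manual = [0, 0, 0, 1, 1, 2]
automated = [0, 0, 1, 1, 1, 2]
score = adjusted_rand_index(manual, automated)
```

A value of 1 indicates identical partitions regardless of how the clusters are numbered, while values near 0 indicate agreement no better than chance.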