Topic modeling for cluster analysis of large biological and medical datasets.

Authors

Zhao Weizhong, Zou Wen, Chen James J

Publication information

BMC Bioinformatics. 2014;15 Suppl 11(Suppl 11):S11. doi: 10.1186/1471-2105-15-S11-S11. Epub 2014 Oct 21.

Abstract

BACKGROUND

The "big data" moniker is nowhere better deserved than in describing the ever-increasing size and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large, high-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means of clustering, or of overcoming clustering difficulties, in large biological and medical datasets.

RESULTS

In this study, three topic model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering on the three datasets in comparison with traditional methods. For each of the three datasets, a preferable cluster analysis method emerged on the basis of how well it replicated known biological truths.
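
The abstract names the three methods without specifying them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a topic model (for example, latent Dirichlet allocation) has already been fitted and has produced a sample-by-topic probability matrix and a topic-by-feature probability matrix. The names `theta`, `phi`, `top_k`, the helper functions, and the toy numbers are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def highest_probable_topic(theta: Matrix) -> List[int]:
    """Assign each sample to the topic with the largest probability in its row."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def select_features(phi: Matrix, top_k: int) -> List[int]:
    """Return the sorted union of the top_k highest-weight features of every topic."""
    chosen = set()
    for topic_weights in phi:
        ranked = sorted(range(len(topic_weights)),
                        key=topic_weights.__getitem__, reverse=True)
        chosen.update(ranked[:top_k])
    return sorted(chosen)


def reduce_columns(data: Matrix, columns: Sequence[int]) -> Matrix:
    """Keep only the selected feature columns of the original data matrix."""
    return [[row[j] for j in columns] for row in data]


# Toy inputs (made up for illustration): 4 samples, 5 features, 2 topics.
data = [
    [3, 0, 2, 0, 1],
    [4, 1, 3, 0, 0],
    [0, 5, 0, 4, 1],
    [0, 3, 1, 6, 0],
]
theta = [  # assumed sample-by-topic probabilities from a fitted topic model
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.1, 0.9],
]
phi = [  # assumed topic-by-feature probabilities from the same model
    [0.50, 0.05, 0.35, 0.05, 0.05],
    [0.05, 0.40, 0.05, 0.45, 0.05],
]

# 1. Highest probable topic assignment: the topic index serves as the cluster label.
labels = highest_probable_topic(theta)

# 2. Feature selection: restrict the original data to the top features of each topic.
selected = select_features(phi, top_k=2)
reduced = reduce_columns(data, selected)

# 3. Feature extraction: each sample's topic probabilities become its new features.
extracted = [list(row) for row in theta]
```

Under this reading, the first method yields cluster labels directly, whereas `reduced` and `extracted` would still need to be passed to a conventional clustering algorithm, which is not shown here.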

CONCLUSION

Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods (highest probable topic assignment, feature selection, and feature extraction) yield clustering improvements for the three different data types. The resulting clusters represent true groupings and subgroupings in the data more effectively than those from traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e85a/4251039/2b46a8a9cfc7/1471-2105-15-S11-S11-1.jpg
