Research and Application of Clustering Algorithm for Text Big Data

Author information

Chen Zi Li

Affiliation

Institute of General Aviation Industry, Fujian Chuanzheng Communications College, Fuzhou 350007, China.

Publication information

Comput Intell Neurosci. 2022 Jun 8;2022:7042778. doi: 10.1155/2022/7042778. eCollection 2022.

DOI:10.1155/2022/7042778

PMID:35720917

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9200521/

Abstract

In the era of big data, text is an important store of information in all walks of life. From humanities research to government decision-making, from precision medicine to quantitative finance, and from customer management to marketing, massive text, as one of the most important information carriers, plays an important role everywhere. The text data generated in practical problems in humanities research, the financial industry, marketing, and other fields often has obvious domain characteristics: it contains the professional vocabulary and distinctive language patterns of those fields and is often accompanied by a variety of "noise." Dealing with such texts is a great challenge under current technical conditions, especially for Chinese texts. Clustering algorithms offer a better solution for processing text big data, and they form the core of cluster analysis. The K-means algorithm is widely used in cluster analysis because its principle is simple to implement and its time complexity is low, but its K value must be preset, and randomly selected initial cluster centers can lead it into a locally optimal solution. Other clustering algorithms, such as mean shift clustering, are also used alongside K-means clustering in mining text big data. In view of these problems, this paper first extracts and analyzes the text big data and then runs experiments with the clustering algorithms. Experimental conclusion: in an analysis of large-scale text data, limited to large and simple data sets, the traditional K-means algorithm shows low efficiency and reduced accuracy, and it is susceptible to the choice of initial centers and to abnormal data. To address these problems, the K-means cluster analysis algorithm is analyzed and improved for data sets with large data volumes, in order to raise its execution efficiency and accuracy on such data sets.
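
The sensitivity of K-means to its initial centers can be illustrated with a small sketch. The code below is not the paper's improved algorithm, which the abstract does not specify. It is a plain Lloyd-style K-means on a hypothetical one-dimensional toy data set, run from a poor and from a good set of initial centers. The helper `kmeanspp_seeds` shows k-means++ style seeding, a widely used remedy that is swapped in here purely as an assumption. All function names and data are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, init_centers, max_iter=100):
    """Lloyd's iteration from the given initial centers.

    Returns (centers, labels, inertia), where inertia is the
    within-cluster sum of squared distances.
    """
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, inertia


def kmeanspp_seeds(points, k, rng):
    """k-means++ style seeding (assumption, not taken from the paper).

    Each new seed is drawn with probability proportional to its squared
    distance from the nearest seed so far. Assumes at least k distinct points.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


# Hypothetical toy data: three well-separated pairs on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# Poor start: two centers inside the first pair, one between the other two.
_, _, bad_inertia = kmeans(points, [(0.0,), (1.0,), (15.5,)])
# Good start: one center in each pair.
_, _, good_inertia = kmeans(points, [(0.0,), (10.0,), (20.0,)])
# Seeded start; the outcome depends on the random draw.
seeds = kmeanspp_seeds(points, 3, random.Random(0))
_, _, pp_inertia = kmeans(points, seeds)

print(bad_inertia, good_inertia, pp_inertia)
```

On this toy data the poor start converges immediately to a local optimum with inertia 101.0, while the good start reaches 1.5. Both runs use the same K, so the gap comes entirely from initialization.
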
Mean shift clustering can be regarded as gradually moving many random centers in the direction of maximum density: each center's mean centroid is shifted repeatedly according to the probability density of the data until multiple maximum-density centers are obtained. Mean shift clustering can therefore also be described as an algorithm based on kernel density estimation.
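
The mean shift procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. It uses a flat (uniform) kernel, starts one center at every data point rather than at random positions, and merges converged centers that lie within one bandwidth of each other. The function name `mean_shift`, the `bandwidth` value, and the toy data are hypothetical.

```python
import math


def mean_shift(points, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift started from every data point.

    Each center moves to the mean of the data points within `bandwidth`
    of it until it stops moving. Converged centers closer than
    `bandwidth` to an existing mode are merged into that mode.
    Returns (modes, labels).
    """
    modes = []
    labels = []
    for start in points:
        x = start
        for _ in range(max_iter):
            neighbors = [p for p in points if math.dist(x, p) <= bandwidth]
            if not neighbors:  # guard against rounding at the boundary
                break
            new_x = tuple(sum(c) / len(neighbors) for c in zip(*neighbors))
            moved = math.dist(x, new_x)
            x = new_x
            if moved < tol:  # the center has reached a density peak
                break
        # Attach the converged center to a nearby mode, or open a new one.
        for j, mode in enumerate(modes):
            if math.dist(x, mode) < bandwidth:
                labels.append(j)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return modes, labels


# Hypothetical toy data: two square groups of four points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

modes, labels = mean_shift(data, bandwidth=2.0)
print(modes)
print(labels)
```

Unlike K-means, the number of clusters is not preset here. It emerges from the bandwidth, which plays the role of the kernel width in the underlying density estimate.
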

Similar articles

Research on Literature Clustering Algorithm for Massive Scientific and Technical Literature Query Service.

Comput Intell Neurosci. 2022 Aug 21;2022:3392489. doi: 10.1155/2022/3392489. eCollection 2022.

Optimization of Data Mining and Analysis System for Chinese Language Teaching Based on Convolutional Neural Network.

Comput Intell Neurosci. 2021 Dec 3;2021:1148954. doi: 10.1155/2021/1148954. eCollection 2021.

A Fast Projection-Based Algorithm for Clustering Big Data.

Interdiscip Sci. 2019 Sep;11(3):360-366. doi: 10.1007/s12539-018-0294-3. Epub 2018 Jun 7.

Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm, Minimum Spanning Tree, and Hierarchical Clustering in an Applied Study.

Comput Math Methods Med. 2020 Aug 1;2020:7636857. doi: 10.1155/2020/7636857. eCollection 2020.

Research on Data Analysis of Traditional Chinese Medicine with Improved Differential Evolution Clustering Algorithm.

J Healthc Eng. 2021 Sep 4;2021:4468741. doi: 10.1155/2021/4468741. eCollection 2021.

RFID Data Analysis and Evaluation Based on Big Data and Data Clustering.

Comput Intell Neurosci. 2022 Mar 26;2022:3432688. doi: 10.1155/2022/3432688. eCollection 2022.

Moth-Flame Optimization-Bat Optimization: Map-Reduce Framework for Big Data Clustering Using the Moth-Flame Bat Optimization and Sparse Fuzzy C-Means.

Big Data. 2020 Jun;8(3):203-217. doi: 10.1089/big.2019.0125. Epub 2020 May 19.

A Novel Model on Reinforce K-Means Using Location Division Model and Outlier of Initial Value for Lowering Data Cost.

Entropy (Basel). 2020 Aug 17;22(8):902. doi: 10.3390/e22080902.

Emotional analysis of evaluation discourse in business English translation based on language big data mining of public health environment.

Front Public Health. 2022 Oct 20;10:981182. doi: 10.3389/fpubh.2022.981182. eCollection 2022.

Cited by

Classifying and fact-checking health-related information about COVID-19 on Twitter/X using machine learning and deep learning models.

BMC Med Inform Decis Mak. 2025 Feb 11;25(1):73. doi: 10.1186/s12911-025-02895-y.

Analysis and prediction of research hotspots and trends in heart failure research.

J Transl Int Med. 2024 Jul 27;12(3):263-273. doi: 10.2478/jtim-2023-0117. eCollection 2024 Jun.

Analysis and prediction of research hotspots and trends in pediatric medicine from 2,580,642 studies published between 1940 and 2021.

World J Pediatr. 2023 Aug;19(8):793-797. doi: 10.1007/s12519-023-00731-9. Epub 2023 Jun 9.

Recurrence Risk Evaluation in Patients with Papillary Thyroid Carcinoma: Multicenter Machine Learning Evaluation of Lymph Node Variables.

Cancers (Basel). 2023 Jan 16;15(2):550. doi: 10.3390/cancers15020550.

References

Analysis of big data job requirements based on K-means text clustering in China.

PLoS One. 2021 Aug 5;16(8):e0255419. doi: 10.1371/journal.pone.0255419. eCollection 2021.

SPICi: a fast clustering algorithm for large biological networks.

Bioinformatics. 2010 Apr 15;26(8):1105-11. doi: 10.1093/bioinformatics/btq078. Epub 2010 Feb 24.

An adaptive spatial fuzzy clustering algorithm for 3-D MR image segmentation.

IEEE Trans Med Imaging. 2003 Sep;22(9):1063-75. doi: 10.1109/TMI.2003.816956.
