Chen Zi Li
Institute of General Aviation Industry, Fujian Chuanzheng Communications College, Fuzhou 350007, China.
Comput Intell Neurosci. 2022 Jun 8;2022:7042778. doi: 10.1155/2022/7042778. eCollection 2022.
In the era of big data, text is an important information store in all walks of life. From humanities research to government decision-making, from precision medicine to quantitative finance, from customer management to marketing, massive text is one of the most important information carriers and plays a role everywhere. The text data generated by practical problems in humanities research, the financial industry, marketing, and other fields often has obvious domain characteristics: it contains the professional vocabulary and distinctive language patterns of these fields and is often accompanied by a variety of "noise." Dealing with such texts is a great challenge under current technical conditions, especially for Chinese texts. Clustering algorithms provide a better solution for processing text big data. Clustering algorithms are the core of cluster analysis. The K-means algorithm is widely used in cluster analysis because its principle is simple and its time complexity is low, but its K value (the number of clusters) must be set in advance, and random selection of the initial cluster centers can lead it into a local optimum. Other clustering algorithms, such as mean shift clustering, are also used alongside K-means clustering in mining text big data. In view of these problems, this paper first extracts and analyzes text big data and then runs experiments with the clustering algorithms. Experimental conclusion: on large-scale text data, with experiments limited to large-scale and simple data sets, the traditional K-means algorithm has low efficiency and reduced accuracy, and it is susceptible to the influence of the initial centers and abnormal data. To address these problems, the K-means cluster analysis algorithm is analyzed and improved to raise its execution efficiency and accuracy on data sets with large data volumes.
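The sensitivity of K-means to its preset K and to randomly chosen initial centers can be illustrated with a minimal NumPy sketch. This is not the improved algorithm proposed in the paper: it is plain Lloyd-style K-means plus random restarts, a common generic mitigation. The function names `kmeans` and `kmeans_restarts`, the restart count, and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, k, rng, n_iter=100):
    """Plain K-means from one random initialization (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    # Random initial centers: the step that can lead to a local optimum.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and within-cluster sum of squared distances.
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    inertia = float(np.sum(dist[np.arange(len(points)), labels] ** 2))
    return centers, labels, inertia

def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means several times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Toy data (assumed): three 2-D blobs standing in for text feature vectors.
data_rng = np.random.default_rng(42)
data = np.vstack([
    data_rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    data_rng.normal(loc=(6.0, 0.0), scale=0.5, size=(50, 2)),
    data_rng.normal(loc=(3.0, 6.0), scale=0.5, size=(50, 2)),
])
single_run = kmeans(data, k=3, rng=np.random.default_rng(0))
best_run = kmeans_restarts(data, k=3, n_restarts=10, seed=0)
```

Comparing `single_run[2]` with `best_run[2]` shows how much the result of one random initialization can differ from the best of several; K itself still has to be chosen by the user.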
Mean shift clustering can be regarded as moving many random centers gradually in the direction of maximum density: each center is repeatedly moved to the mean of the data around it, following the probability density of the data, until multiple maximum-density centers are obtained. Mean shift clustering can therefore also be described as a clustering algorithm based on kernel density estimation.
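The idea of centers drifting toward density maxima can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it uses a Gaussian kernel, seeds the centers at the data points themselves rather than at random locations, and the names `mean_shift`, `group_modes`, `bandwidth`, and `merge_radius` are hypothetical.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50, tol=1e-5):
    """Shift every seed toward a local maximum of a Gaussian kernel density."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()  # assumption: one seed per data point
    for _ in range(n_iter):
        # Squared distances from each current seed to every data point.
        diff = modes[:, None, :] - points[None, :, :]
        sq_dist = np.sum(diff ** 2, axis=2)
        # Gaussian kernel weights: nearby points pull harder.
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        # Each seed moves to the weighted mean of the data around it.
        new_modes = weights @ points / weights.sum(axis=1, keepdims=True)
        shift = np.max(np.linalg.norm(new_modes - modes, axis=1))
        modes = new_modes
        if shift < tol:
            break
    return modes

def group_modes(modes, merge_radius):
    """Give the same label to seeds that converged to nearby density peaks."""
    centers = []
    labels = np.empty(len(modes), dtype=int)
    for i, mode in enumerate(modes):
        for j, center in enumerate(centers):
            if np.linalg.norm(mode - center) < merge_radius:
                labels[i] = j
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return np.array(centers), labels

# Toy data (assumed): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(40, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(40, 2)),
])
modes = mean_shift(data, bandwidth=1.0)
centers, labels = group_modes(modes, merge_radius=0.5)
```

Unlike K-means, the number of clusters is not preset here; it falls out of how many distinct density peaks the seeds reach, and it depends on the chosen bandwidth.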