Yuan Soe-Tsyr, Sun Jerry
MIS Department, National Chengchi University, Taipei, Taiwan.
IEEE Trans Syst Man Cybern B Cybern. 2005 Oct;35(5):1028-40. doi: 10.1109/tsmcb.2005.850153.
Developing algorithms for automated text categorization over massive document collections is an important research area in data mining and knowledge discovery. Most text-clustering methods rely on term-based measures of distance or similarity and ignore the structure of the documents. In this paper, we present a novel method, structured cosine similarity (SCS), that gives document clustering a new way of modeling based on document summarization and takes document structure into account, with the aim of improving clustering quality, stability, and efficiency. The study was motivated by the problem of clustering speech documents, which lack rich document features, obtained from wireless oral experience sharing by the mobile workforce of enterprises, in support of audio-based knowledge management. In other words, the goal is to facilitate knowledge acquisition and sharing by speech. Our evaluations show fairly promising results for SCS.
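For orientation, the term-based baseline that the abstract contrasts with SCS can be sketched as plain cosine similarity over term-frequency vectors. This is only an illustration of the conventional measure, not the SCS method itself, whose details the abstract does not give. The function name `term_cosine_similarity`, the whitespace tokenization, and the sample sentences are assumptions made for this example.

```python
import math
from collections import Counter


def term_cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts.

    Hypothetical helper for illustration: tokenization is a simple
    lowercase whitespace split, with no stemming or weighting.
    """
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())

    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())

    # Euclidean norms of the two term-frequency vectors.
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Two made-up utterances with the same words in a different order.
first = "reset the router then call support"
second = "call support then reset the router"
score = term_cosine_similarity(first, second)
```

Because the two sample texts contain exactly the same terms, the bag-of-words measure scores them as numerically identical even though the order of the steps differs. That insensitivity to how a document is organized is the limitation of term-based measures that the abstract points to.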