Topic-based automatic summarization algorithm for Chinese short text.

Affiliations

Nanjing University of Information Science and Technology, Nanjing 210044, China.

Nanjing Institute of Technology, Nanjing 211167, China.

Publication information

Math Biosci Eng. 2020 May 12;17(4):3582-3600. doi: 10.3934/mbe.2020202.

Abstract

Most current automatic summarization methods target English text. In Chinese text, words differ greatly from one another, parts of speech are numerous and complex, and polysemous or ambiguous words appear frequently. As a result, useful feature words are harder to extract from Chinese text than from English text. Because of the complex syntax of Chinese, relatively few automatic summarization methods exist for Chinese text. Earlier methods could only select important sentences from the original text and arrange them simply, yielding summaries with disordered sentences and insufficient coherence. Moreover, because Chinese short text usually contains more redundant information and loosely structured sentences, we propose a topic-based automatic summarization method for Chinese short text. First, we propose a key sentence selection method that combines topic words and TF-IDF to score each text in the original data against its corresponding topic. The sentence with the highest score is then selected as the topic sentence for that topic. Because Weibo short texts may contain a lot of irrelevant information and sometimes even lack important components of a topic, we propose three retouching mechanisms to improve the conciseness, richness, and readability of the extracted topic sentences. We validate our approach on natural disaster and social hot event datasets from Sina Weibo. The experimental results show that the polished topic summary not only reflects the exact relationship between topic sentences and natural disasters or social hot events, but also carries rich semantic information. More importantly, the basic elements of a natural disaster or social hot event can largely be grasped from the topic sentence, which can help the government guide disaster relief and meet users' need to quickly obtain information about social hot events.
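
The abstract does not give the scoring formula, so the following is only a minimal sketch of one plausible reading of the key sentence selection step: each text is scored by summing the TF-IDF weights of the topic words it contains, and the highest-scoring text is taken as the topic sentence. The function names, the smoothed IDF variant, and the assumption that texts arrive already segmented into words are all illustrative choices, not details from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumption: a smoothed IDF, ln((1 + n) / (1 + df)) + 1, is used here
    for numerical convenience; the paper may define it differently.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty texts
        weights.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


def score_texts(docs, topic_words):
    """Score each text as the sum of TF-IDF weights of its topic words."""
    return [
        sum(w for term, w in doc_weights.items() if term in topic_words)
        for doc_weights in tfidf_weights(docs)
    ]


def pick_topic_sentence(docs, topic_words):
    """Return the index of the highest-scoring text (first one on ties)."""
    scores = score_texts(docs, topic_words)
    return max(range(len(docs)), key=scores.__getitem__)


if __name__ == "__main__":
    # Hypothetical, pre-segmented toy posts (not data from the paper).
    posts = [
        ["台风", "登陆", "浙江", "沿海", "风力", "强"],  # typhoon makes landfall in Zhejiang
        ["今天", "天气", "不错", "出门", "散步"],        # unrelated post about nice weather
        ["台风", "好", "可怕", "大家", "注意", "安全", "转发", "抽奖"],  # noisy post
    ]
    topic = {"台风", "登陆", "浙江"}  # hypothetical topic words
    best = pick_topic_sentence(posts, topic)
    print("".join(posts[best]))
```

In practice the topic words would come from a topic model and the texts from a Chinese word segmenter; both are outside this sketch. Normalizing term frequency by text length penalizes long posts padded with irrelevant words, which is one simple way to favor concise candidates.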
