A novel centroid based sentence classification approach for extractive summarization of COVID-19 news reports.

Author information

Banerjee Sumanta, Mukherjee Shyamapada, Bandyopadhyay Sivaji

Affiliations

Computer Science and Engineering, National Institute of Technology Silchar, Silchar, Assam 788010 India.

Publication information

Int J Inf Technol. 2023;15(4):1789-1801. doi: 10.1007/s41870-023-01221-x. Epub 2023 Mar 24.

Abstract

A COVID-19 news report covers subtopics such as infections, deaths, the economy, jobs, and more. The proposed method generates a news summary based on the subtopics that interest a reader. It extracts a centroid that captures the lexical pattern of the sentences on those subtopics from the words frequently used in them. The centroid is then used as a query in the vector space model (VSM) for sentence classification and extraction, producing a query-focused summarization (QFS) of the documents. Three approaches, TF-IDF, word vector averaging, and an auto-encoder, are tested for generating the sentence embeddings used in the VSM. These embeddings are ranked by their similarity to the query embedding. A novel approach is introduced that uses a supervised technique to find the value of the similarity parameter used to classify the sentences. Finally, the performance of the method is assessed in two ways. In the first assessment, all the sentences of the dataset are considered together; in the second, each document-wise group of sentences is considered separately using fivefold cross-validation. The proposed method achieves mean F1 scores between 0.60 and 0.63 with the three sentence encoding approaches on the test dataset.
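The pipeline described in the abstract can be illustrated with a minimal sketch. It covers only the TF-IDF variant and rests on several assumptions that the abstract does not confirm: the centroid is taken as the top-k most frequent non-stop words of labelled on-topic sentences, similarity is cosine similarity, and the "similarity parameter" is a cut-off chosen to maximize F1 on labelled sentences. The stop-word list, the toy sentences, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP = {"the", "and", "as", "but", "of", "on", "in", "a", "to", "will", "under"}

def tokenize(text):
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP]

def build_idf(token_lists):
    # Smoothed inverse document frequency, one "document" per sentence.
    n = len(token_lists)
    df = Counter(w for toks in token_lists for w in set(toks))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def tfidf(tokens, idf):
    # Sparse TF-IDF vector; words outside the training vocabulary are dropped.
    tf = Counter(t for t in tokens if t in idf)
    return {w: c * idf[w] for w, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_query(on_topic_token_lists, top_k):
    # Assumption: the centroid is the top_k most frequent on-topic words.
    counts = Counter(w for toks in on_topic_token_lists for w in toks)
    return [w for w, _ in counts.most_common(top_k)]

def f1_score(pred, gold):
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(sims, labels):
    # Assumption: the similarity parameter is the cut-off with the best F1.
    return max(sorted(set(sims)),
               key=lambda t: f1_score([s >= t for s in sims], labels))

# Toy labelled sentences (1 = on the reader's subtopic), invented for illustration.
train = [
    ("The state reported 500 new infections and 12 deaths on Monday.", 1),
    ("Hospitals recorded more deaths as infections rose.", 1),
    ("New infections fell but deaths stayed high.", 1),
    ("The economy lost thousands of jobs last month.", 0),
    ("Schools will reopen next week under new rules.", 0),
    ("Airlines cut flights as travel demand dropped.", 0),
]
train_tokens = [tokenize(s) for s, _ in train]
labels = [y for _, y in train]
idf = build_idf(train_tokens)
centroid = centroid_query([t for t, y in zip(train_tokens, labels) if y], top_k=3)
query_vec = tfidf(centroid, idf)
train_sims = [cosine(tfidf(t, idf), query_vec) for t in train_tokens]
threshold = best_threshold(train_sims, labels)

# Classify and rank unseen sentences against the centroid query.
new_sentences = [
    "Deaths and infections climbed again in the city.",
    "The economy added new jobs last month.",
]
scored = [(cosine(tfidf(tokenize(s), idf), query_vec), s) for s in new_sentences]
summary = [s for sim, s in sorted(scored, reverse=True) if sim >= threshold]
print(centroid, round(threshold, 3), summary)
```

Swapping `tfidf` for averaged word vectors or an auto-encoder output would leave the ranking and threshold steps unchanged, which is presumably why the three encodings can be compared on the same footing.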

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/77c4/10036244/e682256b16a6/41870_2023_1221_Fig1_HTML.jpg
