Banerjee Sumanta, Mukherjee Shyamapada, Bandyopadhyay Sivaji
Computer Science and Engineering, National Institute of Technology Silchar, Silchar, Assam 788010 India.
Int J Inf Technol. 2023;15(4):1789-1801. doi: 10.1007/s41870-023-01221-x. Epub 2023 Mar 24.
A COVID-19 news article covers subtopics such as infections, deaths, the economy, jobs, and more. The proposed method generates a news summary based on the subtopics that interest a reader. It builds a centroid that captures the lexical pattern of the sentences on those subtopics from the words they use most frequently. The centroid is then used as a query in the vector space model (VSM) for sentence classification and extraction, producing a query-focused summarization (QFS) of the documents. Three approaches (TF-IDF, word-vector averaging, and an auto-encoder) are evaluated for generating the sentence embeddings used in the VSM. These embeddings are ranked by their similarity to the query embedding. A novel approach is introduced that uses a supervised technique to find the value of the similarity parameter used to classify the sentences. Finally, the performance of the method is assessed in two ways. In the first assessment, all the sentences of the dataset are considered together; in the second, each document-wise group of sentences is considered separately using fivefold cross-validation. The proposed method achieved mean F1 scores ranging from 0.60 to 0.63 across the three sentence encoding approaches on the test dataset.
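The pipeline described in the abstract can be illustrated with a minimal sketch covering only the TF-IDF variant. This is not the authors' implementation: the stop-word list, the smoothed IDF formula, the `top_k` cutoff for "frequently used words", the grid of candidate thresholds, and the toy sentences are all assumptions made for illustration. The threshold search simply picks the candidate that maximizes F1 on labeled sentences, which is one plausible reading of "a supervised technique", not the paper's confirmed procedure.

```python
import math
from collections import Counter

# Hypothetical stop-word list and punctuation set (assumptions, not from the paper).
STOPWORDS = {"the", "and", "of", "a", "in", "from", "this", "were", "will"}
PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, strip punctuation, drop stop words."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def build_idf(sentences):
    """Smoothed inverse document frequency, treating each sentence as a document."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; tokens outside the IDF vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {w: c * idf[w] for w, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid_query(topic_sentences, idf, top_k=3):
    """Query vector built from the top_k most frequent words of the topic sentences."""
    counts = Counter(w for s in topic_sentences for w in tokenize(s))
    frequent = [w for w, _ in counts.most_common(top_k)]
    return tfidf(frequent, idf)

def best_threshold(scores, labels, candidates):
    """Pick the similarity threshold that maximizes F1 on labeled sentences."""
    def f1(th):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(candidates, key=f1)

def summarize(sentences, query, idf, threshold):
    """Rank sentences by similarity to the query and keep those above the threshold."""
    scored = [(cosine(tfidf(tokenize(s), idf), query), s) for s in sentences]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [s for score, s in ranked if score >= threshold]

# Toy data (invented for illustration only).
sentences = [
    "Deaths from the virus rose sharply this week.",
    "The economy lost thousands of jobs.",
    "Hospitals reported more deaths and infections.",
    "Schools will reopen next month.",
]
labels = [True, False, True, False]  # is the sentence on the reader's subtopic?
topic_sentences = ["Infections and deaths increased.", "More deaths were reported."]

idf = build_idf(sentences)
query = centroid_query(topic_sentences, idf)
scores = [cosine(tfidf(tokenize(s), idf), query) for s in sentences]
threshold = best_threshold(scores, labels, [i / 10 for i in range(1, 10)])
summary = summarize(sentences, query, idf, threshold)
```

In this toy run the query reduces to the words "deaths" and "infections", so only the two sentences mentioning them score above zero. Swapping `tfidf` for averaged word vectors or an auto-encoder embedding would change only how the sentence and query vectors are produced; the ranking and thresholding steps would stay the same.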