Syed Ayesha Ayub, Gaol Ford Lumban, Boediman Alfred, Matsuo Tokuro, Budiharto Widodo
Department of Doctor of Computer Science - BINUS Graduate Program, Bina Nusantara University, Jakarta, Indonesia.
Department of Econometrics and Statistics - The University of Chicago, Booth School of Business, USA.
Data Brief. 2023 Sep 1;50:109535. doi: 10.1016/j.dib.2023.109535. eCollection 2023 Oct.
Customer reviews are valuable resources containing customer opinions and sentiments toward a product. Reviews are informative but can be lengthy or repetitive, which calls for opinion summarization systems that retain only the significant opinion information from a review. Abstractive summarization is a form of text summarization that generates a summary mimicking a human-written one [1]. When pretrained language models are fine-tuned for abstractive review summarization, a problem known as 'domain shift' usually arises, because the source and target domains contain data from different distributions [2]. This degrades the model's performance on the target domain. This paper contributes a data package comprising three datasets: an annotated abstractive summarization dataset (annotated_abs_summ) of 500 pairs of airline reviews and abstractive summaries; a dataset (review_titles_data) of 7079 pairs of reviews and review titles for review title generation or domain adaptive training [3], intended to address the domain shift problem in abstractive opinion summarization; and an annotated reviews dataset (annotated_sentiment) for rating-based sentiment classification. All datasets were collected from the Skytrax Review Portal via web scraping using the Python programming language. The datasets have several potential use cases. The abstractive summarization dataset can serve as a benchmark for airline review summarization. The dataset for domain adaptive training can also be used as a standalone dataset for review title generation. The sentiment analysis dataset is multipurpose: it has columns such as user rating and recommendation value, which can be used for statistical analysis, such as finding the correlation between these items, as well as for other Natural Language Processing (NLP) tasks, such as predicting the rating or recommendation value from customer reviews. The datasets can be extended using various data augmentation techniques [4,5]. Moreover, the datasets are related and can be used together to develop a multi-task learning model [6] for better learning efficiency and improved performance.
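As a minimal sketch of the statistical use case mentioned above, the correlation between user rating and recommendation value could be computed as follows. The column names `rating` and `recommended`, the yes/no encoding, and the sample rows are illustrative assumptions, not the published schema or contents of annotated_sentiment.

```python
import csv
import io
from math import sqrt

# Hypothetical sample rows. The column names "rating" and "recommended" and
# the yes/no encoding are assumptions, not the published dataset schema.
SAMPLE_CSV = """rating,recommended
9,yes
2,no
7,yes
1,no
8,yes
4,no
"""


def load_pairs(text):
    """Parse CSV text into two numeric lists: ratings and 0/1 recommendations."""
    ratings, recs = [], []
    for row in csv.DictReader(io.StringIO(text)):
        ratings.append(float(row["rating"]))
        recs.append(1.0 if row["recommended"].strip().lower() == "yes" else 0.0)
    return ratings, recs


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


ratings, recs = load_pairs(SAMPLE_CSV)
r = pearson(ratings, recs)
print(f"Correlation between rating and recommendation: {r:.3f}")
```

With a binary recommendation column, the Pearson coefficient is equivalent to the point-biserial correlation. To run this on the real data, replace `SAMPLE_CSV` with the contents of the downloaded file and adjust the column names to match.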