Text Mining and Retrieval Group, Leipzig University, Leipzig, DE-04109, Germany.
ScaDS.AI, Center for Scalable Data Analytics and Artificial Intelligence, Leipzig, DE-04105, Germany.
Sci Data. 2023 Jan 26;10(1):58. doi: 10.1038/s41597-022-01908-z.
We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains 91 million cases of reused text passages found in 4.2 million unique open-access publications. Cases range from overlaps of as few as eight words to near-duplicate publications and cover a variety of reuse types, from boilerplate text and verbatim copying to quotations and paraphrases. Featuring high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous datasets on scientific writing. Webis-STEREO-21 does not indicate whether a reuse case is legitimate, as its focus is on the general study of text reuse in science, which is legitimate in the vast majority of cases. It allows for tackling a wide range of research questions from different scientific backgrounds, facilitating both qualitative and quantitative analysis of the phenomenon as well as a first grounding of the base rate of text reuse in scientific publications.