A large dataset of scientific text reuse in Open-Access publications.

Affiliations

Text Mining and Retrieval Group, Leipzig University, Leipzig, DE-04109, Germany.

ScaDS.AI, Center for Scalable Data Analytics and Artificial Intelligence, Leipzig, DE-04105, Germany.

Publication information

Sci Data. 2023 Jan 26;10(1):58. doi: 10.1038/s41597-022-01908-z.

DOI:10.1038/s41597-022-01908-z
PMID:36702840
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9879940/
Abstract

We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains 91 million cases of reused text passages found in 4.2 million unique open-access publications. Cases range from overlap of as few as eight words to near-duplicate publications and include a variety of reuse types, ranging from boilerplate text to verbatim copying to quotations and paraphrases. Featuring a high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous ones on scientific writing. The Webis-STEREO-21 does not indicate if a reuse case is legitimate or not, as its focus is on the general study of text reuse in science, which is legitimate in the vast majority of cases. It allows for tackling a wide range of research questions from different scientific backgrounds, facilitating both qualitative and quantitative analysis of the phenomenon as well as a first-time grounding on the base rate of text reuse in scientific publications.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7865/9879940/39fd63f496c3/41597_2022_1908_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7865/9879940/bd2ece10ae4b/41597_2022_1908_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7865/9879940/d412964efef3/41597_2022_1908_Fig3_HTML.jpg
