Unveiling scientific articles from paper mills with provenance analysis.

Affiliations

Artificial Intelligence Lab. Recod.ai, Institute of Computing, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil.

Department of Computer Science, Loyola University Chicago, Chicago, Illinois, United States of America.

Publication information

PLoS One. 2024 Oct 30;19(10):e0312666. doi: 10.1371/journal.pone.0312666. eCollection 2024.

Abstract

The increasing prevalence of fake publications created by paper mills poses a significant challenge to maintaining scientific integrity. While integrity analysts typically rely on textual and visual clues to identify fake articles, determining which papers merit further investigation can be akin to searching for a needle in a haystack, as these fake publications have unrelated authors and are published in unrelated venues. To address this challenge, we developed a new methodology for provenance analysis, which automatically tracks and groups suspicious figures and documents. Our approach groups manuscripts from the same paper mill by analyzing their figures and identifying duplicated and manipulated regions. These regions are linked and organized in a provenance graph, providing evidence of systematic production. We tested our solution on a paper mill dataset of hundreds of documents and also on a larger version of the dataset that included thousands of additional documents deliberately selected to distract our method. Our approach successfully identified and linked systematically produced articles in both datasets by pinpointing the figures they reused and manipulated from one another. The proposed technique offers a promising solution for identifying fraudulent manuscripts, and it could be a valuable tool for supporting scientific integrity.
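
The grouping step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that some upstream detector has already produced pairwise figure matches with a similarity score, and it simply links documents that share a matched figure and reports the connected groups. The function name `group_documents`, the `min_score` threshold, and all identifiers in the example data are hypothetical.

```python
from collections import defaultdict


def group_documents(figure_to_doc, matches, min_score=0.5):
    """Group documents linked by matched figures (illustrative sketch only).

    figure_to_doc: dict mapping a figure id to the id of its document.
    matches: iterable of (figure_a, figure_b, score) tuples, assumed to come
             from an upstream duplicated-region detector (hypothetical input).
    min_score: hypothetical threshold; weaker matches are ignored.
    """
    # Build an undirected graph over documents: two documents are linked
    # when at least one pair of their figures matches strongly enough.
    graph = defaultdict(set)
    for fig_a, fig_b, score in matches:
        doc_a, doc_b = figure_to_doc[fig_a], figure_to_doc[fig_b]
        if score >= min_score and doc_a != doc_b:
            graph[doc_a].add(doc_b)
            graph[doc_b].add(doc_a)

    # Collect connected components with an iterative depth-first search.
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            doc = stack.pop()
            component.append(doc)
            for neighbor in graph[doc]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(component))
    return groups


# Made-up example data: three papers chained by reused figures,
# plus one weak match that falls below the threshold.
figure_to_doc = {
    "A-fig1": "paperA",
    "B-fig2": "paperB",
    "C-fig1": "paperC",
    "D-fig1": "paperD",
    "E-fig1": "paperE",
}
matches = [
    ("A-fig1", "B-fig2", 0.9),
    ("B-fig2", "C-fig1", 0.8),
    ("D-fig1", "E-fig1", 0.2),
]
groups = group_documents(figure_to_doc, matches)
```

In this sketch, documents that never reach the threshold with any other document are left out of the result, which mirrors the idea of surfacing only candidates that show evidence of shared figure content. A provenance graph as described in the abstract would carry more information than this, such as which regions were reused and how they were manipulated.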

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4ced/11524478/079b86c148d6/pone.0312666.g001.jpg
