Authorship identification of documents with high content similarity.

Authors

Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman

Affiliation

Know-Center GmbH, Inffeldgasse 13, Graz, Austria.

Publication information

Scientometrics. 2018;115(1):223-237. doi: 10.1007/s11192-018-2661-6. Epub 2018 Feb 2.

DOI:10.1007/s11192-018-2661-6
PMID:29527072
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5838116/
Abstract

Our work is motivated by the task of associating segments of text with their real authors. We focus on analyzing how humans judge different writing styles; this analysis can help to better understand the process and thus to simulate or mimic such behavior. Unlike most work in this field (e.g. authorship attribution and plagiarism detection), which relies on content features, we focus only on the stylometric, i.e. content-agnostic, characteristics of authors. We therefore conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing; the second was a qualitative one carried out by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. In the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (1) help to improve algorithms such as automatic authorship attribution and plagiarism detection, (2) assist forensic experts or linguists in creating profiles of writers, (3) support intelligence applications that analyze aggressive and threatening messages, and (4) help editors enforce conformity with, for instance, a journal-specific writing style.
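
To make the distinction between content features and stylometric (content-agnostic) features more concrete, here is a minimal Python sketch of what a few style measurements might look like. It is an illustration only: the abstract does not list the features used in the study, so the function name `stylometric_features`, the short `FUNCTION_WORDS` list, and the choice of measurements are all assumptions.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption, not taken from the paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "for")


def stylometric_features(text: str) -> dict[str, float]:
    """Return a few content-agnostic style measurements for a text."""
    # Naive sentence split: runs of ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)

    features = {
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "comma_rate": text.count(",") / n_words,
    }
    # Relative frequency of each function word; topic words are ignored.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features


if __name__ == "__main__":
    sample = "The cat sat. The dog ran, and the bird flew."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Because none of these measurements depends on which topic words appear, two documents about the same subject can still differ on them, which is the situation the study examines.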

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/714e/5838116/4d1209c7ed0d/11192_2018_2661_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/714e/5838116/29bab7acabb6/11192_2018_2661_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/714e/5838116/58f171904fcf/11192_2018_2661_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/714e/5838116/d234666b670f/11192_2018_2661_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/714e/5838116/b467144cfaa1/11192_2018_2661_Fig5_HTML.jpg

Cited by

1
Interpol review of questioned documents 2016-2019.
Forensic Sci Int Synerg. 2020 Apr 12;2:429-441. doi: 10.1016/j.fsisyn.2020.01.012. eCollection 2020.