Rexha Andi, Kröll Mark, Ziak Hermann, Kern Roman
Know-Center GmbH, Inffeldgasse 13, Graz, Austria.
Scientometrics. 2018;115(1):223-237. doi: 10.1007/s11192-018-2661-6. Epub 2018 Feb 2.
Our work is motivated by the task of associating segments of text with their real authors. Here we focus on analyzing how humans judge different writing styles; understanding this process can help to simulate such behavior. Unlike the majority of work in related fields (e.g., authorship attribution and plagiarism detection), which relies on content features, we focus only on the stylometric, i.e., content-agnostic, characteristics of authors. We conducted two pilot studies to determine whether humans can identify authorship among documents with high content similarity. The first was a quantitative experiment involving crowd-sourcing, while the second was a qualitative one carried out by the authors of this paper. Both studies confirmed that this task is quite challenging. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis on the results of the studies. In the first experiment, we compared the decisions against content features and stylometric features. In the second, the evaluators described the process and the features on which their judgment was based. The findings of our detailed analysis could (1) help to improve algorithms for tasks such as automatic authorship attribution and plagiarism detection, (2) assist forensic experts or linguists in creating profiles of writers, (3) support intelligence applications that analyze aggressive and threatening messages, and (4) help editors enforce conformity with, for instance, a journal-specific writing style.