Speech and Language Laboratory, The Australian National University, Building #9, Canberra, ACT 2600, Australia; Linguistics Program, School of Culture, History and Language, College of Asia and the Pacific, The Australian National University, Building #9, Canberra, ACT 2600, Australia.
Sci Justice. 2023 Mar;63(2):181-199. doi: 10.1016/j.scijus.2022.12.007. Epub 2023 Jan 3.
This study empirically demonstrates the efficacy of a two-level Dirichlet-multinomial statistical model (the Multinomial system) for computing likelihood ratios (LRs) for linguistic text evidence represented by multiple stylometric feature types with discrete values. The LRs are calculated separately for each feature type, namely word, character and part-of-speech N-grams (N = 1, 2, 3), and are then combined into overall LRs through logistic regression fusion. The Multinomial system's performance is compared with that of a previously proposed system based on the cosine distance (the Cosine system) using the same data (i.e., documents collated from 2160 authors). The experimental results show that: (1) with the fused feature types, the Multinomial system outperforms the Cosine system by a log-LR cost of ca. 0.01 to 0.05 bits; and (2) the Multinomial system gains more in performance from longer documents than the Cosine system does. Although the Cosine system is more robust overall against the sampling variability arising from the number of authors included in the reference and calibration databases, the Multinomial system can achieve reasonable stability in performance; for example, the standard deviation of the log-LR cost, measured over 10 random samplings of authors for the reference and calibration databases, falls below 0.01 with 60 or more authors in each database.
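The two quantities the abstract relies on, the fused LR and the log-LR cost, can be illustrated with a minimal Python sketch. It assumes the standard definition of the log-LR cost (Cllr) and a plain linear combination of per-feature log LRs, which is the form logistic regression fusion produces once its weights are trained. The function names, weights and sample values are hypothetical and are not taken from the study.

```python
import math


def fuse_log_lrs(log_lrs, weights, bias):
    """Linear fusion of per-feature-type log LRs.

    The weights and bias would normally be trained by logistic regression
    on a calibration database; here they are hypothetical inputs.
    """
    return bias + sum(w * x for w, x in zip(weights, log_lrs))


def cllr(same_author_lrs, diff_author_lrs):
    """Log-LR cost in bits (standard definition).

    Lower is better; a system that always outputs LR = 1 scores exactly 1.
    """
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs)
    return 0.5 * (same_term / len(same_author_lrs)
                  + diff_term / len(diff_author_lrs))


# Illustrative values only: LRs above 1 support the same-author hypothesis,
# LRs below 1 support the different-author hypothesis.
cost = cllr([8.0, 20.0, 0.9], [0.05, 0.3, 1.5])
```

A cost below 1 indicates that the LRs carry useful information, so a difference of 0.01 to 0.05 bits between two systems corresponds to a small shift on this scale.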