Faculty of Electrical Engineering and Informatics, Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, Magyar tudósok körútja 2, Budapest, 1117, Hungary.
Doctoral School of Law Enforcement, Hungarian National University of Public Service, H-1083 Budapest, 2 Ludovika tér, Budapest, H-1441, Hungary.
J Forensic Sci. 2023 May;68(3):871-883. doi: 10.1111/1556-4029.15250. Epub 2023 Mar 31.
Deep learning has recently become widely used in forensic voice comparison, mainly to learn speaker representations known as embeddings or embedding vectors. Speaker embeddings are usually trained on corpora that consist mostly of widely spoken languages. Language dependency is therefore an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different from the languages the model was trained on. For a low-resource language, developing a forensic corpus with enough speakers to train deep learning models is costly. This study investigates whether a model pre-trained on a multilingual (mostly English) corpus can be used on a low-resource target language (here, Hungarian) that is not represented in the model's training data. Because multiple samples from the offender (unknown speaker) are often not available, samples are compared pairwise, with and without speaker enrollment for the suspect (known) speakers. Three corpora are used: two developed specifically for forensic purposes and a third intended for traditional speaker verification. Speaker embedding vectors are extracted with the x-vector and ECAPA-TDNN techniques. Speaker verification is evaluated in the likelihood-ratio (LR) framework, and the language combinations used for modeling, LR calibration, and evaluation are compared. Results are assessed with the log-likelihood-ratio cost (Cllr) and equal error rate (EER) metrics. The results show that a model pre-trained on a different language, but on a corpus with a large number of speakers, can be used on samples with language mismatch. Sample duration and speaking style also appear to affect performance.
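The two metrics named in the abstract can be illustrated with a minimal sketch. It follows the standard textbook definitions of Cllr and EER and is not the authors' evaluation code. The function names `cllr` and `eer` are hypothetical, the inputs are assumed to be natural-log likelihood ratios for same-speaker and different-speaker comparisons, and the EER is approximated by a simple threshold sweep without interpolation.

```python
import numpy as np


def cllr(same_llr, diff_llr):
    """Log-likelihood-ratio cost (Cllr).

    Assumption: inputs are natural-log LRs, one per pairwise comparison.
    Cllr = 0.5 * (mean log2(1 + 1/LR) over same-speaker pairs
                  + mean log2(1 + LR) over different-speaker pairs)
    """
    same = np.asarray(same_llr, dtype=float)
    diff = np.asarray(diff_llr, dtype=float)
    # logaddexp(0, x) = ln(1 + e^x); dividing by ln 2 converts to log base 2.
    same_cost = np.mean(np.logaddexp(0.0, -same)) / np.log(2.0)
    diff_cost = np.mean(np.logaddexp(0.0, diff)) / np.log(2.0)
    return 0.5 * (same_cost + diff_cost)


def eer(same_scores, diff_scores):
    """Approximate equal error rate via a sweep over observed scores."""
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    thresholds = np.sort(np.concatenate([same, diff]))
    # Miss rate: same-speaker scores that fall below the threshold.
    miss = np.array([np.mean(same < t) for t in thresholds])
    # False-alarm rate: different-speaker scores at or above the threshold.
    false_alarm = np.array([np.mean(diff >= t) for t in thresholds])
    # Take the threshold where the two error rates are closest.
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])
```

Under these definitions, a system that always outputs LR = 1 (no information) has Cllr = 1, and lower values of both metrics indicate better performance. Cllr is sensitive to calibration, while EER depends only on how well the two score distributions are separated.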