Institute of Applied Computer Science, Lodz University of Technology, Stefanowskiego 18/22, 90-001 Łódź, Poland.
Sensors (Basel). 2021 Nov 20;21(22):7728. doi: 10.3390/s21227728.
The presented paper is concerned with the detection of presentation attacks against unsupervised remote biometric speaker verification, using a well-known challenge-response scheme. We propose a novel approach to convolutional phoneme classifier training, which ensures high phoneme recognition accuracy even for significantly simplified network architectures, thus enabling efficient utterance verification on resource-limited hardware, such as mobile phones or embedded devices. We consider Deep Convolutional Neural Networks operating on windows of speech Mel-Spectrograms as a means for phoneme recognition, and we show that one can boost the performance of highly simplified neural architectures by modifying the principle underlying training set construction. Instead of generating training examples by slicing spectrograms using a sliding window, as is commonly done, we propose to maximize the consistency of the phoneme-related spectrogram structures to be learned by choosing only spectrogram chunks from the central regions of phoneme articulation intervals. This approach enables better utilization of the limited capacity of the considered simplified networks, as it significantly reduces within-class data scatter. We show that neural architectures comprising as few as tens of thousands of parameters can successfully solve the 39-phoneme recognition task with accuracy of up to 76% (we use the English-language TIMIT database for experimental verification of the method). We also show that ensembling simple classifiers, using a basic bagging method, boosts the recognition accuracy by another 2-3%, yielding Phoneme Error Rates at the level of 23%, which approaches the accuracy of state-of-the-art deep neural architectures that are one to two orders of magnitude more complex than the proposed solution. This, in turn, enables reliable presentation attack detection, based on challenges just a few syllables long, on highly resource-limited computing hardware.
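The training-set construction principle described above can be illustrated with a short sketch: instead of sliding a fixed window over the whole Mel-Spectrogram, a single fixed-width chunk is cut from the centre of each phoneme's articulation interval. This is an illustrative reconstruction, not the authors' code; the function name, the 11-frame window width, and the (label, start_frame, end_frame) alignment format (modelled on TIMIT-style phone annotations) are assumptions.

```python
import numpy as np

def central_phoneme_chunks(mel_spec, phone_intervals, win_frames=11):
    """Cut one fixed-width training chunk from the centre of each phoneme's
    articulation interval, rather than sliding a window over the utterance.

    mel_spec        -- (n_mels, n_frames) Mel-Spectrogram of one utterance
    phone_intervals -- list of (label, start_frame, end_frame) alignments
                       (TIMIT-style; format assumed for illustration)
    win_frames      -- chunk width in frames (illustrative default)
    """
    half = win_frames // 2
    chunks, labels = [], []
    for label, start, end in phone_intervals:
        centre = (start + end) // 2
        lo, hi = centre - half, centre + half + 1
        # Skip phones too close to the utterance edges to yield a full window.
        if lo < 0 or hi > mel_spec.shape[1]:
            continue
        chunks.append(mel_spec[:, lo:hi])
        labels.append(label)
    return chunks, labels
```

Because every chunk is anchored at the most stable part of the articulation, the examples for a given phoneme class vary less than arbitrary sliding-window slices would, which is the within-class scatter reduction the abstract refers to.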