Shim Hye-Jin, Jung Jee-Weon, Yu Ha-Jin
School of Computer Science, University of Seoul, Seoul, 02504, South Korea.
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.
J Acoust Soc Am. 2024 Oct 1;156(4):2701-2708. doi: 10.1121/10.0032393.
Although recent state-of-the-art speaker recognition systems show almost perfect performance, analysis of the speaker embeddings they produce has been lacking thus far. An in-depth analysis of speaker representations is performed by examining which features the model selects. To this end, various intermediate representations of the trained model are observed using graph attentive feature aggregation, which comprises a graph attention layer and a graph pooling layer followed by a readout operation. For the analysis, the TIMIT dataset, which has comparably restricted conditions (e.g., region and phoneme), is used after pre-training the model on the VoxCeleb dataset and freezing its weight parameters. Extensive experiments reveal a consistent trend in speaker representation: the models learn to exploit sequence and phoneme information despite receiving no supervision in that direction. The results shed light on speaker embeddings, which are still widely regarded as a black box.
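The abstract describes a three-stage aggregation over frame-level features: a graph attention layer, a graph pooling layer, and a readout. A minimal numpy sketch of that pipeline is given below. It is an illustration only, not the authors' implementation: the fully connected frame graph, GAT-style attention, self-attention top-k pooling, and mean-plus-max readout are all assumptions, and every weight (`W`, `a`, `p`) is a randomly initialized placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention_layer(H, W, a):
    """One GAT-style layer over a fully connected graph of frames.

    H: (N, F) frame-level node features
    W: (F, Fp) projection; a: (2*Fp,) attention vector (both hypothetical).
    """
    Z = H @ W                                   # project nodes: (N, Fp)
    Fp = Z.shape[1]
    # pairwise attention logits e_ij = LeakyReLU(a1.Zi + a2.Zj)
    logits = leaky_relu((Z @ a[:Fp])[:, None] + (Z @ a[Fp:])[None, :])
    att = softmax(logits, axis=1)               # normalize over neighbours
    return np.tanh(att @ Z)                     # attended node features

def topk_pool(H, p, k):
    """Self-attention top-k pooling: keep the k highest-scoring nodes."""
    scores = H @ p / np.linalg.norm(p)
    idx = np.sort(np.argsort(scores)[-k:])      # keep original frame order
    return H[idx] * np.tanh(scores[idx])[:, None]

def readout(H):
    """Mean || max readout to a fixed-size utterance-level vector."""
    return np.concatenate([H.mean(axis=0), H.max(axis=0)])

# Toy run: 10 frames with 8-dim features, projected to 4 dims.
H = rng.normal(size=(10, 8))
W = rng.normal(size=(8, 4))
a = rng.normal(size=8)
H1 = graph_attention_layer(H, W, a)             # (10, 4)
pooled = topk_pool(H1, rng.normal(size=4), k=5) # (5, 4)
emb = readout(pooled)                           # (8,)
print(emb.shape)
```

Inspecting which frames survive `topk_pool` (and which neighbours receive high weights in `att`) is the kind of probe the abstract uses to ask which phonemes and sequence positions the embedding relies on.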