Hypothesis testing for evaluating a multimodal pattern recognition framework applied to speaker detection.

Besson Patricia, Kunt Murat

Signal Processing Institute (ITS), Ecole Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland.

J Neuroeng Rehabil. 2008 Mar 27;5:11. doi: 10.1186/1743-0003-5-11.

BACKGROUND

Speaker detection is an important component of many human-computer interaction applications, such as multimedia indexing and ambient intelligent systems. This work addresses the problem of detecting the current speaker in audio-visual sequences. The detector requires only minimal, simple equipment: a single camera and a single microphone suffice.

METHOD

A multimodal pattern recognition framework is proposed, with solutions provided for each step of the process: feature generation and extraction, classification, and evaluation of the system performance. The decision is based on an estimate of the synchrony between the audio and video signals. Prior to classification, an information-theoretic framework is applied to extract optimized audio features using video information. The classification step is then defined through a hypothesis testing framework in order to obtain confidence levels associated with the classifier outputs, thereby allowing the performance of the whole multimodal pattern recognition system to be evaluated.
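The abstract does not specify the synchrony estimator or the test statistic. As an illustration only, the sketch below assumes a histogram estimate of mutual information as the synchrony measure and a permutation test for the null hypothesis that the audio and video features are independent. The function names, the bin count, and the synthetic signals are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(X;Y) in bits for two 1-D signals."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = pxy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def synchrony_p_value(audio, video, n_perm=500, seed=0):
    """Permutation test of H0: audio and video features are independent."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(audio, video)
    # Shuffling the audio samples breaks their alignment with the video.
    null = np.array([mutual_information(rng.permutation(audio), video)
                     for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(p_value)

# Synthetic stand-ins for per-frame features (assumed, not real data).
rng = np.random.default_rng(1)
video = rng.normal(size=400)                      # e.g. mouth-region motion
audio_sync = video + 0.5 * rng.normal(size=400)   # audio tied to this face
audio_other = rng.normal(size=400)                # audio from someone else

mi_sync, p_sync = synchrony_p_value(audio_sync, video)
mi_other, p_other = synchrony_p_value(audio_other, video)
print(f"synchronous: MI={mi_sync:.3f} bits, p={p_sync:.3f}")
print(f"unrelated:   MI={mi_other:.3f} bits, p={p_other:.3f}")
```

A small p-value is evidence against independence, so the corresponding face would be labelled as the current speaker at the chosen significance level. Shuffling individual samples ignores temporal correlation within each signal, so a real system would likely need a more careful null model.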

RESULTS

Through the hypothesis testing approach, the classifier performance can be expressed as a ratio of detection to false-alarm probabilities. Above all, the hypothesis tests provide a means of measuring the efficiency of the whole pattern recognition process. In particular, the gain offered by the proposed feature extraction step can be evaluated. It is shown that introducing such a feature extraction step increases the ability of the classifier to produce good relative instance scores and, therefore, improves the performance of the pattern recognition process.
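To make the detection and false-alarm probabilities concrete, the following minimal sketch estimates both from labelled classifier scores at a fixed threshold and forms their ratio. The scores, labels, threshold, and function name are invented for illustration and do not reproduce results from the paper.

```python
import numpy as np

def detection_false_alarm(scores, labels, threshold):
    """Empirical detection and false-alarm probabilities at one threshold.

    labels: 1 where the candidate really is the speaker, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    decide_speaker = scores >= threshold
    p_d = decide_speaker[labels].mean()    # speakers correctly detected
    p_fa = decide_speaker[~labels].mean()  # non-speakers wrongly flagged
    return float(p_d), float(p_fa)

# Hypothetical scores: first four are true speakers, last four are not.
scores = [0.9, 0.8, 0.35, 0.7, 0.2, 0.1, 0.6, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

p_d, p_fa = detection_false_alarm(scores, labels, threshold=0.5)
ratio = p_d / p_fa if p_fa > 0 else float("inf")
print(p_d, p_fa, ratio)  # 0.75 0.25 3.0
```

Comparing this ratio with and without a feature extraction step, at matched thresholds or across a sweep of thresholds, is one way the gain from that step could be quantified.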

CONCLUSION

The power of hypothesis tests as an evaluation tool is exploited to assess the performance of a multimodal pattern recognition process. In particular, the benefit of performing a feature extraction step prior to classification, compared with omitting it, is evaluated. Although the proposed framework is used here for detecting the speaker in audiovisual sequences, it could be applied to any other classification task involving two spatio-temporally co-occurring signals.

Similar articles

Nonlinear dynamic neural network for text-independent speaker identification using information theoretic learning technology.

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:2442-5. doi: 10.1109/IEMBS.2006.260525.

Speech sound classification and detection of articulation disorders with support vector machines and wavelets.

Conf Proc IEEE Eng Med Biol Soc. 2006;2006:2199-202. doi: 10.1109/IEMBS.2006.259499.

4-D facial expression recognition by learning geometric deformations.

IEEE Trans Cybern. 2014 Dec;44(12):2443-57. doi: 10.1109/TCYB.2014.2308091.

Performance enhancement for audio-visual speaker identification using dynamic facial muscle model.

Med Biol Eng Comput. 2006 Oct;44(10):919-30. doi: 10.1007/s11517-006-0106-5. Epub 2006 Sep 26.

Audio-visual active speaker tracking in cluttered indoors environments.

IEEE Trans Syst Man Cybern B Cybern. 2008 Jun;38(3):799-807. doi: 10.1109/TSMCB.2008.922063.

Monotonicity and error type differentiability in performance measures for target detection and tracking in video.

IEEE Trans Pattern Anal Mach Intell. 2013 Oct;35(10):2553-60. doi: 10.1109/TPAMI.2013.70.

Integrating face and gait for human recognition at a distance in video.

IEEE Trans Syst Man Cybern B Cybern. 2007 Oct;37(5):1119-37. doi: 10.1109/tsmcb.2006.889612.

Subject-specific and pose-oriented facial features for face recognition across poses.

IEEE Trans Syst Man Cybern B Cybern. 2012 Oct;42(5):1357-68. doi: 10.1109/TSMCB.2012.2191773. Epub 2012 Apr 25.

Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences.

IEEE Trans Syst Man Cybern B Cybern. 2006 Apr;36(2):433-49. doi: 10.1109/tsmcb.2005.859075.
