Development on SNR estimator for audio-visual speech recognition based on waveform amplitude distribution analysis


Bibliographic Details
Main Author: Thum, Wei Seong
Format: Thesis
Language:English
Published: 2018
Subjects:
Online Access:http://umpir.ump.edu.my/id/eprint/27969/1/Development%20on%20SNR%20estimator%20for%20audio-visual%20speech%20recognition%20based%20on%20waveform%20amplitude.pdf
Description
Summary:For audio-visual speech recognition (AVSR), which combines the audio modality with the visual modality, the performance of a speech recognition system can be improved, particularly when operating in a noisy environment. The audio modality is easily corrupted by ambient noise, which makes it difficult to correctly distinguish the actual speech signal from the noise signal. Signal-to-noise ratio (SNR) is a fundamental measure of the ratio of signal power to noise power, expressed in decibels (dB). One of the best-known SNR estimation techniques is waveform amplitude distribution analysis (WADA), which assumes that the amplitudes of speech and noise follow gamma and Gaussian distributions, respectively. It has been used in several research works as a benchmark for result comparison. However, there is no clear instruction on how to build its look-up table. In this work, the development and rebuilding of the look-up table using the author's own database, corrupted with general white noise as the noise reference, is proposed. The reconstructed WADA look-up table technique, known as waveform amplitude distribution analysis-white (WADA-W), enhances SNR estimation by referring to the reconstructed WADA-W look-up table instead of the general precomputed WADA look-up table. The proposed WADA-W SNR estimation technique was evaluated by developing an AVSR system that utilised mel-frequency cepstral coefficient (MFCC) features and shape-based visual features from two speech databases: LUNA-V and CUAVE. Experimental results showed that, by referring to the WADA-W look-up table, the technique performs consistent SNR estimation with more accurate and less biased results than the original WADA technique under four types of noise from the NOISEX-92 dataset: white, babble, factory1, and factory2.
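As background for the dB figures quoted in the abstract, a minimal sketch of the SNR definition it gives (ratio of signal power to noise power in decibels); this is a generic illustration, not code from the thesis:

```python
import math

def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(Ps / Pn)."""
    return 10.0 * math.log10(signal_power / noise_power)

# Equal signal and noise power gives 0 dB; a 10x power ratio gives 10 dB.
print(snr_db(1.0, 1.0))   # → 0.0
print(snr_db(10.0, 1.0))  # → 10.0
```

An SNR estimator such as WADA does not know the two powers separately; it infers the ratio from the amplitude distribution of the noisy mixture.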
The overall deviation of the SNR estimation for the LUNA-V database using the proposed WADA-W technique was only approximately 9.6 dB, whereas the deviations of the NIST and WADA techniques were approximately 42.3 dB and 67.3 dB respectively. Using the same proposed technique on the CUAVE database, the overall deviation of the SNR estimation was only 13.3 dB, whereas the deviations of the NIST and WADA techniques were 50.6 dB and 62.3 dB respectively. Classification was performed using a multi-stream hidden Markov model (MSHMM) with the leave-one-out cross-validation (LOOCV) technique. The experiments showed that the proposed AVSR system was able to achieve the highest accuracy of 96.6% on the LUNA-V database and 95.2% on the CUAVE database under clean conditions. In conclusion, the proposed WADA-W SNR estimator was able to improve accuracy by 4.5% and 12.7% over the original WADA technique on the LUNA-V and CUAVE databases respectively.
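The leave-one-out cross-validation protocol mentioned above can be sketched as follows; this is a generic illustration of LOOCV splitting, not the evaluation code used in the thesis:

```python
def leave_one_out_splits(samples):
    """Yield (train, test) pairs in which each sample is held out exactly once.

    With N samples this produces N folds: the model is trained on N-1
    samples and evaluated on the single remaining one, and the accuracy
    is averaged over all folds.
    """
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        test = [samples[i]]
        yield train, test

# Example: 3 utterances produce 3 folds, each holding out one utterance.
for train, test in leave_one_out_splits(["utt1", "utt2", "utt3"]):
    print(train, test)
```

LOOCV is a common choice for small audio-visual corpora such as LUNA-V and CUAVE, since it uses nearly all the data for training in every fold.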