Phoneme based speech to text translation system for Malaysian English pronunciation

Bibliographic Details
Main Author: Sathees Kumar, Nataraj
Format: Thesis
Language:English
Subjects:
Online Access:http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/1/Page%201-24.pdf
http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/2/Full%20text.pdf
id my-unimap-31909
record_format uketd_dc
institution Universiti Malaysia Perlis
collection UniMAP Institutional Repository
language English
topic Phoneme
Speech signal processing
English language
Speech to text translation
Speech recognition systems
spellingShingle Phoneme
Speech signal processing
English language
Speech to text translation
Speech recognition systems
Sathees Kumar, Nataraj
Phoneme based speech to text translation system for Malaysian English pronunciation
description Speech is the most common and vocalized form of human communication. Communication through speech conveys linguistic information and also expresses information about the speaker's social and regional origin, health and emotional state. Recent improvements in phoneme-based speech-to-text translation have made it one of the most exciting areas of speech signal processing; owing to major advances in the statistical modeling of speech, automatic speech recognition systems have found widespread application in tasks that require a human-machine interface. Speech-to-text translation systems can be used in many applications such as medical transcription (digital speech to text), automated transcription, telematics and air traffic control. In this research work, two isolated-word speech signal databases have been built, namely the Vowels Class Word Database (VCWD) and the Phonemes Class Word Database (PCWD). The VCWD was initially built to classify isolated words based on eleven classes of vowels. The database has been analyzed using four spectral analysis techniques, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Perceptual Linear Predictive analysis (PLP) and Relative Spectral Perceptual Linear Predictive analysis (RASTA-PLP), to determine the most discriminative features and to identify the network parameters. The PCWD has been built to develop the phoneme-based speech-to-text translation system using Linear Predictive Coefficients (LPC) and Multilayer Neural Network (MLNN) models with a fusion concept for the classification of isolated words and phonemes. The isolated-word speech signals are recorded using a speech acquisition algorithm developed with a MATLAB graphical user interface (GUI). The speech signals are recorded for 15 seconds at a 16 kHz sampling frequency. The recorded speech signals are pre-processed and segmented into voiced and unvoiced parts. A simple fuzzy voice classifier has been proposed to extract the voiced portions using frame energy and change-in-energy features. The extracted voiced portions are pre-processed and divided into a number of frames. For each frame, spectral features are extracted and used as the feature set for classification. The classification tasks for the isolated words and phonemes are associated with the extracted features to establish an input-output mapping. The data are then normalized and randomized to rearrange the values into a definite range. The Multilayer Neural Network (MLNN) model has been developed with four combinations of input and hidden activation functions. To improve the performance rate and reduce the training time, a simple systole activation function has been proposed. The neural network models are trained with 60%, 70% and 80% of the total data samples, and the trained networks are validated with the remaining 40%, 30% and 20% of the data samples by simulating the network. The performance of the network is calculated by measuring the true positives, false negatives and classification accuracy, and the results are compared. It is observed that the proposed fuzzy voice classifier is less complex and yields better accuracy than the other voiced/unvoiced classification methods available in the literature.
The LPC features show better discrimination, and the MLNN models trained on the LPC spectral band features give better classification accuracy than those trained with the other feature extraction algorithms. In addition, the proposed systole activation function reduces the training time and the number of epochs compared with the other network models.
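To make the voiced/unvoiced segmentation step concrete, the following is a minimal Python sketch of a frame-energy based fuzzy voice classifier. The frame length, hop size, membership breakpoints and the fuzzy rule itself are illustrative assumptions, not the values proposed in the thesis (which implemented the classifier in MATLAB).

import numpy as np

def frame_signal(x, frame_len=400, hop=200):
    # Split a 1-D signal into overlapping frames (25 ms frames, 12.5 ms hop at 16 kHz).
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def mu_high(v, lo, hi):
    # Piecewise-linear membership degree of "high", rising from lo to hi.
    return np.clip((v - lo) / (hi - lo), 0.0, 1.0)

def voiced_mask(x, frame_len=400, hop=200):
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)   # frame energy (log scale)
    delta = np.abs(np.diff(energy, prepend=energy[0]))       # change in energy
    e = (energy - energy.min()) / (np.ptp(energy) + 1e-10)   # normalise both features to [0, 1]
    d = delta / (delta.max() + 1e-10)
    # Assumed fuzzy rule: a frame is voiced if its energy is high AND the
    # change in energy is not abrupt; min acts as the fuzzy AND operator.
    mu_voiced = np.minimum(mu_high(e, 0.2, 0.5), 1.0 - mu_high(d, 0.5, 0.9))
    return mu_voiced > 0.5                                    # 0.5 alpha-cut to defuzzify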
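The per-frame LPC feature extraction can be sketched in the same spirit, building on the framing helpers above. This uses the standard autocorrelation method with the Levinson-Durbin recursion; the LPC order of 12 and the Hamming window are assumptions for illustration, not the thesis's settings.

import numpy as np

def lpc_coefficients(frame, order=12):
    # LPC analysis of one frame via the autocorrelation method and the
    # Levinson-Durbin recursion; returns [1, a1, ..., a_order].
    w = frame * np.hamming(len(frame))
    full = np.correlate(w, w, mode="full")
    r = full[len(w) - 1:len(w) + order]          # autocorrelation lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-10
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                            # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

# Hypothetical usage: one feature vector per voiced frame, stacked into a matrix X.
# frames = frame_signal(signal)
# X = np.vstack([lpc_coefficients(f)[1:] for f in frames[voiced_mask(signal)]])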
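Finally, the train/validate protocol described in the abstract (60/40, 70/30 and 80/20 splits, normalisation into a definite range, accuracy measurement) can be sketched as below. Scikit-learn's MLPClassifier with a logistic activation is used here only as a stand-in; the thesis's MLNN, its four activation-function combinations and the proposed systole activation function were implemented separately in MATLAB and are not reproduced here.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def evaluate_splits(X, y, train_fractions=(0.6, 0.7, 0.8), seed=0):
    # X: feature matrix (e.g. LPC vectors per word/phoneme), y: class labels.
    results = {}
    for frac in train_fractions:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, shuffle=True, stratify=y, random_state=seed)
        scaler = MinMaxScaler()                    # rearrange values into a definite range
        X_tr = scaler.fit_transform(X_tr)
        X_te = scaler.transform(X_te)
        net = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic",
                            max_iter=2000, random_state=seed)
        net.fit(X_tr, y_tr)
        results[frac] = accuracy_score(y_te, net.predict(X_te))
    return results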
format Thesis
author Sathees Kumar, Nataraj
author_facet Sathees Kumar, Nataraj
author_sort Sathees Kumar, Nataraj
title Phoneme based speech to text translation system for Malaysian English pronunciation
title_short Phoneme based speech to text translation system for Malaysian English pronunciation
title_full Phoneme based speech to text translation system for Malaysian English pronunciation
title_fullStr Phoneme based speech to text translation system for Malaysian English pronunciation
title_full_unstemmed Phoneme based speech to text translation system for Malaysian English pronunciation
title_sort phoneme based speech to text translation system for malaysian english pronunciation
granting_institution Universiti Malaysia Perlis (UniMAP)
granting_department School of Mechatronic Engineering
url http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/1/Page%201-24.pdf
http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/2/Full%20text.pdf
_version_ 1747836792918245376
spelling my-unimap-31909 2014-02-13T10:50:14Z Phoneme based speech to text translation system for Malaysian English pronunciation Sathees Kumar, Nataraj Universiti Malaysia Perlis (UniMAP) 2012 Thesis en http://dspace.unimap.edu.my:80/dspace/handle/123456789/31909 http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/1/Page%201-24.pdf c0b07f39f02d909c1ccc2ecd991f696f http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/2/Full%20text.pdf 1a7cb3afa51ce94b41c1246b0ef95d39 http://dspace.unimap.edu.my:80/xmlui/bitstream/123456789/31909/3/license.txt 8a4605be74aa9ea9d79846c1fba20a33 Phoneme Speech signal processing English language Speech to text translation Speech recognition systems School of Mechatronic Engineering