Speech synthesis module with adaptive emotional expression /


Bibliographic Details
Main Author: Mahmood, Ahmed Mustafa (Author)
Format: Thesis
Language:English
Published: Kuala Lumpur : Kulliyyah of Engineering, International Islamic University Malaysia, 2010
Subjects:
Online Access:http://studentrepo.iium.edu.my/handle/123456789/5230
Description
Summary:Computer-generated speech is replacing conventional text-based interaction methods. Early speech synthesis produced a human-sounding voice that lacked emotional expression, and such speech does not encourage users to interact with computers. Emotional speech synthesis remains one of the challenges of speech synthesis research. The quality of emotional speech synthesis is judged by its intelligibility and its similarity to natural speech. High-quality speech is achievable using the computationally expensive unit selection technology, which relies on large sets of recorded speech segments to achieve optimum quality. Diphone synthesis technology, on the other hand, requires far less computation and storage. Its quality is lower than that of unit selection; however, with the introduction of digital signal processing algorithms such as PSOLA, more natural results have become achievable. Emotional speech synthesis research has two significant trends: unit selection based synthesis, which aims to fulfil market needs regardless of resource utilization, and diphone based synthesis, which is often non-commercial and oriented towards developing intelligent algorithms that use minimum resources to achieve natural output. In this thesis, the possibility of achieving high-quality speech using low computational cost systems is investigated. Diphone synthesis is chosen as the speech synthesis technology. The existing approaches to emotion emulation are analysed to determine aspects that could be further enhanced. Two aspects are highlighted: the relation of formants to emotion, and the deterministic nature of the relation of pitch patterns to emotion. These aspects have not received much attention in existing approaches. Two algorithms are proposed to address them: a formant manipulation algorithm and a deterministic pitch pattern generation algorithm. These algorithms are incorporated into one TTS system.
The quality of the proposed system's synthesized speech is evaluated using recently developed objective evaluation methods. The results show small simulation errors: the mean square error values for the happy, sad, fear and anger emotions are 0.03225, 0.12928, 0.02513 and 0.02429, respectively. This small margin of error provides evidence of the accuracy of the proposed system.
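The mean square error reported above compares the synthesized output against a natural-speech target. A minimal sketch of such a computation is given below; the contour values are hypothetical and purely illustrative (they are not data from the thesis), and the thesis's actual evaluation method may differ in which acoustic features it compares.

```python
def mean_square_error(synthesized, natural):
    """Mean square error between two equal-length sequences,
    e.g. pitch or formant contours of synthesized vs. natural speech."""
    if len(synthesized) != len(natural):
        raise ValueError("contours must have equal length")
    return sum((s - n) ** 2 for s, n in zip(synthesized, natural)) / len(natural)

# Hypothetical pitch contours in Hz, for illustration only:
natural_contour = [220.0, 231.0, 240.0, 236.0]
synth_contour = [221.0, 230.0, 241.0, 235.0]
print(mean_square_error(synth_contour, natural_contour))
```

A lower value indicates a synthesized contour closer to the natural target, which is why the small values quoted in the summary are taken as evidence of accuracy.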
Item Description:Abstracts in English and Arabic.
"A dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science (Computer and Information Engineering)." --On title page.
Physical Description:xv, 117 leaves : illustrations ; 30 cm.
Bibliography:Includes bibliographical references (leaves 81-84).