Speech synthesis module with adaptive emotional expression
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | Kuala Lumpur : Kulliyyah of Engineering, International Islamic University Malaysia, 2010 |
Subjects: | |
Online Access: | http://studentrepo.iium.edu.my/handle/123456789/5230 |
Summary: | Computer-generated speech replaces conventional text-based interaction methods. Early speech synthesis produced a human-like voice that lacked emotional expression, and such speech does not encourage users to interact with computers. Emotional speech synthesis is one of the challenges of speech synthesis research. The quality of emotional speech synthesis is judged by its intelligibility and its similarity to natural speech. High-quality speech is achievable with unit selection technology, which has a high computational cost and relies on huge sets of recorded speech segments to achieve optimum quality. Diphone synthesis technology, on the other hand, uses fewer computational resources and less storage space. Its quality is lower than that of unit selection; however, the introduction of digital signal processing algorithms such as PSOLA has made more natural results achievable. Emotional speech synthesis research has two significant trends. The first is unit selection based synthesis, which aims to fulfil market needs regardless of resource utilization. The second is diphone-based synthesis, which is often non-commercial and oriented towards developing intelligent algorithms that use minimal resources to achieve natural output. In this thesis, the possibility of achieving high-quality speech using low computational cost systems is investigated. Diphone synthesis is chosen as the speech synthesis technology. The existing approaches to emotional emulation are analysed to determine aspects that could be further enhanced. Two aspects are highlighted: the relation of formants to emotions and the deterministic nature of the relation between pitch patterns and emotion. These aspects have not received much attention in existing approaches. Two algorithms are proposed to address them: a formant manipulation algorithm and a deterministic pitch pattern generation algorithm. These algorithms are incorporated into one TTS system.
The quality of speech synthesis of the proposed system is evaluated using recently developed objective evaluation methods. The results show small simulation errors: the mean square error values for the happy, sad, fear and anger emotions are 0.03225, 0.12928, 0.02513 and 0.02429, respectively. These error values provide evidence of the accuracy of the proposed system. |
---|---|
Item Description: | Abstracts in English and Arabic. "A dissertation submitted in partial fulfilment of the requirements for the degree of Master of Science (Computer and Information Engineering)." --On title page. |
Physical Description: | xv, 117 leaves : illustrations ; 30 cm. |
Bibliography: | Includes bibliographical references (leaves 81-84). |
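
The summary reports mean square error (MSE) values per emotion but the record does not say which quantities were compared. As a rough illustration only, the sketch below computes an MSE between two equal-length sequences, assuming they are a reference pitch contour and a synthesized one. The function name `mean_square_error` and both contours are hypothetical and are not taken from the thesis.

```python
def mean_square_error(reference, synthesized):
    """Return the average squared difference between two equal-length sequences."""
    if len(reference) != len(synthesized):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    squared_diffs = ((r - s) ** 2 for r, s in zip(reference, synthesized))
    return sum(squared_diffs) / len(reference)


# Hypothetical normalized pitch contours (made-up values, not thesis data).
reference_contour = [0.50, 0.60, 0.70, 0.60]
synthesized_contour = [0.50, 0.50, 0.70, 0.80]

mse = mean_square_error(reference_contour, synthesized_contour)
print(f"MSE: {mse:.5f}")
```

With these made-up contours the squared differences are 0, 0.01, 0 and 0.04, so the result is approximately 0.0125. A lower value means the synthesized sequence tracks the reference more closely.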