Speech emotion recognition using spectrograms and convolutional neural networks /

Bibliographic Details
Main Author: Majid, Taiba (Author)
Format: Thesis
Language: English
Published: Kuala Lumpur : Kulliyyah of Engineering, International Islamic University Malaysia, 2021
Subjects: Automatic speech recognition; Speech processing systems
Online Access:http://studentrepo.iium.edu.my/handle/123456789/10783
LEADER 050290000a22003850004500
008 210811s2021 my a f m 000 0 eng d
040 |a UIAM  |b eng  |e rda 
041 |a eng 
043 |a a-my--- 
050 0 0 |a TK7882.S65 
100 1 |a Majid, Taiba,  |e author  |9 4762 
245 1 0 |a Speech emotion recognition using spectrograms and convolutional neural networks /  |c by Taiba Majid 
264 1 |a Kuala Lumpur :  |b Kulliyyah of Engineering, International Islamic University Malaysia,  |c 2021 
300 |a xiv, 135 leaves :  |b colour illustrations ;  |c 30 cm. 
336 |2 rdacontent  |a text 
337 |2 rdamedia  |a unmediated 
337 |2 rdamedia  |a computer 
338 |2 rdacarrier  |a volume 
338 |2 rdacarrier  |a online resource 
347 |2 rdaft  |a text file  |b PDF 
500 |a Abstracts in English and Arabic. 
500 |a "A dissertation submitted in fulfilment of the requirement for the degree of Master of Science (Communication Engineering)." --On title page. 
502 |a Thesis (MSCE)--International Islamic University Malaysia, 2020. 
504 |a Includes bibliographical references (leaves 119-127). 
520 |a Speech Emotion Recognition (SER) is the task of recognising the emotional aspects of speech irrespective of its semantic content. Recognising human speech emotions has gained much importance in recent years as a means of improving both the naturalness and efficiency of Human-Machine Interaction (HMI). Deep learning techniques have proved better suited to emotion recognition than traditional techniques because of advantages such as fast, scalable training, all-purpose parameter fitting and highly flexible function approximation. Nevertheless, there is no common consensus on how to measure or categorise emotions, as they are subjective. The crucial aspects of an SER system are the selection of a speech emotion corpus (database), the recognition of the various features inherent in speech, and a flexible model for the classification of those features. This research therefore proposes a variant architecture of the Convolutional Neural Network (CNN), known as the Deep Stride Convolutional Neural Network (DSCNN), which uses the plain-nets strategy to learn discriminative features and then classify them. The main objective is to formulate an optimal model by using a smaller number of convolutional layers and eliminating the pooling layers to increase computational stability; this elimination tends to increase the accuracy and decrease the computational time of the SER system. Instead of pooling layers, convolutions with larger strides are used for the necessary dimensionality reduction. The CNN and DSCNN are trained on three databases: the German Berlin Emotional Database (Emo-DB), the English Surrey Audio-Visual Expressed Emotion (SAVEE) database, and the Hindi Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC). After preprocessing, the speech signals of the three databases are converted to clean spectrograms by applying the short-time Fourier transform (STFT). Four emotions (angry, happy, neutral and sad) are considered in the evaluation, and F1 scores are calculated for all considered emotions on all databases. Evaluation results show that the proposed architectures of both the CNN and the DSCNN outperform state-of-the-art models in terms of validation accuracy. The proposed CNN architecture improves accuracy by an absolute 6.37%, 9.72% and 5.22% on Emo-DB, SAVEE and IITKGP-SEHSC respectively, while the DSCNN architecture improves performance by an absolute 6.37%, 10.72% and 7.22% on the same databases, compared with the best existing models. Furthermore, the proposed DSCNN architecture outperforms the proposed CNN architecture in computational time on all three databases, with differences of 60, 58 and 56 seconds for Emo-DB, SAVEE and IITKGP-SEHSC respectively over 300 epochs. This study sets new benchmarks on all three databases for future work, demonstrating the effectiveness and significance of the proposed SER techniques. Future work is warranted to examine the capability of the CNN and DSCNN for voice-based gender identification and image/video-based emotion recognition. 
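[Editor's sketch] The pipeline described in the 520 abstract (STFT spectrograms fed to a pooling-free, strided CNN) can be illustrated with a short Python sketch. This is not the dissertation's code: the sample rate, FFT size, fixed frame count, layer widths and optimizer are assumed values chosen for illustration, and librosa/Keras are assumed tooling. It shows the two stages the abstract names: converting a speech signal to a log-magnitude STFT spectrogram, and classifying it with a CNN in which strided convolutions replace pooling layers for dimensionality reduction.

# A minimal sketch, not the thesis implementation. Assumed values throughout:
# 16 kHz sampling, 512-point FFT, 128-frame crop, three strided conv layers.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def speech_to_spectrogram(path, sr=16000, n_fft=512, hop_length=256, frames=128):
    """Load a speech file, trim leading/trailing silence, and return a
    log-magnitude STFT spectrogram padded or cropped to a fixed width."""
    signal, _ = librosa.load(path, sr=sr)
    signal, _ = librosa.effects.trim(signal)            # simple preprocessing step
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    spec = librosa.amplitude_to_db(np.abs(stft), ref=np.max)  # shape (257, T)
    if spec.shape[1] < frames:                          # pad short utterances
        spec = np.pad(spec, ((0, 0), (0, frames - spec.shape[1])))
    return spec[:, :frames, np.newaxis]                 # shape (257, 128, 1)

def build_dscnn(input_shape=(257, 128, 1), num_classes=4):
    """Pooling-free CNN: strided convolutions perform the dimensionality
    reduction that pooling layers would normally provide (the DSCNN idea)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
        layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
        layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # angry/happy/neutral/sad
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Training for 300 epochs, as in the reported experiments, would then be model.fit(x_train, y_train, epochs=300). The design point the abstract makes is visible in Conv2D(..., strides=2): each strided convolution halves the feature-map dimensions, doing the downsampling a MaxPooling2D layer would otherwise perform.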
650 0 |a Automatic speech recognition  |9 4162 
650 0 |a Speech processing systems  |9 4163 
655 7 |a Theses, IIUM local 
690 |a Dissertations, Academic  |x Department of Electrical and Computer Engineering  |z IIUM  |9 4446 
710 2 |a International Islamic University Malaysia.  |b Department of Electrical and Computer Engineering  |9 4449 
856 4 |u http://studentrepo.iium.edu.my/handle/123456789/10783 
900 |a sz-asbh 
942 |2 lcc  |c THESIS  |n 0 
999 |c 439422  |d 472713 
952 |0 0  |1 0  |2 lcc  |4 0  |6 T T K7882 S65 M00233S 02021  |7 3  |8 IIUMTHESIS  |9 762296  |a IIUM  |b IIUM  |c THESIS  |d 2022-06-23  |g 0.00  |o t TK 7882 S65 M233S 2021  |p 11100392670  |r 1900-01-02  |t 2  |v 0.00  |y THESIS