Block-based neural network mapping on graphics processor unit

Block-based neural network (BbNN) was introduced to improve the training speed of artificial neural network. Various works had been carried out by previous researchers to improve training speed of BbNN system. Multithread BbNN training on field-programmable gate array (FPGA) limits training speed du...

Full description

Saved in:
Bibliographic Details
Main Author: Ong, Chin Tong
Format: Thesis
Language:English
Published: 2015
Subjects:
Online Access:http://eprints.utm.my/id/eprint/53959/1/OngChinTongMFKE2015.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Block-based neural network (BbNN) was introduced to improve the training speed of artificial neural network. Various works had been carried out by previous researchers to improve training speed of BbNN system. Multithread BbNN training on field-programmable gate array (FPGA) limits training speed due to low performance of Nios II software used for communication between central processing unit (CPU) and FPGA. This project aims to improve training speed of multithread BbNN block by mapping BbNN model into Compute Unified Device Architecture (CUDA) core. In this project, each BbNN block is mapped into a CUDA core with each core running on a single thread. The functional verification of BbNN core is carried out based on the BbNN output accuracy value. Near 100 percent accuracy value obtained is used to verify the CUDA mapped BbNN. The performance trade-off analysis had been carried out by comparing the accuracy value obtained from BbNN evolution on GPU versus CPU implementations. From the results obtained, it is found out that the performance of CUDA-mapped BbNN can only be as fast as CPU-mapped implementation. Although CUDA-mapped BbNN implementation run multiple BbNN blocks training in parallel, large data transfer between CPU and GPU dominates the performance gain in training multiple BbNN blocks in parallel. Besides that, a significant gain in training speed can only be seen if the order of complexity for GPU execution is at a higher order compared to the order of CPU-GPU data transfer. The result obtained in this project provides recommendation for future research works on how to further improve the training speed of CUDA-base BbNN implementation.