Unstructured big data processing in cloud computing environment by using Amazon Elastic Map Reduce

Bibliographic Details
Main Author: Busu, Norzaharawani
Format: Thesis
Language: English
Published: 2017
Subjects:
Online Access: http://psasir.upm.edu.my/id/eprint/67852/1/FSKTM%202017%2024%20IR.pdf
Description
Summary: The growing expansion of data content on the web delivers a huge amount of collective resources. Twitter, one of the biggest social media sites, collects millions of tweets every day, amounting to data in the petabyte range per year. People share their experiences and thoughts, or simply talk about whatever concerns them, online. Unstructured big data from social media plays a vital role in sentiment analysis, also known as opinion mining. Structured and unstructured data are generated continuously and on a large scale every day. These data are meaningless unless they are captured and analyzed. Traditional RDBMS technology becomes less reliable when dealing with huge amounts of structured data, and processing becomes sluggish if the infrastructure is not upgraded to match the data volume. Furthermore, an RDBMS is not capable of dealing with unstructured data. Because petabytes of records are generated on the net every year, capturing and analyzing big data can be challenging, and cloud computing technologies can provide on-demand infrastructure and services based on user requirements. Therefore, this thesis aims to use cloud-based infrastructure, namely Amazon Web Services, to capture unstructured big data, and afterward to analyze, visualize and extract useful information from large, diverse, distributed and mixed data gathered from public data sets and Twitter’s Application Programming Interface (API). The results and explanation of the experiments, presented in Chapter Four, show the test bed results for collecting Twitter data, for processing the Twitter input data, and for the output data. The analysis emphasizes the elapsed time when collecting Twitter data and the performance of Amazon Elastic MapReduce (EMR). The infrastructure provided by Amazon Web Services is proficient enough to capture and manipulate large volumes of unstructured big data from Twitter. The study then tested the capability of Amazon Elastic MapReduce (EMR) to process the Twitter input data collected earlier and transform it into meaningful output that can be used for decision making.
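This record does not say which job the thesis ran on EMR. As a rough illustration of the MapReduce model that EMR executes, the following is a minimal in-memory sketch, assuming the collected tweets are stored one JSON object per line with a `text` field and that the goal is to count hashtags. The function names, the sample lines, and the hashtag-counting task are all hypothetical and are not taken from the thesis.

```python
import json
import re
from collections import defaultdict

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def map_tweet(line):
    """Map step: emit (hashtag, 1) for every hashtag in one JSON tweet line."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return []  # skip malformed records instead of failing the job
    if not isinstance(tweet, dict):
        return []
    text = tweet.get("text") or ""
    return [(tag.lower(), 1) for tag in HASHTAG.findall(text)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run map then reduce locally; a cluster would distribute both steps."""
    pairs = []
    for line in lines:
        pairs.extend(map_tweet(line))
    return reduce_counts(pairs)


# Hypothetical sample input, including one malformed line.
SAMPLE_LINES = [
    '{"text": "Trying out #EMR on #AWS"}',
    '{"text": "More #aws experiments"}',
    "not json",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LINES))
```

On a real cluster the map and reduce functions would run as separate tasks over input split across many nodes; the sketch only shows how unstructured tweet text becomes aggregated key/value output.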