Big data processing on educational data mining using pyspark with jupyter notebook

The rapid advancement of information technology brings new challenges and places new demands on our education system. Teaching and learning have moved from the classroom to Computer Aided Learning (CAL) systems. Big data technology and machine learning play an important role in Computer...

Bibliographic Details
Main Author: Ravichandran, Vinitha
Format: Thesis
Language: English
Published: 2018
Online Access: http://eprints.utm.my/id/eprint/81375/1/VinithaRavichandranMFC2018.pdf
id my-utm-ep.81375
record_format uketd_dc
spelling my-utm-ep.81375 2019-08-23T04:06:50Z Big data processing on educational data mining using pyspark with jupyter notebook 2018 Ravichandran, Vinitha The rapid advancement of information technology brings new challenges and places new demands on our education system. Teaching and learning have moved from the classroom to Computer Aided Learning (CAL) systems. Big data technology and machine learning play an important role in CAL systems because of the massive amount of data these systems generate. This has led to the rapid development of data mining in education, known as Educational Data Mining (EDM). The abundance of data collected by such systems can be used to analyse, predict and address many societal issues in the education field, such as improving the quality of education and predicting and monitoring educational outcomes. Effectively analysing and predicting students’ performance can make a CAL system a better platform for learning than traditional learning. Machine learning techniques were used to obtain reliable and accurate predictions of students’ performance. Apache Hadoop was the backbone of big data technology until the emergence of Apache Spark. However, only a few studies have applied Apache Spark to EDM. In this dissertation, PySpark was integrated with Jupyter Notebook to perform EDM on the Educational Process Mining (EPM) data set. Spark MLlib was used to compare four classification algorithms (Logistic Regression, Naïve Bayes, Decision Tree and Random Forest) on the EPM data set. In this study, the Random Forest classifier outperformed the other classifiers in Accuracy, Area Under the Precision-Recall (PR) Curve and Area Under the Receiver Operating Characteristic (ROC) Curve, although with a slightly slower Execution Time. The Random Forest classifier was therefore the best of the four classifiers compared for EDM on this data set.
2018 Thesis http://eprints.utm.my/id/eprint/81375/ http://eprints.utm.my/id/eprint/81375/1/VinithaRavichandranMFC2018.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:119718 masters Universiti Teknologi Malaysia Computer Science
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
description The rapid advancement of information technology brings new challenges and places new demands on our education system. Teaching and learning have moved from the classroom to Computer Aided Learning (CAL) systems. Big data technology and machine learning play an important role in CAL systems because of the massive amount of data these systems generate. This has led to the rapid development of data mining in education, known as Educational Data Mining (EDM). The abundance of data collected by such systems can be used to analyse, predict and address many societal issues in the education field, such as improving the quality of education and predicting and monitoring educational outcomes. Effectively analysing and predicting students’ performance can make a CAL system a better platform for learning than traditional learning. Machine learning techniques were used to obtain reliable and accurate predictions of students’ performance. Apache Hadoop was the backbone of big data technology until the emergence of Apache Spark. However, only a few studies have applied Apache Spark to EDM. In this dissertation, PySpark was integrated with Jupyter Notebook to perform EDM on the Educational Process Mining (EPM) data set. Spark MLlib was used to compare four classification algorithms (Logistic Regression, Naïve Bayes, Decision Tree and Random Forest) on the EPM data set. In this study, the Random Forest classifier outperformed the other classifiers in Accuracy, Area Under the Precision-Recall (PR) Curve and Area Under the Receiver Operating Characteristic (ROC) Curve, although with a slightly slower Execution Time. The Random Forest classifier was therefore the best of the four classifiers compared for EDM on this data set.
format Thesis
qualification_level Master's degree
author Ravichandran, Vinitha
spellingShingle Ravichandran, Vinitha
Big data processing on educational data mining using pyspark with jupyter notebook
author_facet Ravichandran, Vinitha
author_sort Ravichandran, Vinitha
title Big data processing on educational data mining using pyspark with jupyter notebook
title_short Big data processing on educational data mining using pyspark with jupyter notebook
title_full Big data processing on educational data mining using pyspark with jupyter notebook
title_fullStr Big data processing on educational data mining using pyspark with jupyter notebook
title_full_unstemmed Big data processing on educational data mining using pyspark with jupyter notebook
title_sort big data processing on educational data mining using pyspark with jupyter notebook
granting_institution Universiti Teknologi Malaysia
granting_department Computer Science
publishDate 2018
url http://eprints.utm.my/id/eprint/81375/1/VinithaRavichandranMFC2018.pdf
_version_ 1747818316508954624