An online density-based clustering algorithm for data stream based on local optimal radius and cluster pruning

Data stream clustering plays an important role in data stream mining for knowledge extraction. In recent years, numerous researchers have studied the online density-based clustering technique due to its capability to generate arbitrarily shaped clusters. The technique summarizes the data stream in m...

Full description

Saved in:
Bibliographic Details
Main Author: Islam, Md Kamrul
Format: Thesis
Language:English
Published: 2019
Subjects:
Online Access:http://umpir.ump.edu.my/id/eprint/31091/1/An%20online%20density-based%20clustering%20algorithm%20for%20data%20stream%20based%20on%20local%20optimal%20radius%20and%20cluster%20pruning.wm.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Data stream clustering plays an important role in data stream mining for knowledge extraction. In recent years, numerous researchers have studied the online density-based clustering technique due to its capability to generate arbitrarily shaped clusters. The technique summarizes the data stream in micro-clusters and the micro-clusters form the clusters. However, most of the clusters are either not fully online, or cannot handle the properties of data stream properly. Moreover, the algorithms require predefining the global optimal radius of micro-clusters, which is a difficult task, and an erroneous choice deteriorates the cluster quality. In addition, the algorithms ignore the presence of temporarily irrelevant micro-clusters, which may be relevant in the future. This ignorance causes the degradation of clustering quality and the increase of the processing time as micro-clusters are deleted and created frequently due to evolving nature of data stream. In this study, a fully online density-based clustering algorithm called Buffer-based Online Clustering for Evolving Data Stream (BOCEDS) is presented. BOCEDS clusters the data stream in a single stage. The algorithm summarizes the data from data stream in micro-clusters. This algorithm maintains the local optimal radius of micro-clusters rather than a global and constant radius. Moreover, it introduces a buffer for storing irrelevant micro-clusters and a fully online pruning process for extracting the temporarily irrelevant micro-cluster from the buffer. The pruning process improves processing time. In addition, BOCEDS proposes an online micro-cluster energy updating function based on the spatial information of the data stream. Then, clustering graphs are generated based on the connectivity among micro-clusters. The clusters are generated from the clustering graphs. To evaluate the performance, BOCEDS algorithm is executed on two syntactic and one practical data streams. The experimental result shows BOCEDS is able to generate new clusters and remove outdated clusters with time as data stream contents change. The experiment on noisy data stream shows that BOCEDS algorithm can detect noise with an accuracy of approximately 100%. The overall clustering accuracy and purity are more than 99%. Experimental results are compared with other alternative online/offline hybrid density-based clustering algorithms. The average processing time for data point in the data stream is about 2 milliseconds which is much lower than the aligned clustering algorithms in literature. The algorithm is also more scalable to high dimensional data stream than the existing algorithms. The sensitivity of clustering parameters in BOCEDS is also measured. The result shows that in case of changing the values of parameters the cluster quality deviates by a very small amount (<1%). These results prove the superiority of BOCEDS algorithm over the existing clustering algorithms. The BOCEDS algorithm is then applied to real-world weather data streams to demonstrate its capability to detect the drifts in the data stream and discover arbitrarily shaped clusters.