Join query enhancement processing (jqpro) with big rdf data on a distributed system using hashing-merge join technique

Bibliographic Details
Main Author: Nahla Mohammedelzein, Elawad Babiker
Format: Thesis
Language: English
Published: 2021
Subjects:
Online Access: http://umpir.ump.edu.my/id/eprint/38471/1/Join%20query%20enhancement%20processing%20%28jqpro%29%20with%20big%20rdf%20data%20on%20a%20distributed%20system%20using%20hashing-merge%20join%20technique.ir.pdf
id my-ump-ir.38471
record_format uketd_dc
institution Universiti Malaysia Pahang Al-Sultan Abdullah
collection UMPSA Institutional Repository
language English
advisor Mazlina, Abdul Majid
topic Q Science (General)
QA76 Computer software
description Semantic web technologies have emerged over the last few years across different fields of study, and their data are still growing rapidly. In particular, increased capabilities for storing and publishing data in standard open web formats have made the technology much more successful. As a result, the data are readable by humans and can also be processed by computers. The demand for complex multiple RDF queries is becoming significant as the number of RDF triples increases. Such complex queries occasionally produce many common subexpressions. It is therefore extremely challenging to reduce the number of RDF queries and the transmission time for a vast amount of related RDF data. Moreover, recent literature shows that join query processing of Big RDF data has introduced many problems with respect to execution time and throughput. Hash-based encoding gives low execution time but takes a long time to load and hence does not load all graphs. This is because the Resource Description Framework (RDF) collects and analyses large data in swarms, and therefore has to deal with the inherent challenge of efficient swarm storage. Effective storage and retrieval that can be applied to large amounts of possibly schema-less data has also proven exceedingly difficult for RDF data. For instance, it is particularly difficult to view semantic and SPARQL query languages, as well as huge and complex graph patterns. To address this problem, a Join Query Processing Model (JQPro) is introduced for Big RDF data. The objectives of this research are: (i) to formulate plan generator algorithms for join query processing on the basis of previous research; (ii) to develop an enhanced model of Join Query Processing (JQPro) based on SPARQL and Hadoop MapReduce, using the hashing-merge join technique to process Big RDF data; and (iii) to evaluate and compare the performance of the JQPro model with existing models in terms of execution time, throughput, and CPU utilization.
Throughput was used to measure the units of information that the system can process in a given time frame. In addition, CPU utilization was measured as an important resource element in big join query processing, particularly during the map and reduce phases. Furthermore, the hash-join and sort-merge algorithms were used to generate the join query processing because of their capacity to allow more datasets to be joined. The relations were sorted on the join attributes and the sorted relations were merged, so the join column grouped the rows of the datasets that have the same value. The sort-merge join algorithm sorts the datasets on the joining attribute and then finds matching tuples by merging the two datasets. Then, a processing framework for RDF queries was introduced and the benchmark was used for performance evaluation. Finally, validation was conducted by standard statistical analysis to compare the performance of the JQPro model with current models. The synthetic benchmarks Lehigh University Benchmark (LUBM) and Waterloo SPARQL Diversity Test Suite (WatDiv) v06 were used for measurement. The experiment was carried out on three datasets ranging from 10 million to 1 billion RDF triples, produced by the WatDiv data generator with scale factors of 10, 100 and 1000, respectively. A selective dataset for each experimental query was also used for processing RDF data with the LUBM benchmark at sizes of 500, 1000 and 2000 million triples. The results revealed a strong correlation between execution time and throughput, with a strength of 99.9% as confirmed by the Pearson correlation coefficient. Furthermore, the findings show that the JQPro solution was comparable to gStore, RDF-3X, RDFox and PARJ, and the improvement in performance was 87.77% in terms of execution time.
CPU utilization was significantly increased by extensive map and reduce code computing. It is therefore inferred that the JQPro solution is timely and innovative, as it provides efficient execution time and CPU utilization, allowing users to perform better queries for Big RDF data processing in a seamless manner.
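The abstract describes combining hashing with a sort-merge join but does not give the algorithm itself. The following is only a minimal single-machine sketch of the general idea, not the thesis's JQPro implementation: rows are first partitioned by a hash of the join key (loosely mirroring how a MapReduce shuffle routes equal keys to the same reducer), and each pair of matching partitions is then joined by sorting and merging. The function names, the `(key, value)` row layout, the use of `zlib.crc32` as the hash, and the sample data are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def partition(rows, n_parts):
    """Bucket (key, value) rows by a deterministic hash of the join key."""
    parts = defaultdict(list)
    for key, value in rows:
        bucket = zlib.crc32(key.encode("utf-8")) % n_parts
        parts[bucket].append((key, value))
    return parts


def sort_merge_join(left, right):
    """Join two lists of (key, value) rows on key by sorting, then merging."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out


def hash_merge_join(left, right, n_parts=4):
    """Hash-partition both inputs, then sort-merge join matching partitions."""
    lparts, rparts = partition(left, n_parts), partition(right, n_parts)
    out = []
    for bucket in lparts.keys() & rparts.keys():
        out.extend(sort_merge_join(lparts[bucket], rparts[bucket]))
    return sorted(out)


# Hypothetical bindings for two triple patterns that share the subject ?x:
#   ?x worksFor ?dept    and    ?x email ?mail
works_for = [("alice", "dept1"), ("bob", "dept2"), ("carol", "dept1")]
email = [
    ("bob", "bob@example.org"),
    ("alice", "alice@example.org"),
    ("alice", "a2@example.org"),
    ("dave", "dave@example.org"),
]
joined = hash_merge_join(works_for, email)
```

Because equal keys always hash to the same bucket, no matches are lost by joining bucket by bucket, and in a distributed setting each bucket could be handled by a separate worker. Here `joined` contains one row per matching pair, so `alice` appears twice and `carol` and `dave` do not appear at all.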
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Nahla Mohammedelzein, Elawad Babiker
title Join query enhancement processing (jqpro) with big rdf data on a distributed system using hashing-merge join technique
granting_institution Universiti Malaysia Pahang
granting_department Faculty of Computing
publishDate 2021
url http://umpir.ump.edu.my/id/eprint/38471/1/Join%20query%20enhancement%20processing%20%28jqpro%29%20with%20big%20rdf%20data%20on%20a%20distributed%20system%20using%20hashing-merge%20join%20technique.ir.pdf
_version_ 1783732291992813568
spelling my-ump-ir.38471 2023-08-25T02:15:45Z 2021-08 Q Science (General) QA75 Electronic computers. Computer science QA76 Computer software Thesis http://umpir.ump.edu.my/id/eprint/38471/ pdf en public phd doctoral Universiti Malaysia Pahang Faculty of Computing Mazlina, Abdul Majid