Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification

Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires removing these limitations. In this research work, a novel practical training dataset generation and automatic retrai...

Bibliographic Details
Main Author: Zarei, Roozbeh
Format: Thesis
Language: English
Published: 2012
Subjects: TK Electrical engineering. Electronics Nuclear engineering
Online Access: http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf
id my-utm-ep.33398
record_format uketd_dc
spelling my-utm-ep.33398 2018-05-27T08:07:40Z 2012-01 Thesis http://eprints.utm.my/id/eprint/33398/ http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:72709?site_name=Restricted Repository masters Universiti Teknologi Malaysia, Faculty of Electrical Engineering
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic TK Electrical engineering. Electronics Nuclear engineering
description Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires removing these limitations. In this research work, a novel practical training dataset generation method and an automatic retraining mechanism for on-line P2P traffic classification are proposed. These two proposals are integrated in a system that removes the limitations of ML classification and makes it suitable for on-line P2P traffic classification. For the first part, a novel two-stage training dataset generation is proposed that combines a 3-class heuristic classification and a 3-class statistical classification to generate the training dataset accurately. In the heuristic stage, traffic is classified as P2P, nonP2P or unknown. In the statistical stage, a dual-Decision Tree (DT) is built from the dataset generated in the heuristic stage to classify the unknown traffic into three classes, which reduces the amount of traffic left classified as unknown. The final training dataset is generated from all flows that are classified in these two stages. In the second part of the system, an automatic retraining mechanism is proposed to meet the need for retraining the ML classifier: it detects changes in traffic behavior and updates the on-line ML classifier with a recent, accurate training dataset. This mechanism evaluates the accuracy of the on-line ML classifier against flows labeled by the two-stage training dataset generation. The on-line ML classifier is retrained if its accuracy falls below a predefined threshold. The proposed system has been evaluated on traces captured from the Universiti Teknologi Malaysia (UTM) campus network between October and November 2011. The overall results show that the two-stage training dataset generation can generate an accurate training dataset by classifying more than 95% of total flows with high accuracy (98.59%) and a low false positive rate (0.91%). The on-line ML classifier, which is built with the J48 algorithm and the training dataset generated by the two-stage training dataset generation, classifies traffic with high accuracy (99%) using 25 features extracted from the first 5 packets of each flow. The results also show that the automatic retraining mechanism allows the on-line ML classifier to maintain its accuracy above a set threshold over time.
format Thesis
qualification_level Master's degree
author Zarei, Roozbeh
title Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification
granting_institution Universiti Teknologi Malaysia, Faculty of Electrical Engineering
granting_department Faculty of Electrical Engineering
publishDate 2012
url http://eprints.utm.my/id/eprint/33398/5/RoozehZareiMFKE2012.pdf
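
The retraining rule in the description (retrain the on-line classifier when its accuracy, measured against flows labeled by the two-stage dataset generation, falls below a predefined threshold) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the 0.95 threshold, and the majority-class trainer standing in for J48 are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]            # hypothetical per-flow feature vector
Labeled = Tuple[Features, str]        # (features, reference label such as "P2P" / "nonP2P")
Classifier = Callable[[Features], str]


def batch_accuracy(predict: Classifier, labeled_flows: List[Labeled]) -> float:
    """Fraction of flows where the on-line classifier agrees with the reference labels."""
    if not labeled_flows:
        raise ValueError("need at least one labeled flow")
    hits = sum(1 for features, label in labeled_flows if predict(features) == label)
    return hits / len(labeled_flows)


def maybe_retrain(
    predict: Classifier,
    train: Callable[[List[Labeled]], Classifier],
    labeled_flows: List[Labeled],
    threshold: float = 0.95,          # assumed value; the abstract only says "predefined"
) -> Tuple[Classifier, bool]:
    """Return (classifier, retrained). Retrain only if accuracy drops below the threshold."""
    if batch_accuracy(predict, labeled_flows) < threshold:
        return train(labeled_flows), True
    return predict, False


def train_majority(labeled_flows: List[Labeled]) -> Classifier:
    """Toy stand-in for a real learner (the thesis uses J48): always predict the majority label."""
    majority = Counter(label for _, label in labeled_flows).most_common(1)[0][0]
    return lambda features: majority


# A stale classifier that labels everything as nonP2P, checked on a mostly-P2P batch.
stale: Classifier = lambda features: "nonP2P"
batch: List[Labeled] = [((1.0,), "P2P")] * 3 + [((0.0,), "nonP2P")]
updated, retrained = maybe_retrain(stale, train_majority, batch)
```

In a deployment the `train` callable would wrap a real decision-tree learner, and `labeled_flows` would be the most recent output of the two-stage labeling; both are left abstract here.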