Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali

High-quality analysis requires good-quality data. However, missing data is often a major problem in many areas of scientific research, including air quality studies. Missing values reduce prediction accuracy and bias the analysis results. This situation shows the importance...

Bibliographic Details
Main Author: Muhammad Ghazali, Shamihah
Format: Thesis
Language: English
Published: 2022
Subjects: Data processing; Indoor air pollution. Including indoor air quality
Online Access: https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf
id my-uitm-ir.66929
record_format uketd_dc
spelling my-uitm-ir.66929 2022-09-19T07:00:23Z Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali 2022 Muhammad Ghazali, Shamihah Data processing Indoor air pollution. Including indoor air quality 2022 Thesis https://ir.uitm.edu.my/id/eprint/66929/ https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf text en public masters Universiti Teknologi MARA (UiTM) Faculty of Computer and Mathematical Sciences Shaadan, Norshaida Idrus, Zainura
institution Universiti Teknologi MARA
collection UiTM Institutional Repository
language English
advisor Shaadan, Norshaida
Idrus, Zainura
topic Data processing
description High-quality analysis requires good-quality data. However, missing data is often a major problem in many areas of scientific research, including air quality studies. Missing values reduce prediction accuracy and bias the analysis results, which makes imputation methods that replace missing values with estimated values important. A literature search found little discussion of appropriate imputation methods for air quality data sets with a Single-Site Temporal Time-Dependent (SSTTD) multivariate structure, particularly when the missing values occur in long gap sequences. This study proposes several imputation methods based on empirical orthogonal functions (EOF) to address that gap. The EOF method, sometimes called Principal Component Analysis (PCA), is a promising technique for estimating missing values. However, the existing EOF imputation method has a drawback: it centres the data matrix on the statistical mean before the EOF computation. Because air quality data sets often contain extreme observations caused by climatic variations and random processes, the existing approach needs to be modified before it is applied to them. Centring the matrix on the median or the trimmed mean therefore appears more suitable. This study introduces several proposed EOF-based methods and investigates their ability to estimate missing values in long gaps, focusing on air quality (PM10) in an SSTTD multivariate data set from Malaysia. The existing EOF-based method, the mean-centred EOF approach (EOF-mean), is compared with several proposed EOF-based methods: EOF based on the median (EOF-median), EOF based on the trimmed mean (EOF-trimmean) and the newly applied Regularized Expectation Maximization Principal Component Analysis (R-EMPCA).
The study used real PM10 data from the Klang and Shah Alam air quality monitoring stations. The methods were evaluated by comparing the imputed values in artificial missing data sets with the true observed values in a reference (complete) data set. The artificial missing data sets were created from an identified reference (complete) data set following several patterns, defined by four percentages of missing data (5, 10, 20 and 30) and four long sequence (gap) sizes (12, 24, 168 and 720 missing hourly points), at both study locations. Based on several performance indicators, including RMSE, MAE, R-squared and AI, R-EMPCA performed best, with the highest accuracy in estimating the missing values, followed by EOF-trimmean. To improve the results further, the imputed values were refined using a B-spline Roughness Penalty (RP) smoothing approach, which resulted in the proposed R-EMPCA-RP and EOF-trimmean-RP imputation methods. The RP approach proved beneficial.
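The description above outlines EOF-based imputation with a choice of centring statistic, evaluated on artificially created gaps. The following is a minimal sketch of that general idea only, not the thesis's implementation: it assumes a common iterative truncated-SVD scheme, a 10% trimming proportion, one retained mode, and a noise-free synthetic three-column data set. The function names (`trimmed_mean`, `eof_impute`) and all parameter values are hypothetical, and R-EMPCA and the roughness penalty smoothing step are not covered.

```python
import numpy as np

def trimmed_mean(x, proportion=0.1):
    """Mean after dropping `proportion` of the sorted values from each tail."""
    x = np.sort(x)
    k = int(proportion * x.size)
    return x[k:x.size - k].mean()

# Centring statistics named in the description (labels are illustrative).
CENTRES = {"mean": np.mean, "median": np.median, "trimmean": trimmed_mean}

def eof_impute(X, n_modes=1, centre="trimmean", n_iter=500, tol=1e-8):
    """Fill NaNs in X (rows = hours, columns = variables) by iterative truncated SVD."""
    missing = np.isnan(X)
    # Column centres are computed from the observed entries only.
    c = np.array([CENTRES[centre](col[~np.isnan(col)]) for col in X.T])
    # Work on anomalies; missing entries start at zero (the column centre).
    A = np.where(missing, 0.0, X - c)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        # Reconstruction from the leading n_modes EOFs.
        recon = (U[:, :n_modes] * s[:n_modes]) @ Vt[:n_modes]
        change = np.max(np.abs(recon[missing] - A[missing]), initial=0.0)
        A[missing] = recon[missing]  # observed entries are never altered
        if change < tol:
            break
    return A + c

# Synthetic, noise-free reference data: 240 hours, three correlated columns.
t = np.arange(240)
daily = np.sin(2 * np.pi * t / 24)
truth = np.column_stack([50 + 20 * daily, 30 + 10 * daily, 70 - 15 * daily])

# Artificial long gap: 24 consecutive hours missing in the first column.
gappy = truth.copy()
gappy[100:124, 0] = np.nan

filled = eof_impute(gappy, n_modes=1, centre="trimmean")

# Compare imputed values with the reference values inside the gap.
gap = np.isnan(gappy)
rmse = np.sqrt(np.mean((filled[gap] - truth[gap]) ** 2))
mae = np.mean(np.abs(filled[gap] - truth[gap]))
print(f"RMSE = {rmse:.2e}, MAE = {mae:.2e}")
```

Passing `centre="mean"` or `centre="median"` switches the centring statistic, which is the only difference between the three EOF variants in this sketch. On real data with extreme observations and noise, the number of retained modes and the trimming proportion would need to be chosen by validation.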
format Thesis
qualification_level Master's degree
author Muhammad Ghazali, Shamihah
title Long gap imputation in air quality (PM10) data set using improvised EOF-based method with roughness penalty approach / Shamihah Muhammad Ghazali
granting_institution Universiti Teknologi MARA (UiTM)
granting_department Faculty of Computer and Mathematical Sciences
publishDate 2022
url https://ir.uitm.edu.my/id/eprint/66929/1/66929.pdf
_version_ 1783735648105005056