Modified frequency tables for visualisation and analysis of univariate and bivariate data

One way to make sense of data is to organize it into a more meaningful format called a frequency table. The existing continuous univariate frequency table uses the midpoint to represent the magnitude of observations in each class, which results in an error called grouping error. The use of the midp...

Bibliographic Details
Main Author:	Bappah, Mohammed Mohammed
Format:	Thesis
Language:	English
Published:	2021
Subjects:	Social sciences - Statistical methods; Statistics; Multivariate analysis
Online Access:	http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf

id	my-upm-ir.92818
record_format	uketd_dc
institution	Universiti Putra Malaysia
collection	PSAS Institutional Repository
language	English
advisor	Adam, Mohd Bakri
topic	Social sciences - Statistical methods; Statistics; Multivariate analysis
description	One way to make sense of data is to organize it into a more meaningful format called a frequency table. The existing continuous univariate frequency table uses the midpoint to represent the magnitude of observations in each class, which results in an error called grouping error. The use of the midpoint rests on the assumption that each class’s observations are uniformly distributed and concentrated around their midpoint, which is not always valid. The most significant parameter used when constructing the continuous frequency table is the number of classes or class width. Several rules for choosing the number of classes or class width have been reported in the literature; however, none has been proven to be better in all situations. The existing discrete frequency tables are simple to construct and easy to understand and interpret. However, when the number of elements in the data is substantial, the table can become complicated. The existing non-parametric correlation measure, the Kendall correlation method, becomes laborious when the number of paired continuous observations is large.
In this research, to address the issue of grouping error, we proposed three statistics (median, midrange, and random selection) to be used as the magnitude of observations in each class instead of the midpoint. For choosing the number of classes or class width, a new class width rule is proposed. We also proposed new discrete frequency tables that can be constructed by grouping the elements in the data into classes. Using the bivariate continuous frequency table, a new correlation measure that is straightforward and free of the normality assumption is developed. To address the issue of missing data in a univariate continuous frequency table, five different imputation methods are compared. The four methods (the midpoint and the three proposed statistics) and the binning rules are simultaneously compared using root mean-squared error (RMSE). For the comparison using real data, the absolute error is used instead. The proposed discrete frequency tables are described using simulated and real data, and the new bivariate continuous table’s correlation measure is illustrated using simulations and real data.
The comparison using the continuous frequency table’s measure of location, the mean, showed that the methods that used the median and midrange of observations in each class performed better than the other methods. In choosing the number of classes, the proposed class width rule is the best for data simulated from the normal and exponential distributions. Meanwhile, for data simulated from the uniform distribution, the square root rule performed better than the other rules. The methods’ evaluation using the frequency table’s measures of skewness and kurtosis indicated that the methods that used the median and midrange to represent the magnitude of observations in each class were again the best. The new discrete frequency tables can be a better choice, since they can handle datasets with a substantial number of elements and vividly reveal the significant features of datasets.
The results also showed that the new measure of correlation is approximately equal to the Kendall correlation. Indeed, it can be used when the data is discrete, and it is the best alternative when the number of paired observations is large. In handling missing data, the simulation results showed that the mean imputation method is the best, while the findings using real data indicated that mean imputation, k-nearest-neighbor imputation, and multiple imputation by chained equations were the best methods. Also, the five imputation methods’ performance is independent of the dataset and the percentage of missingness, and the error increases as the percentage of missing observations increases.
format	Thesis
qualification_level	Doctorate
author	Bappah, Mohammed Mohammed
title	Modified frequency tables for visualisation and analysis of univariate and bivariate data
granting_institution	Universiti Putra Malaysia
publishDate	2021
url	http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf
_version_	1747813774431092736
spelling	my-upm-ir.92818 2022-06-01T07:57:15Z Modified frequency tables for visualisation and analysis of univariate and bivariate data 2021-02 Bappah, Mohammed Mohammed Social sciences - Statistical methods Statistics Multivariate analysis 2021-02 Thesis http://psasir.upm.edu.my/id/eprint/92818/ http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf text en public doctoral Universiti Putra Malaysia Social sciences - Statistical methods Statistics Multivariate analysis Adam, Mohd Bakri
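
The grouping error discussed in the description can be illustrated with a minimal sketch. It is not the thesis's code: the function names (sqrt_rule_classes, build_classes, grouped_mean) are hypothetical, equal-width classes over the data range are an assumption, and only the square root rule is used for the number of classes because the abstract does not state the proposed class width rule. Random selection is left out because the abstract does not define it precisely. The sketch compares the grouped mean computed with the midpoint, the median, and the midrange of each class against the mean of the raw data.

```python
import math
import random
import statistics


def sqrt_rule_classes(n):
    """Square root rule: number of classes k = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))


def build_classes(data, k):
    """Split data into k equal-width classes spanning [min, max] (assumed layout)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    edges = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    classes = [[] for _ in range(k)]
    for x in data:
        # Constant data gives width 0, so everything goes in the first class.
        # The maximum sits on the last upper edge, so clamp it to class k - 1.
        idx = 0 if width == 0 else min(int((x - lo) / width), k - 1)
        classes[idx].append(x)
    return edges, classes


def grouped_mean(edges, classes, representative):
    """Frequency-weighted mean using one representative value per class."""
    total, n = 0.0, 0
    for (left, right), obs in zip(edges, classes):
        if not obs:
            continue  # empty classes carry no weight
        if representative == "midpoint":
            value = (left + right) / 2  # class midpoint, ignores the data
        elif representative == "median":
            value = statistics.median(obs)
        elif representative == "midrange":
            value = (min(obs) + max(obs)) / 2
        else:
            raise ValueError(f"unknown representative: {representative}")
        total += value * len(obs)
        n += len(obs)
    return total / n


# Illustrative run on simulated exponential data (seed chosen arbitrarily).
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(500)]
k = sqrt_rule_classes(len(sample))
edges, classes = build_classes(sample, k)
true_mean = statistics.mean(sample)
for rep in ("midpoint", "median", "midrange"):
    err = abs(grouped_mean(edges, classes, rep) - true_mean)
    print(f"{rep:>8}: absolute error = {err:.5f}")
```

The printed absolute errors show how far each grouped mean falls from the raw mean for this one sample. A single run does not reproduce the RMSE-based simulation comparison reported in the thesis.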

Similar Items