Modified frequency tables for visualisation and analysis of univariate and bivariate data
One way to make sense of data is to organize it into a more meaningful format called a frequency table. The existing continuous univariate frequency table uses the midpoint to represent the magnitude of the observations in each class, which results in an error called grouping error. The use of the midp...
Main Author: | Bappah, Mohammed Mohammed |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | Social sciences - Statistical methods Statistics Multivariate analysis |
Online Access: | http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf |
id | my-upm-ir.92818 |
---|---|
record_format | uketd_dc |
institution | Universiti Putra Malaysia |
collection | PSAS Institutional Repository |
language | English |
advisor | Adam, Mohd Bakri |
topic | Social sciences - Statistical methods Statistics Multivariate analysis |
description |
One way to make sense of data is to organize it into a more meaningful format called
a frequency table. The existing continuous univariate frequency table uses the midpoint
to represent the magnitude of the observations in each class, which results in an
error called grouping error. The midpoint is used on the assumption that the
observations in each class are uniformly distributed and concentrated around the
midpoint, which is not always valid. The most significant parameter in
constructing a continuous frequency table is the number of classes or the class width.
Several rules for choosing the number of classes or the class width have been reported
in the literature; however, none has been proven to be the best in all situations. The
existing discrete frequency tables are simple to construct and easy to understand and
interpret. However, when the number of elements in the data is substantial, the table
can become complicated. The existing non-parametric correlation measure, the Kendall
correlation, becomes laborious to compute when the number of paired continuous
observations is large. Generally, continuous data are measured values such as
amount of rainfall, length
In this research, to address the issue of grouping error, we proposed three statistics
(median, midrange, and random selection) to represent the magnitude of the observations
in each class instead of the midpoint. For choosing the number of classes or the
class width, a new class width rule is proposed. We also proposed new discrete frequency
tables that are constructed by grouping the elements of the data into classes.
Using the bivariate continuous frequency table, a new correlation measure that is
straightforward and free of the normality assumption is developed. To address the
issue of missing data in a univariate continuous frequency table, five different imputation
methods are compared.
The four methods (the midpoint and the three proposed statistics) and the binning rules
are compared simultaneously using the root mean squared error (RMSE), whereas for the
comparison using real data the absolute error is used. The proposed discrete frequency
tables are described using simulated and real data, and the correlation measure based on
the new bivariate continuous table is illustrated using simulations and real data.
The comparison based on the mean, the continuous frequency table's measure of location,
showed that the methods that used the median and the midrange of the observations in each
class performed better than the other methods. In choosing the number of classes,
the proposed class width rule is the best for data simulated from the normal and exponential
distributions. For data simulated from the uniform distribution,
the square root rule performed better than the other rules. The evaluation of the methods
using the frequency table's measures of skewness and kurtosis indicated that
the methods that used the median and the midrange to represent the magnitude of the
observations in each class were again the best. The new discrete frequency tables can be a
better choice, since they can handle datasets with a substantial number of elements
and vividly reveal the significant features of the datasets.
The results also showed that the new measure of correlation is approximately equal
to the Kendall correlation. It can also be used when the data are discrete, and it is the
best alternative when the number of paired observations is large. In handling missing
data, the simulation results showed that the mean imputation method is the best,
while the findings using real data indicated that mean imputation, k-nearest-neighbor
imputation, and multiple imputation by chained equations were the best methods.
Also, the performance of the five imputation methods is independent of the dataset
and the percentage of missingness, and the error increases as the percentage of
missing observations increases. |
format | Thesis |
qualification_level | Doctorate |
author | Bappah, Mohammed Mohammed |
title | Modified frequency tables for visualisation and analysis of univariate and bivariate data |
granting_institution | Universiti Putra Malaysia |
publishDate | 2021 |
url | http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf |
_version_ | 1747813774431092736 |
spelling | my-upm-ir.92818 2022-06-01T07:57:15Z 2021-02 Thesis http://psasir.upm.edu.my/id/eprint/92818/ text en public doctoral |
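
The grouping error described in the abstract can be illustrated with a short sketch. The code below is not taken from the thesis: the function names `grouped_mean` and `rmse`, the equal-width binning, and the small simulation are all assumptions made for illustration. It estimates the mean from a frequency table in which each class is represented by its midpoint, or by the median or midrange of the observations in it, with the number of classes defaulting to the square root rule. The thesis's own class width rule and its random-selection statistic are not specified in this record, so they are not reproduced here.

```python
import math
import random
import statistics


def grouped_mean(data, representative="midpoint", k=None):
    """Estimate the mean from an equal-width frequency table (illustrative only).

    Each class contributes frequency * representative value, where the
    representative is the class midpoint, or the median or midrange of
    the observations that fall in the class.
    """
    n = len(data)
    if k is None:
        k = math.ceil(math.sqrt(n))  # square root rule for the number of classes
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    if width == 0:
        return float(lo)  # all observations are identical
    classes = [[] for _ in range(k)]
    for x in data:
        i = min(int((x - lo) / width), k - 1)  # the maximum goes in the last class
        classes[i].append(x)

    total = 0.0
    for i, obs in enumerate(classes):
        if not obs:
            continue  # empty classes contribute nothing
        if representative == "midpoint":
            rep = lo + (i + 0.5) * width
        elif representative == "median":
            rep = statistics.median(obs)
        elif representative == "midrange":
            rep = (min(obs) + max(obs)) / 2
        else:
            raise ValueError(f"unknown representative: {representative}")
        total += len(obs) * rep
    return total / n


def rmse(estimates, truths):
    """Root mean squared error between paired estimates and true values."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)
    )


if __name__ == "__main__":
    # Hypothetical simulation settings: 200 exponential samples of size 100.
    rng = random.Random(0)
    reps = ("midpoint", "median", "midrange")
    estimates = {r: [] for r in reps}
    truths = []
    for _ in range(200):
        sample = [rng.expovariate(1.0) for _ in range(100)]
        truths.append(statistics.fmean(sample))
        for r in reps:
            estimates[r].append(grouped_mean(sample, r))
    for r in reps:
        print(r, round(rmse(estimates[r], truths), 4))
```

For the small sample `[1, 1, 2, 10]` with two classes, the true mean is 3.5, the midpoint-based estimate is 4.375, the median-based estimate is 3.25, and the midrange-based estimate is 3.625. The midpoint overshoots because the observations in the first class sit near its lower boundary rather than around its centre, which is the situation the abstract describes.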