Modified frequency tables for visualisation and analysis of univariate and bivariate data

Thesis, English, 2021
http://psasir.upm.edu.my/id/eprint/92818/1/FS%202021%2041%20-%20IR.pdf
Summary: One way to make sense of data is to organize it into a more meaningful format called a frequency table. The existing continuous univariate frequency table uses the midpoint to represent the magnitude of the observations in each class, which results in an error called grouping error. The midpoint is used on the assumption that the observations in each class are uniformly distributed and concentrated around the class midpoint, which is not always valid. The most significant parameter in constructing a continuous frequency table is the number of classes or the class width. Several rules for choosing the number of classes or the class width have been reported in the literature; however, none has been proven to be better in all situations. The existing discrete frequency tables are simple to construct and easy to understand and interpret. However, when the number of elements in the data is substantial, the table can become complicated. The existing non-parametric correlation measure, the Kendall correlation, becomes laborious when the number of paired continuous observations is large. In this research, to address the issue of grouping error, three statistics are proposed to represent the magnitude of the observations in each class instead of the midpoint: the median, the midrange, and a randomly selected observation. For choosing the number of classes or the class width, a new class width rule is proposed. New discrete frequency tables are also proposed, constructed by grouping the elements in the data into classes. Using the bivariate continuous frequency table, a new correlation measure that is straightforward and free of the normality assumption is developed. To address the issue of missing data in a univariate continuous frequency table, five different imputation methods are compared. The four methods and the binning rules are compared simultaneously using the root-mean-squared error (RMSE).
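The grouping error described above can be illustrated with a minimal Python sketch. It is not the thesis's code: the function names, the equal-width binning, the simulated exponential sample, and the use of the square root rule for the number of classes are all assumptions made for illustration. The sketch computes the mean from a frequency table three ways, representing each class by its midpoint, by the median of its observations, or by their midrange, and records the absolute error against the mean of the raw data.

```python
import math
import random
import statistics


def grouped_mean(data, k, representative):
    """Mean computed from a k-class equal-width frequency table.

    Each non-empty class contributes frequency * representative value, where
    `representative(values, lower, upper)` picks the value standing in for
    the observations of that class.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    classes = [[] for _ in range(k)]
    for x in data:
        # Clamp the index so the maximum falls into the last class.
        i = min(int((x - lo) / width), k - 1)
        classes[i].append(x)
    total = 0.0
    for i, values in enumerate(classes):
        if values:  # empty classes contribute nothing
            lower, upper = lo + i * width, lo + (i + 1) * width
            total += len(values) * representative(values, lower, upper)
    return total / len(data)


def midpoint(values, lower, upper):
    # Conventional choice: centre of the class interval.
    return (lower + upper) / 2


def class_median(values, lower, upper):
    # Median of the observations that fell in the class.
    return statistics.median(values)


def class_midrange(values, lower, upper):
    # Average of the smallest and largest observation in the class.
    return (min(values) + max(values)) / 2


random.seed(0)  # illustrative data only, not the thesis's simulations
data = [random.expovariate(1.0) for _ in range(500)]
k = math.ceil(math.sqrt(len(data)))  # square root rule (assumed here)
true_mean = statistics.fmean(data)
errors = {
    f.__name__: abs(grouped_mean(data, k, f) - true_mean)
    for f in (midpoint, class_median, class_midrange)
}
```

The `errors` dictionary holds one absolute error per representative. Which choice comes out ahead depends on the sample and the number of classes, so a single run like this only shows the mechanics and should not be read as reproducing the reported comparison.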
For the comparison using real data, the absolute error is used. The proposed discrete frequency tables are described using simulated and real data, and the correlation measure based on the new bivariate continuous table is illustrated using simulations and real data. The comparison using the continuous frequency table's measure of location, the mean, showed that the methods that used the median and the midrange of the observations in each class performed better than the other methods. In choosing the number of classes, the proposed class width rule is the best for data simulated from the normal and exponential distributions, whereas for data simulated from the uniform distribution the square root rule performed better than the other rules. The evaluation of the methods using the frequency table's measures of skewness and kurtosis indicated that the methods that used the median and the midrange to represent the magnitude of the observations in each class were again the best. The new discrete frequency tables can be a better choice, since they can handle datasets with a substantial number of elements and vividly reveal the significant features of the datasets. The results also showed that the new measure of correlation is approximately equal to the Kendall correlation. It can also be used when the data are discrete, and it is the best alternative when the number of paired observations is large. In handling missing data, the simulation results showed that the mean imputation method is the best, while the findings using real data indicated that mean imputation, k-nearest-neighbour imputation, and multiple imputation by chained equations were the best methods. Also, the performance of the five imputation methods is independent of the dataset and the percentage of missingness, and the error increases as the percentage of missing observations increases.
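To show why the Kendall correlation mentioned in the summary becomes laborious for large samples, the following is a minimal sketch of the classical pairwise definition (tau-a). It is not the new measure developed in the thesis, which this record does not specify; the function name `kendall_tau_a` is hypothetical. Every one of the n(n-1)/2 pairs of observations is compared, so the work grows quadratically with the number of paired observations.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    # Compare every pair of paired observations: n * (n - 1) / 2 comparisons.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

For 1,000 paired observations this loop already performs 499,500 comparisons, which is the kind of cost that motivates a table-based alternative.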