Study of dataset originality

In this study, the minimum ratio of original samples that must remain in a dataset in order to preserve its originality has been proposed.
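The proposed ratio follows from the non-selection probability in resampling: when n samples are drawn with replacement, a given sample is never selected with probability (1 - 1/n)^n, which approaches 1/e as n grows. The short MATLAB sketch below is illustrative only (it is not part of MATLAB Codes.zip) and simply demonstrates this limit numerically:

% Numerical sketch of where the 1 - 1/e ratio comes from (illustrative only).
n = 100;                                  % dataset size used in this study
p_never = (1 - 1/n)^n;                    % probability a sample is never drawn
fprintf('(1 - 1/n)^n for n=100 : %.4f\n', p_never);      % ~0.3660
fprintf('1/e                   : %.4f\n', exp(-1));      % ~0.3679
fprintf('1 - 1/e               : %.4f\n', 1 - exp(-1));  % ~0.6321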

The related documents, MATLAB codes, and outputs are provided below as MATLAB Codes.zip. Anyone can download, run, and examine them publicly.

The dataset contains BMI (Body Mass Index) values computed from the height and weight values of 100 persons.
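BMI is computed as weight in kilograms divided by the square of height in meters. A minimal MATLAB sketch of this computation is given below; the vectors height_m and weight_kg are synthetic stand-ins, not the actual values from Table 1:

% Illustrative computation of BMI = weight (kg) / height (m)^2 for 100 persons.
rng(0);                                 % reproducibility of the synthetic values
height_m  = 1.50 + 0.45*rand(100,1);    % synthetic heights between 1.50 m and 1.95 m
weight_kg = 45   + 55*rand(100,1);      % synthetic weights between 45 kg and 100 kg
bmi = weight_kg ./ (height_m.^2);       % element-wise BMI values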


Table 1 and Table 2 can also be downloaded from the file below, Supplementary files.docx. This file contains:


Table 1: BMI (Body Mass Index) values obtained from the height and weight values of 100 persons


Table 2: 15 different datasets randomly derived from the original BMI dataset. In the first 10 derived datasets, 1/e of the 100 samples were removed and the remaining (1 - 1/e) fraction was left original. In the last 5 datasets, the original data rate is 10%, 20%, 30%, 40%, and 50%, respectively. A sketch of how such derived sets could be generated is given after this description.
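The MATLAB sketch below shows one way such derived datasets could be generated. It is illustrative only and is not taken from MATLAB Codes.zip; in particular, refilling the removed portion by resampling from the kept samples is an assumption based on the abstract's description of resampling:

% Illustrative sketch: keep a given fraction of the original BMI values as-is
% and refill the rest by resampling from the kept part (assumption).
rng(1);                                                      % reproducibility
n = numel(bmi);
keep_ratios = [repmat(1 - exp(-1), 1, 10), 0.10:0.10:0.50];  % 15 derived sets
derived = cell(1, numel(keep_ratios));
for k = 1:numel(keep_ratios)
    idx_keep = randperm(n, round(keep_ratios(k)*n));         % randomly kept originals
    kept     = bmi(idx_keep);
    refill   = kept(randi(numel(kept), n - numel(kept), 1)); % resampled refill
    derived{k} = [kept; refill];
end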

Bar charts and results in Excel can also be downloaded using the link below.

For any questions:

Faruk Bulut

Faculty of Engineering, Arel University, Izmir, Turkey

[e-mail: farukbulut (at) esenyurt(dot) edu (dot) tr]

Abstract

Pattern recognition, data mining, and machine learning disciplines always work with a predefined dataset to create a hypothesis for an artificial decision support system. A dataset might occasionally be damaged for various reasons. It might be subdivided for cross-validation to test the performance of an expert system. Some samples in the dataset might be deleted because they have lost their importance. In addition, some noisy and outlier data need to be removed since they distort the general layout. In such cases, it is important to know what percentage of the samples in a set should remain original in order to both avoid corruption and keep the overall originality. The ratio of missing, deleted, and removed samples in a dataset is a crucial issue for maintaining its integrity. In this study, a theoretical approach has been proposed stating that the integrity and originality of a dataset can be preserved with a certain ratio based on the non-selection probability. This ratio, approximately 63.21%, is derived from the expression (1 − 1/e), where e is the base of the natural logarithm, and it is the minimum ratio of samples that must remain original. In other words, at most a fraction of 1/e of the data might be removed from the set while preserving its originality. The remaining data points in the set are used for resampling. A variety of parametric and nonparametric criteria and tests in statistics, such as the Kolmogorov–Smirnov test, t-tests, Kruskal–Wallis ANOVA, and the Ansari–Bradley test, have been used in the proof of the proposed theory. In the experiments, a synthetic dataset has been damaged many times and compared with its original form in order to observe whether its originality and homogeneity changed. The experiments indicate that the ratio of (1 − 1/e) is the fundamental lower bound and limit for the authenticity and actuality of a dataset.
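As a rough illustration of the kind of comparison described in the abstract, the MATLAB sketch below compares one derived dataset with the original using the listed test families. It reuses the bmi and derived variables from the sketches above; the specific toolbox functions (kstest2, ttest2, kruskalwallis, ansaribradley) are assumed to be suitable two-sample routines and are not necessarily the exact procedure of the study:

% Illustrative comparison of the original and one derived dataset.
x = bmi;            % original BMI values
y = derived{1};     % one derived dataset from the sketch above

[~, p_ks] = kstest2(x, y);            % two-sample Kolmogorov-Smirnov test
[~, p_t]  = ttest2(x, y);             % two-sample t-test
p_kw      = kruskalwallis([x; y], [zeros(size(x)); ones(size(y))], 'off');
[~, p_ab] = ansaribradley(x, y);      % Ansari-Bradley dispersion test

fprintf('KS p=%.3f  t-test p=%.3f  Kruskal-Wallis p=%.3f  Ansari-Bradley p=%.3f\n', ...
        p_ks, p_t, p_kw, p_ab);
% Large p-values suggest the derived set is statistically indistinguishable
% from the original, i.e. its originality is preserved.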