The goal is to determine which parameters best predict the total fare. Here, `Total_Amount` is the label, and the other columns are quantitative features used to predict the label for any new record.
For prediction, the dataset is split 70:30: 70% of the data is used to train the model, and the remaining 30% is used to test the model's accuracy.
Out of 22 million records, only 1 million are sampled due to computing restrictions. This sample is then split in a 70:30 ratio. For the training data, the label `Total_Amount` is available.
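The sampling and split described above can be sketched as follows. This is a minimal illustration in Python (the article itself works in R), using record indices as a stand-in for the actual rows:

```python
import random

random.seed(42)  # fixed seed so the sample is reproducible

N_TOTAL = 22_000_000   # records in the full dataset
N_SAMPLE = 1_000_000   # records kept due to computing restrictions

# Sample 1 million record indices without replacement.
sampled = random.sample(range(N_TOTAL), N_SAMPLE)

# 70:30 split: first 70% of the shuffled sample for training,
# the remaining 30% for testing.
split = int(0.7 * N_SAMPLE)
train_idx = sampled[:split]
test_idx = sampled[split:]

print(len(train_idx), len(test_idx))  # 700000 300000
```

Because `random.sample` draws without replacement, no record appears twice in the sample, which also guarantees the two splits are disjoint.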
For computation in R, the labels are removed from the training dataset and supplied to the model separately during training.
For the testing dataset, the label `Total_Amount` is removed as well. It is removed because the model is used to predict the labels for this dataset, and the predictions can then be compared against the known labels to measure the model's accuracy.
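The predict-then-compare step can be sketched as below. This is a toy Python example with made-up column names (`Trip_Distance`, `Passenger_Count`) and a hypothetical linear rule standing in for the trained model; the real feature set and model come from the dataset and the R workflow:

```python
# Toy stand-in for the test split: feature columns plus the known label.
test_records = [
    {"Trip_Distance": 2.5, "Passenger_Count": 1, "Total_Amount": 12.0},
    {"Trip_Distance": 5.0, "Passenger_Count": 2, "Total_Amount": 21.5},
    {"Trip_Distance": 1.2, "Passenger_Count": 1, "Total_Amount": 7.8},
]

# Remove the label before prediction; keep a copy for evaluation.
known_labels = [r.pop("Total_Amount") for r in test_records]

def predict(record):
    # Hypothetical model: a simple linear fare rule for illustration only.
    return 4.0 + 3.4 * record["Trip_Distance"]

predictions = [predict(r) for r in test_records]

# Compare predictions with the known labels, e.g. via RMSE.
rmse = (sum((p - y) ** 2 for p, y in zip(predictions, known_labels))
        / len(known_labels)) ** 0.5
print(round(rmse, 2))
```

The key point is that `Total_Amount` is popped from the records before `predict` ever sees them, so the comparison at the end is a fair estimate of accuracy.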
Link to the Dataset.
It is important that the training set and testing set are disjoint, meaning they do not share any data points. If the two sets overlap, the model can simply memorize the shared points instead of learning the underlying patterns in the data, and the measured test accuracy becomes an overly optimistic estimate of real performance. This failure mode is closely related to overfitting, where the model becomes too complex and fits the training data too closely, resulting in poor generalization to new, unseen data.
By ensuring that the training set and testing set are disjoint, the model is forced to learn general patterns and relationships in the data that can be applied to new data. This allows for better performance on unseen data and helps to prevent overfitting.
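Disjointness is cheap to verify programmatically. A minimal Python check (the helper name `check_disjoint` is invented for illustration) might look like:

```python
def check_disjoint(train_idx, test_idx):
    """Raise if any record index appears in both splits (data leakage)."""
    overlap = set(train_idx) & set(test_idx)
    if overlap:
        raise ValueError(f"train/test overlap on {len(overlap)} records")

# A clean 70:30 split of 1000 record indices passes the check.
check_disjoint(range(0, 700), range(700, 1000))
print("splits are disjoint")
```

Running such an assertion right after the split is a simple safeguard against accidental leakage, e.g. from sampling with replacement.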
It is also important to note that simply randomly splitting the dataset into training and testing sets may not always be optimal, especially for small datasets. Other techniques such as cross-validation may be used to ensure that the model is evaluated on a more representative sample of the data and to improve the reliability of the performance metrics.
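As a sketch of the cross-validation idea mentioned above: in k-fold cross-validation, the data is divided into k folds, and each fold serves once as the test set while the rest train the model. A minimal pure-Python version (index-based, for illustration only):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once as test set."""
    fold_size = n // k
    folds = []
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n  # last fold takes the remainder
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        folds.append((train, test))
    return folds

# 5-fold CV over 100 records: each record appears in exactly one test fold.
folds = k_fold_indices(100, 5)
for train, test in folds:
    assert set(train).isdisjoint(test)
print(len(folds), len(folds[0][1]))  # 5 20
```

Averaging the accuracy metric across the k folds gives a more reliable performance estimate than a single 70:30 split, at the cost of training the model k times. In practice one would shuffle the indices before folding so the folds are not ordered runs.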