The goal is to predict the total trip time for travel between sectors from the parameters that determine it. Here, `Trip_time` is the label, and the other columns are quantitative features used by the SVM to predict the travel time for any new record.
For prediction, the dataset is split in a 70:30 ratio: 70% of the data is used to train the model, and the remaining 30% is used to test its accuracy.
Out of the 22 million records, only 1 million are sampled due to computing restrictions, and it is this sample that is split 70:30. For the training data, the label `Trip_time` is available.
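The sampling and 70:30 split might look like the following sketch; the path `trips.csv` is a placeholder, not the actual dataset location:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the full dataset; "trips.csv" is a placeholder path.
trips = pd.read_csv("trips.csv")

# Sample 1 million of the ~22 million records to stay within compute limits.
sample = trips.sample(n=1_000_000, random_state=42)

# 70% for training, 30% for testing.
train_df, test_df = train_test_split(sample, test_size=0.3, random_state=42)
```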
For computation in Python, the labels are separated from the training dataset and passed to the model as a distinct argument during training.
The label `Trip_time` is removed from the testing dataset as well: the model predicts labels for these test records, and the predicted labels can then be compared against the known labels to measure the model's accuracy.
Link to the Dataset.
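Continuing the sketch above, separating the `Trip_time` label from the features might look like this; `train_df` and `test_df` come from the hypothetical split shown earlier:

```python
# Features and label for training: the label column is dropped from the
# features and passed to the model separately.
X_train = train_df.drop(columns=["Trip_time"])
y_train = train_df["Trip_time"]

# The same separation for the testing set, so predictions can later be
# compared against the known labels.
X_test = test_df.drop(columns=["Trip_time"])
y_test = test_df["Trip_time"]
```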
Supervised machine learning methods, such as Support Vector Machines (SVMs), require labeled data because they rely on a learning process that involves mapping input features to known output labels or target values. The labeled data is used during the training phase to build a model that can make predictions on new, unseen data points.
Several reasons why supervised machine learning methods require labeled data:
Ground Truth: Labeled data provides the ground truth or correct answers for the learning algorithm. During the training process, the algorithm learns to associate input features with their corresponding output labels or target values. This allows the algorithm to learn the underlying patterns and relationships in the data, and make accurate predictions on new, unseen data points.
Model Evaluation: Labeled data is used to evaluate the performance of the trained model. The model's predictions can be compared against the actual labels to measure accuracy, precision, recall, and other performance metrics. This evaluation helps in assessing the model's performance and identifying areas for improvement (a code sketch of this comparison follows the summary below).
Supervised Learning: Supervised machine learning methods are based on the concept of supervised learning, where the algorithm learns from labeled examples to make predictions on new data points. The algorithm needs to know the correct answers or labels during the training process to adjust its model parameters and optimize its performance.
Model Interpretability: Labeled data allows for better model interpretability. With labeled data, it is possible to interpret the learned patterns and relationships in the data, understand the model's decision-making process, and explain the predictions made by the model. This is particularly important in domains where explainability and interpretability of the model's predictions are crucial, such as healthcare, finance, and legal domains.
In summary, labeled data is necessary for supervised machine learning methods, including SVMs, as it provides the ground truth for training the model, evaluating the model's performance, and interpreting the model's predictions.
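As a hedged illustration of this training-and-evaluation cycle, the sketch below fits a linear SVR on the hypothetical `X_train`/`y_train` from earlier and scores its predictions against the known test labels. `LinearSVR` is an assumption made here because kernel SVMs scale poorly to hundreds of thousands of rows; and since `Trip_time` is continuous, regression metrics apply rather than accuracy, precision, or recall:

```python
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Fit on the labeled training data (LinearSVR is an assumption, chosen for
# tractability on ~700k rows; the actual model used may differ).
model = LinearSVR(random_state=42)
model.fit(X_train, y_train)

# Predict on the held-out features and compare against the known labels.
predictions = model.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, predictions))
print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
```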
Support Vector Machines (SVMs) are a type of supervised machine learning algorithm designed for classification or regression tasks. SVMs learn from labeled data, which means they require examples with known labels or target values during the training process.
SVMs are primarily used for binary classification tasks, where the goal is to classify data points into one of two classes based on their input features. However, SVMs can also be used for multi-class classification tasks through techniques such as one-vs-rest or one-vs-one approaches. In both cases, the data points need to be labeled with their corresponding class or target values.
SVMs operate on the principle of finding an optimal decision boundary or hyperplane that can separate the data points of different classes with the maximum margin. The training data points are used to define this decision boundary by finding the optimal parameters of the SVM model. Once the SVM model is trained, it can be used to make predictions on new, unseen data points by applying the learned decision boundary.
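A toy example of the maximum-margin idea, separate from the trip-time task above: a linear SVC is fit on two small, made-up clusters, and the support vectors it reports are the points that define the margin:

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up data).
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVC finds the maximum-margin hyperplane between the classes.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the training points that define the margin.
print(clf.support_vectors_)

# New, unseen points are classified by the learned decision boundary.
print(clf.predict([[2, 2], [7, 8]]))
```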
Numeric data is used in SVMs because the algorithm relies on mathematical calculations and optimization techniques to find the optimal decision boundary. The input features of the data points need to be represented as numeric values so that they can be processed mathematically to determine the optimal hyperplane. Additionally, the labels or target values associated with the data points also need to be numeric so that the SVM model can calculate the prediction error and optimize the model parameters accordingly.
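For instance, if the trip data contained a categorical column (the `sector` and `distance_km` fields below are hypothetical; the dataset's own columns are already quantitative), it would need a numeric encoding before an SVM could use it, and continuous features benefit from scaling because SVMs are sensitive to feature magnitudes:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up example rows; "sector" and "distance_km" are hypothetical columns.
df = pd.DataFrame({
    "sector": ["A", "B", "A"],       # categorical -> needs numeric encoding
    "distance_km": [1.2, 5.4, 3.3],  # continuous -> standardized
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["sector"])

# Standardize the continuous column to zero mean and unit variance.
encoded[["distance_km"]] = StandardScaler().fit_transform(encoded[["distance_km"]])
print(encoded)
```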
In summary, SVMs require labeled numeric data because they use mathematical calculations and optimization techniques to find the optimal decision boundary for classification or regression tasks. Numeric data is necessary for representing the input features and labels in a format that can be processed mathematically by the SVM algorithm.
It is important that the training set and testing set are disjoint, meaning they do not share any data points. If the training set and testing set were to overlap, the model could simply memorize the data points in the training set instead of learning the underlying patterns and relationships in the data. This is known as overfitting, where the model becomes too complex and fits the training data too closely, resulting in poor generalization to new, unseen data.
By ensuring that the training set and testing set are disjoint, the model is forced to learn general patterns and relationships in the data that can be applied to new data. This allows for better performance on unseen data and helps to prevent overfitting.
It is also important to note that simply randomly splitting the dataset into training and testing sets may not always be optimal, especially for small datasets. Other techniques such as cross-validation may be used to ensure that the model is evaluated on a more representative sample of the data and to improve the reliability of the performance metrics.
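A sketch of k-fold cross-validation as an alternative to a single 70:30 split, reusing the hypothetical `X_train`/`y_train` and `LinearSVR` from earlier; each of the five folds serves once as a disjoint test set:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVR

# Five-fold cross-validation: the data is split into five parts, and each
# part is held out once while the model trains on the other four.
scores = cross_val_score(
    LinearSVR(random_state=42), X_train, y_train,
    cv=5, scoring="neg_mean_absolute_error",
)
print("MAE per fold:", -scores)
```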