The pre-processed dataset used for the SVM was derived by transforming the anxiety target variable into three higher-level severity categories: low, moderate, and high. A new target column was introduced in which values from 1–3 were assigned as 0 (low severity), values from 4–7 as 1 (moderate severity), and values from 8–10 as 2 (high severity). Categorical features such as medication, smoking, and gender were label-encoded into numerical values, and irrelevant columns such as ID were dropped. The features were then standardized using a StandardScaler so that all input variables were on the same scale, which is necessary for SVMs because they rely on distance-based computations. The resulting dataset was entirely numeric and ready for modeling, and it was split into disjoint training and test sets to evaluate the SVM model fairly.
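The steps above can be sketched as follows. This is a minimal illustration, not the original notebook: the column names (anxiety_score, gender, smoking, medication, sleep_hours) and the toy values are assumptions standing in for the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical rows mirroring the columns described in the text.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "anxiety_score": [2, 5, 9, 7],           # original 1-10 target
    "gender": ["F", "M", "F", "M"],
    "smoking": ["yes", "no", "no", "yes"],
    "medication": ["none", "ssri", "none", "ssri"],
    "sleep_hours": [7.5, 6.0, 4.5, 5.0],
})

# Bin the 1-10 anxiety score into three severity classes.
def to_severity(score):
    if score <= 3:
        return 0   # low severity
    elif score <= 7:
        return 1   # moderate severity
    return 2       # high severity

df["severity"] = df["anxiety_score"].apply(to_severity)

# Label-encode the categorical features into integers.
for col in ["gender", "smoking", "medication"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Drop the irrelevant ID column and the raw score; keep the new target.
X = df.drop(columns=["ID", "anxiety_score", "severity"])
y = df["severity"]

# Standardize so every feature is on the same scale for the SVM.
X_scaled = StandardScaler().fit_transform(X)
print(list(y))  # [0, 1, 2, 1]
```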
How the Train-Test Split Was Created
To evaluate the performance of the SVM, the dataset was divided into two parts: a training set and a testing set. This was done using the train_test_split() function from the scikit-learn library, with 80% of the data allocated for training and the remaining 20% for testing. The training set is used to teach the model patterns within the data, while the testing set is used to measure how well the model performs on new, unseen data. The split was applied randomly to ensure that both sets represent the overall data distribution, especially for the target variable.
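A minimal sketch of this split is shown below, using a toy array in place of the real features. The stratify argument, which keeps the class proportions similar in both sets, is an assumption consistent with the text's point about representing the target distribution; random_state simply makes the random split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (50 rows, 2 columns) and a 3-class target.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1, 2, 1, 0] * 10)

# 80/20 split; stratify=y preserves class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```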
Is the Train-Test Split Different or the Same?
The train-test split used for most models (Naive Bayes, Decision Tree, Logistic Regression) was 70-30. However, for the Support Vector Machine (SVM) model, we used an 80-20 split to allow more data for training due to its sensitivity to sample size.
Why the Training and Testing Data Must Be Disjoint
It is critical that the training and testing sets are completely disjoint, meaning they do not contain any overlapping rows. A disjoint split ensures that the model is evaluated on data it has never seen before, providing a realistic assessment of its ability to generalize to new inputs. If the same data were present in both sets, the model might simply memorize the answers instead of learning the underlying patterns, leading to misleadingly high accuracy during testing. This phenomenon, known as data leakage, can result in poor real-world performance. Therefore, maintaining a clean separation between training and testing data is essential for building trustworthy and robust models.
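The disjointness property can be verified directly: because train_test_split() partitions rows without replacement, the two index sets should never intersect. The DataFrame below is a stand-in for the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame; the index uniquely identifies each row.
df = pd.DataFrame({"feature": range(20), "target": [0, 1] * 10})

train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)

# No row index may appear in both sets, otherwise we have leakage.
overlap = set(train_df.index) & set(test_df.index)
assert len(overlap) == 0, "data leakage: rows appear in both sets"
print(len(train_df), len(test_df))  # 14 6
```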
Why Can SVMs Only Work on Labeled Numeric Data?
SVMs can only work with labeled numerical data because their core calculations are founded directly on mathematical operations such as dot products and distances between feature vectors. These operations require numerical input: if a feature is categorical or otherwise non-numerical, the SVM cannot compute meaningful distances or margins between points. Therefore, before SVM training, categorical features must be encoded numerically (e.g., through label encoding or one-hot encoding), and the target variable must provide class labels to direct the supervised learning process. Without numerical input and labeled data, an SVM cannot construct a good separating hyperplane.
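This requirement can be demonstrated with a small sketch: scikit-learn's SVC raises an error on string-valued features, so the categorical column must be encoded before fitting. The dataset and column names here are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# Small hypothetical dataset with one categorical feature.
df = pd.DataFrame({
    "smoking": ["yes", "no", "no", "yes", "no", "yes"],
    "sleep_hours": [4.0, 8.0, 7.5, 5.0, 8.5, 4.5],
    "severity": [2, 0, 0, 1, 0, 2],   # labeled target classes
})

# SVC cannot consume the raw strings in "smoking"; encode them first.
df["smoking"] = LabelEncoder().fit_transform(df["smoking"])

# Standardize so the dot-product/distance computations are well scaled.
X = StandardScaler().fit_transform(df[["smoking", "sleep_hours"]])
y = df["severity"]

# The class labels in y direct the supervised fit.
clf = SVC(kernel="rbf").fit(X, y)
preds = clf.predict(X)
print(preds)
```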
Code & Results