Feature Selection
Feature selection is the process of choosing the most relevant features or variables from a dataset that contribute the most to predicting the target variable. By removing irrelevant or redundant features, it improves model performance, reduces overfitting, and speeds up training time, leading to more accurate and efficient machine learning models.
Since the technical-skills feature shows the highest correlation with the job outcome, it will be used as the input for matching candidates to jobs based on their skills.
The 'job' column will be used as the target column for prediction.
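A minimal sketch of this selection step, assuming the data is held in a pandas DataFrame; the toy rows and the column name "technical_skills" are illustrative assumptions, while "job" is the target column named above:

```python
import pandas as pd

# Hypothetical candidate dataset; the rows and the "technical_skills"
# column name are placeholders for illustration.
df = pd.DataFrame({
    "technical_skills": ["python sql", "java spring", "excel tableau"],
    "job": ["Data Analyst", "Backend Developer", "Business Analyst"],
})

# Select the most relevant feature as input, and 'job' as the target.
X = df[["technical_skills"]]  # input feature for matching candidates
y = df["job"]                 # target column for prediction
```

Keeping X as a DataFrame (double brackets) rather than a Series preserves the 2-D shape that most scikit-learn estimators expect.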
Training and Testing
Train-test split is a method used to evaluate the performance of a machine learning model by dividing the dataset into two parts: a training set, used to train the model, and a test set, used to assess its accuracy on unseen data. This helps ensure that the model generalizes well and avoids overfitting by testing it on data it hasn’t been trained on.
Typically, a common split ratio is 80% for training and 20% for testing.
X_train: This is the portion of the feature data used to train the model.
X_test: This is the remaining feature data used to evaluate the model’s performance on unseen data after training.
y_train: This is the portion of the target variable (labels) corresponding to X_train, used during the training phase.
y_test: This is the portion of the target variable corresponding to X_test, used to test and evaluate the model’s predictions.
random_state: Controls the shuffling applied to the data before the split. Fixing it to a constant value ensures you get the same training and test split every time you run the code, which is important for reproducible, consistent evaluation.
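The split described above can be sketched with scikit-learn's train_test_split; the toy data here stands in for the candidate dataset, and 42 is just a conventional choice of seed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in data; real values come from the candidate dataset.
df = pd.DataFrame({
    "technical_skills": [f"skill_{i}" for i in range(10)],
    "job": [f"job_{i % 2}" for i in range(10)],
})
X = df[["technical_skills"]]  # input feature
y = df["job"]                 # target column

# 80% training / 20% testing; random_state fixes the shuffle so the
# same rows land in the same split on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With 10 rows and test_size=0.2, this yields 8 training rows and 2 test rows, matching the 80/20 ratio mentioned above.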