3.2.5 The Machine Learning Process
Step 1. Data preparation - Perform data cleaning procedures such as transforming the data into a structured format and removing missing values and noisy or corrupted observations.
Step 2a. Learning data - Create a learning data set used to train the model.
Step 2b. Testing data - Create a test data set used to evaluate the model performance. Only perform this step in the case of supervised learning.
Step 3. Learning Process Loop - Selection. An algorithm is chosen based on the problem. Depending on the selected algorithm, additional pre-processing steps might be necessary.
Step 4. Learning Process Loop - Evaluation. The selected algorithm's performance is evaluated on the learning data. If the algorithm and the model reach an acceptable performance on the learning data, the solution is then validated on the test data. Otherwise, repeat the learning process with a new proposed model and algorithm.
Step 5. Model evaluation - Test the solution on the test data. Performance on the learning data does not necessarily transfer to the test data. The more complex and fine-tuned the model is, the more likely it is to overfit, which means it fits the learning data closely but cannot perform accurately on unseen data. If overfitting is detected, go back to the learning process loop (Steps 3 and 4).
Step 6. Model implementation - After the model achieves satisfactory performance on test data, implement the model. Implementing the model means performing the necessary tasks to scale the machine learning solution to big data.
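
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data set, the two candidate "algorithms" (a majority-class baseline and a one-feature threshold rule), the 0.9 acceptance level, and all function names are hypothetical and chosen only to make the loop visible. The split uses a fixed stride so the example is reproducible; a real project would normally shuffle the data first. Step 6 (implementation at scale) is outside the scope of the sketch.

```python
# Hypothetical toy data: (feature, label) pairs; label is 1 when feature >= 10.
# Two rows contain missing values so that Step 1 has something to clean.
RAW_ROWS = [(float(x), int(x >= 10)) for x in range(20)] + [(None, 1), (3.5, None)]


def prepare(rows):
    """Step 1: drop observations with a missing feature or label."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def split(rows, stride=4):
    """Steps 2a/2b: every `stride`-th row goes to the test set, the rest to learning."""
    learning = [r for i, r in enumerate(rows) if i % stride != 0]
    test = [r for i, r in enumerate(rows) if i % stride == 0]
    return learning, test


def fit_majority(learning):
    """Candidate 1: always predict the most frequent label in the learning data."""
    labels = [y for _, y in learning]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(learning):
    """Candidate 2: predict 1 above the midpoint of the two class means.

    Assumes both classes appear in the learning data.
    """
    xs0 = [x for x, y in learning if y == 0]
    xs1 = [x for x, y in learning if y == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > cut else 0


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


CANDIDATES = [("majority", fit_majority), ("threshold", fit_threshold)]


def run_process(raw_rows, acceptable=0.9):
    data = prepare(raw_rows)                    # Step 1
    learning, test = split(data)                # Steps 2a and 2b
    for name, fit in CANDIDATES:                # Step 3: select an algorithm
        model = fit(learning)
        learn_acc = accuracy(model, learning)   # Step 4: evaluate on learning data
        if learn_acc >= acceptable:
            test_acc = accuracy(model, test)    # Step 5: evaluate on test data
            return name, learn_acc, test_acc
    return None                                 # no candidate was acceptable


result = run_process(RAW_ROWS)
print(result)
```

A large gap between the learning accuracy and the test accuracy returned by run_process would be the signal, described in Step 5, to return to the learning process loop with a different model or algorithm.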