This lab project focuses on applying machine learning techniques to data analysis. You will work with the following two datasets:
Unknown Species: This dataset contains 500 data points across three species, each characterized by four common features.
Loan Approval: This dataset includes both a training set and a testing set with loan application data, along with their corresponding approval statuses.
The data files are available in the "DS4Everyone-project2" dataset. Ensure you use version "v7".
In this task, you are expected to identify important features from the data, choose a model to train with these features, perform inference using the model, and then visualize the results.
1. Use a "CSV Scan" operator to read the unknown_species.csv file.
2. Use a "Projection" operator to output three columns: feature 1, feature 2, and the target attribute Label.
3. Split the data using the "Split" operator, and set its random seed to 42.
4. Use a "Linear Perceptron" operator to train a model, and check its accuracy in the console.
5. Go back to the "Projection" operator from step 2 and choose the two features that best differentiate the species. Optionally, use a "Scatter Matrix" operator to visualize the data and help you decide which two features to pick. Aim for a minimum testing accuracy of 0.60 (60%).
6. Use a "Sklearn Prediction" operator after the "Linear Perceptron" operator. Pass the output model from "Linear Perceptron" to the "Sklearn Prediction" operator's "model" port.
7. Pass the entire dataset after feature selection to the other port of the "Sklearn Prediction" operator.
8. Use a "Scatterplot" operator to plot the results, using the two chosen features as the X and Y values and the predicted label as the color.
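For readers who want to see what the steps above do outside the workflow tool, here is a rough scikit-learn sketch. It is not the lab's implementation. It assumes that sklearn's Perceptron is a reasonable stand-in for the "Linear Perceptron" operator, uses an 80/20 split (the split ratio is an assumption), and replaces unknown_species.csv with synthetic data from make_blobs, so the accuracy it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in for unknown_species.csv (assumption): 500 points,
# 3 classes, 4 features. Replace with the real data when available.
X_all, y = make_blobs(n_samples=500, centers=3, n_features=4,
                      cluster_std=1.0, random_state=0)

# "Projection": keep two feature columns. Indices 0 and 1 are an
# arbitrary starting choice, like "feature 1" and "feature 2" in step 2.
chosen = [0, 1]
X = X_all[:, chosen]

# "Split": hold out a test set with the random seed fixed at 42.
# test_size=0.2 is an assumed ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# "Linear Perceptron": fit on the training split, score on the test split.
model = Perceptron(random_state=42)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"testing accuracy: {test_accuracy:.2f}")

# "Sklearn Prediction": predict a label for every row of the projected data.
predicted = model.predict(X)

# "Scatterplot": X[:, 0] and X[:, 1] would be the X and Y values,
# and `predicted` would drive the color.
```

Changing the indices in `chosen` and rerunning mirrors step 5: the two features that separate the classes best should give the highest testing accuracy.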
The following is an example of the workflow and the visualization.
Building on Task 1, you will explore an alternative machine-learning model to improve performance. Select one model from the list below and aim for a minimum testing accuracy of 0.85 (85%):
Decision Tree
K-Nearest Neighbor (KNN)
Support Vector Machine (SVM)
Random Forest
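One way to decide which of the listed models to pick is to train each of them on the same split and compare testing accuracy. The sketch below does this with scikit-learn classifiers that are assumed to correspond to the operators named above. The data is synthetic (make_blobs), the hyperparameters are defaults or arbitrary choices, and the 80/20 split is an assumption, so the ranking it prints will not necessarily match the real dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the two selected features and the label (assumption).
X, y = make_blobs(n_samples=500, centers=3, n_features=2,
                  cluster_std=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# One candidate per model in the list; settings are illustrative only.
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Train every candidate on the same training split and record its
# accuracy on the same test split so the numbers are comparable.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

# Print the candidates from highest to lowest testing accuracy.
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```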
In this task, you will use the Loan Approval dataset to perform a binary classification task. Each row in train_loan.csv represents a loan application. The target column, "loan_status", indicates whether the application was approved (1) or rejected (0). Your objective is to build a machine-learning model that predicts the loan status based on the applicant's information.
Use train_loan.csv for feature selection and model training.
Select any machine learning model operator under the "sklearn" category except "Linear Regression", which is not suitable for classification tasks. Also avoid the "Dummy Classifier" operator, which is intended for internal testing.
Apply the techniques you’ve learned to train the best-performing model.
Test your model using test_loan.csv and record its accuracy score. Ignore the ground truth column "loan_status" in the "Sklearn Prediction" operator.
Submit your accuracy score to the competition dashboard.
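To make the scoring step concrete, here is a small plain-Python sketch of keeping "loan_status" out of the model's inputs and then computing accuracy as the fraction of predictions that match the ground truth. The helper names, the four rows, and the one-line decision rule are all made up for illustration. The rule is a placeholder for a trained model, not a recommendation.

```python
def split_features_and_target(rows, target="loan_status"):
    """Separate the target column so it is never used as a feature."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


def accuracy(predicted, actual):
    """Fraction of predictions that equal the ground-truth labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up test rows (illustrative values only, not taken from test_loan.csv).
test_rows = [
    {"loan_grade": 0, "cb_person_default_on_file": 0, "loan_status": 1},
    {"loan_grade": 5, "cb_person_default_on_file": 1, "loan_status": 0},
    {"loan_grade": 2, "cb_person_default_on_file": 0, "loan_status": 1},
    {"loan_grade": 3, "cb_person_default_on_file": 1, "loan_status": 1},
]
features, truth = split_features_and_target(test_rows)

# Placeholder "model": predict approval when there is no default on file.
predictions = [1 if f["cb_person_default_on_file"] == 0 else 0 for f in features]

score = accuracy(predictions, truth)
print(score)  # prints 0.75 (3 of the 4 made-up rows are predicted correctly)
```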
Note: It's better to start with a small amount of data for both training and testing. Once you feel confident about the features and model you selected, proceed with the entire dataset. Otherwise, you may waste time waiting for the training and prediction results.
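The idea in the note can be sketched in plain Python as drawing a reproducible random subset of the rows before running the expensive steps. The function name, the 10% fraction, and the seed are illustrative assumptions. The list of integers stands in for rows read from a CSV file.

```python
import random


def sample_rows(rows, fraction, seed=42):
    """Return a reproducible random subset containing about `fraction` of rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    k = max(1, round(len(rows) * fraction))
    # A dedicated Random instance keeps the sample stable across runs
    # without touching the global random state.
    return random.Random(seed).sample(rows, k)


# Stand-in for the rows of a training file (assumption).
rows = list(range(1000))

# Iterate on features and models with 10% of the data first,
# then switch fraction to 1.0 for the final run.
small = sample_rows(rows, 0.1)
print(len(small))
```

Fixing the seed means two runs with different features or models see exactly the same subset, so a change in accuracy reflects the change you made rather than a different sample.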
Below is an example workflow for this task.
Just FYI, we converted all string columns into numeric labels for easier processing. The following is the mapping for each string column in the data:
Label Mapping for column 'person_home_ownership':
'MORTGAGE' -> 0
'OTHER' -> 1
'OWN' -> 2
'RENT' -> 3
--------------------
Label Mapping for column 'loan_intent':
'DEBTCONSOLIDATION' -> 0
'EDUCATION' -> 1
'HOMEIMPROVEMENT' -> 2
'MEDICAL' -> 3
'PERSONAL' -> 4
'VENTURE' -> 5
--------------------
Label Mapping for column 'loan_grade':
'A' -> 0
'B' -> 1
'C' -> 2
'D' -> 3
'E' -> 4
'F' -> 5
'G' -> 6
--------------------
Label Mapping for column 'cb_person_default_on_file':
'N' -> 0
'Y' -> 1
--------------------
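If you want to read predictions or feature values in their original string form, the mappings above can be written as Python dictionaries and inverted. This is a convenience sketch. The dictionary contents come from the mappings listed above, while the names LABEL_MAPPINGS, encode, and decode are made up for illustration.

```python
# String-to-code mappings, copied from the lists above.
LABEL_MAPPINGS = {
    "person_home_ownership": {"MORTGAGE": 0, "OTHER": 1, "OWN": 2, "RENT": 3},
    "loan_intent": {
        "DEBTCONSOLIDATION": 0, "EDUCATION": 1, "HOMEIMPROVEMENT": 2,
        "MEDICAL": 3, "PERSONAL": 4, "VENTURE": 5,
    },
    "loan_grade": {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6},
    "cb_person_default_on_file": {"N": 0, "Y": 1},
}

# Code-to-string mappings, built by inverting each dictionary.
INVERSE_MAPPINGS = {
    column: {code: label for label, code in mapping.items()}
    for column, mapping in LABEL_MAPPINGS.items()
}


def encode(column, label):
    """Return the numeric code used in the data for a string label."""
    return LABEL_MAPPINGS[column][label]


def decode(column, code):
    """Return the original string label for a numeric code."""
    return INVERSE_MAPPINGS[column][code]


print(decode("loan_grade", 3))       # prints D
print(encode("loan_intent", "MEDICAL"))  # prints 3
```

Note that the codes follow the alphabetical order of the labels, so they carry no meaning on their own for columns such as loan_intent. Only loan_grade has a natural order that happens to match its codes.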