https://developers.google.com/machine-learning/crash-course/framing/
https://developers.google.com/machine-learning/crash-course/training-and-test-sets/
https://developers.google.com/machine-learning/crash-course/validation/
A test set should be representative of the data set as a whole. In other words, don't pick a test set with different characteristics than the training set.
Never train on test data
Experiments on decision forests
https://developers.google.com/machine-learning/decision-forests/practice?hl=en
import numpy as np

np.random.seed(1)

# pandas_dataset is the DataFrame loaded earlier in the tutorial.
# Use ~10% of the examples as the testing set
# and the remaining ~90% of the examples as the training set.
test_indices = np.random.rand(len(pandas_dataset)) < 0.1
pandas_train_dataset = pandas_dataset[~test_indices]
pandas_test_dataset = pandas_dataset[test_indices]
Colab shows that the root condition evaluated 277 examples. However, you might remember that the training dataset contained 309 examples. The remaining 32 examples were used for validation.
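The 277/32 split corresponds to holding out roughly 10% of the 309 training examples for validation. A minimal sketch of that second split, reusing the same random-mask approach as above (the DataFrame here is a synthetic stand-in, not the tutorial's actual data):

```python
import numpy as np
import pandas as pd

np.random.seed(1)

# Toy stand-in for the 309-example training set from the tutorial.
pandas_train_dataset = pd.DataFrame({"x": np.arange(309)})

# Hold out ~10% of the training examples for validation;
# the remainder is what the model actually trains on.
valid_indices = np.random.rand(len(pandas_train_dataset)) < 0.1
pandas_valid_dataset = pandas_train_dataset[valid_indices]
pandas_core_train_dataset = pandas_train_dataset[~valid_indices]

print(len(pandas_core_train_dataset), len(pandas_valid_dataset))
```

The exact counts depend on the seed; the point is that every example lands in exactly one of the two splits.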
https://developers.google.com/machine-learning/crash-course/classification/accuracy
https://www.comet.com/site/fraud-detection-imbalanced-classification/
https://www.kaggle.com/datasets/ealaxi/paysim1?resource=download
https://www.kaggle.com/code/waleedfaheem/credit-card-fraud-detection-auc-0-9
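The accuracy link and the fraud-detection material above both run into the same issue: with rare positives, raw accuracy is misleading. A minimal sketch with synthetic labels (not the PaySim data) showing a classifier that scores 99% accuracy while catching zero fraud:

```python
# Synthetic imbalanced labels: 990 legitimate (0), 10 fraudulent (1).
y_true = [0] * 990 + [1] * 10

# A useless classifier that always predicts "legitimate".
y_pred = [0] * 1000

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: fraction of actual frauds caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy)  # 0.99 -- looks great
print(recall)    # 0.0  -- catches no fraud at all
```

This is why the fraud notebooks above report AUC, precision, and recall rather than accuracy alone.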
https://towardsdatascience.com/a-deeper-dive-into-the-nsl-kdd-data-set-15c753364657
From now on, KDDTrain+ will be referred to as train and KDDTest+ will be referred to as test.
The data set contains 43 features per record: 41 of the features describe the traffic input itself, and the last two are a label (whether the record is normal or an attack) and a Score (the severity of the traffic input).
https://www.kaggle.com/datasets/hassan06/nslkdd/data
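A minimal sketch of splitting the 43 columns described above into features, label, and score. The DataFrame here is a synthetic stand-in for KDDTrain+; real column names and dtypes would come from the NSL-KDD documentation:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 5 records with 43 columns, matching the layout
# described above (41 traffic features + label + score).
raw = pd.DataFrame(np.zeros((5, 43)))

features = raw.iloc[:, :41]  # the 41 traffic-input features
labels = raw.iloc[:, 41]     # normal vs. attack
scores = raw.iloc[:, 42]     # severity score

print(features.shape)  # (5, 41)
```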