GOAL: Learn to deal with overfitting using the PyTorch toolbox.
Learning experience: In this work I continue from Programming 8 (P8), keep learning about neural networks, and focus more on solving overfitting. After this work I understand what overfitting is and how to tune a model on both CPU and GPU at the same time. At the end I try to visualize the model, which will help me understand other people's models more easily. This work is both interesting and useful.
In this continuation we try to deal with overfitting. We first introduce overfitting [2].
Usually a learning algorithm is trained using some set of "training data": exemplary situations for which the desired output is known. The goal is that the algorithm will also perform well on predicting the output when fed "validation data" that was not encountered during its training.
Overfitting is the use of models or procedures that violate Occam's razor, for example by including more adjustable parameters than are ultimately optimal, or by using a more complicated approach than is ultimately optimal. For an example where there are too many adjustable parameters, consider a dataset where training data for y can be adequately predicted by a linear function of two independent variables. Such a function requires only three parameters (the intercept and two slopes). Replacing this simple function with a new, more complex quadratic function, or with a new, more complex linear function on more than two independent variables, carries a risk: Occam's razor implies that any given complex function is a priori less probable than any given simple function. If the new, more complicated function is selected instead of the simple function, and if there was not a large enough gain in training-data fit to offset the complexity increase, then the new complex function "overfits" the data, and the complex overfitted function will likely perform worse than the simpler function on validation data outside the training dataset, even though the complex function performed as well, or perhaps even better, on the training dataset.[11]
When comparing different types of models, complexity cannot be measured solely by counting how many parameters exist in each model; the expressivity of each parameter must be considered as well. For example, it is nontrivial to directly compare the complexity of a neural net (which can track curvilinear relationships) with m parameters to a regression model with n parameters.[11]
Overfitting is especially likely in cases where learning was performed too long or where training examples are rare, causing the learner to adjust to very specific random features of the training data that have no causal relation to the target function. In this process of overfitting, the performance on the training examples still increases while the performance on unseen data becomes worse.
As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It's easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes, but this model will not generalize at all to new data, because those past times will never occur again.
Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from the fact that information from all past experience can be divided into two groups: information that is relevant for the future, and irrelevant information ("noise"). Everything else being equal, the more difficult a criterion is to predict (i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore. A learning algorithm that can reduce the risk of fitting noise is called "robust."
The green line represents the overfitting model, and the black line represents the regularized model. Although the green line fits the training data perfectly, it is adjusted too tightly or precisely and will have a higher error rate on new test data than the black line.
This chapter covers weight decay, early stopping, and dropout. [3]
Weight-Decay
Weight-Decay is actually the L1 / L2 regularization we most often talk about, but why does adding a regularizer cause weight decay?
Let's take L2 regularization as an example. Assuming the regularized loss is L'(w) = L(w) + λ‖w‖², one gradient-descent step with learning rate η is
w ← w − η(∂L/∂w + 2λw) = (1 − 2ηλ)w − η ∂L/∂w
From the first term on the right, we can see that when updating the parameters we first multiply them by (1 − 2ηλ). This shrinks the parameters toward 0 (because 1 − 2ηλ < 1), which is the spirit of weight decay.
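The shrink factor can be checked numerically. The sketch below assumes a penalty of the form λ‖w‖² and plain SGD; the toy loss and the values of eta and lam are made up for illustration. PyTorch's SGD adds weight_decay * w to the gradient, so weight_decay = 2λ should reproduce the update (1 − 2ηλ)w − η ∂L/∂w.

```python
import torch

torch.manual_seed(0)

eta = 0.1    # learning rate (illustrative value)
lam = 0.05   # coefficient of the penalty lam * ||w||^2 (illustrative value)

w = torch.nn.Parameter(torch.randn(5))
x = torch.randn(5)

# torch.optim.SGD adds weight_decay * w to the gradient, so weight_decay = 2 * lam
# matches the gradient of the penalty lam * ||w||^2.
optimizer = torch.optim.SGD([w], lr=eta, weight_decay=2 * lam)

w_before = w.detach().clone()
loss = ((w * x).sum() - 1.0) ** 2   # toy data loss, no penalty term inside
optimizer.zero_grad()
loss.backward()
grad = w.grad.detach().clone()      # gradient of the data loss only
optimizer.step()

# Manual update: shrink the weights first, then take the usual gradient step.
w_manual = (1 - 2 * eta * lam) * w_before - eta * grad

max_diff = (w.detach() - w_manual).abs().max().item()
print("largest difference between optimizer and manual update:", max_diff)
```

The two updates should agree up to floating-point rounding, which is why the regularizer is described as decaying the weights.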
Early Stopping
When talking about overfitting, we often see graphs like this. From such a graph we can see that the best point during training is at d*_vc; if we stop training there, we get an optimal model. But how do we know at which point in time, or after how many iterations, d*_vc will appear? This can be determined using the validation set.
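As a rough illustration of using the validation set to pick the stopping point, the sketch below applies a patience rule to a list of per-epoch validation losses. The function name early_stopping_epoch and the loss values are hypothetical and only meant to show the logic.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # The loss never stalled long enough, so training ran to the end.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: they fall until epoch 3 and then rise again.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print("stopped at epoch", stop_epoch, "best epoch", best_epoch)
```

In practice the model weights from the best epoch are the ones to keep, not the weights at the epoch where training stopped.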
Dropout
This is a regularization method unique to deep learning. During training, a certain percentage of the neurons in each layer are randomly disabled.
Because the neurons are disabled at random, training effectively produces many different neural networks, and these diverse models are finally combined (ensemble) to obtain an average.
But there is a mathematical problem here. If these networks are trained with a dropout rate of 𝑝%, the weights obtained from training must be multiplied by (1−𝑝%) before they are applied to the test data in order to get the averaged result we want. Intuitively, if weights trained with 𝑝% dropout are applied directly to test data without dropout, the activations can differ from those seen during training by a factor of 1/(1−𝑝%).
Therefore, we must take this scaling into account to get reasonable results on the test data.
In PyTorch, the Adam optimizer provides a weight_decay argument. [4]
CLASS torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, *, foreach=None, maximize=False, capturable=False, differentiable=False, fused=None), Implements Adam algorithm.
Parameters
lr (float, Tensor, optional) – learning rate (default: 1e-3). A tensor LR is not yet supported for all our implementations. Please use a float LR if you are not also specifying fused=True or capturable=True.
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
PyTorch Lightning provides early stopping as a callback. [5]
CLASS lightning.pytorch.callbacks.EarlyStopping(monitor, min_delta=0.0, patience=3, verbose=False, mode='min', strict=True, check_finite=True, stopping_threshold=None, divergence_threshold=None, check_on_train_epoch_end=None, log_rank_zero_only=False), Monitor a metric and stop training when it stops improving.
Parameters
monitor : (str) – quantity to be monitored.
min_delta : (float) – minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than or equal to min_delta, will count as no improvement.
verbose : (bool) – verbosity mode.
patience: (int) – number of checks with no improvement after which training will be stopped. Under the default configuration, one check happens after every training epoch. However, the frequency of validation can be modified by setting various parameters on the Trainer, for example check_val_every_n_epoch and val_check_interval.
In PyTorch, the torch.nn module provides dropout. [6]
CLASS torch.nn.Dropout(p=0.5, inplace=False), During training, randomly zeroes some of the elements of the input tensor with probability p.
The zeroed elements are chosen independently for each forward call and are sampled from a Bernoulli distribution.
Each channel will be zeroed out independently on every forward call.
This has proven to be an effective technique for regularization and preventing the co-adaptation of neurons as described in the paper Improving neural networks by preventing co-adaptation of feature detectors .
Furthermore, the outputs are scaled by a factor of 1/(1−p) during training. This means that during evaluation the module simply computes an identity function.
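Parameters
p (float) – probability of an element to be zeroed (default: 0.5)
inplace (bool) – if set to True, the operation is done in-place (default: False)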
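The behaviour described above can be observed directly on a tensor of ones. This is only a small sketch; the tensor size and p=0.5 are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(1000)

dropout.train()          # training mode: elements are zeroed, survivors are rescaled
y_train = dropout(x)

dropout.eval()           # evaluation mode: the layer acts as an identity function
y_eval = dropout(x)

kept = y_train[y_train != 0]
print("values of surviving elements:", kept.unique())      # each equals 1 / (1 - p)
print("mean of training output:", y_train.mean().item())   # stays close to the input mean
print("eval output equals input:", torch.equal(y_eval, x))
```

Because the rescaling already happens during training, no extra weight correction is needed at test time when nn.Dropout is used.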
[1] Chapter 21. Neural Networks, Machine Learning with Python - Theory and Implementation.
[2] Overfitting, Wikipedia.
[3] Regularization 方法 : Weight Decay , Early Stopping and Dropout, HackMD, Allen Tzeng, 20190624.
[4] Adam, © Copyright 2023, PyTorch Contributors.
[5] EARLYSTOPPING, lightning ai.
[6] Dropout. © Copyright 2023, PyTorch Contributors.
Continue with P8
21.8 Visualize Training History
This section will find the “sweet spot” in a neural network’s loss and/or accuracy score.
This code demonstrates a complete workflow of creating, training, and evaluating a simple neural network using PyTorch.
First, it imports the necessary libraries, including PyTorch, Sklearn, and Matplotlib.
Next, it generates a synthetic dataset for binary classification using Sklearn's make_classification function. This dataset contains 1000 samples with 10 features each, and it splits the dataset into training and test sets.
It sets random seeds to ensure reproducibility and converts the data into PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers and Sigmoid for the final output to suit the binary classification problem.
The neural network is then initialized, and the loss function (binary cross-entropy loss) and the optimizer (RMSprop) are defined.
The training dataset is packaged using TensorDataset and DataLoader, setting the batch size to 100 and shuffling the data at each iteration.
During the training process, multiple epochs (8 in this case) are conducted. In each epoch, batches of data are extracted from the training set, forward propagated to compute outputs, loss is calculated, and backpropagation updates the model parameters.
At the end of each epoch, the loss is computed using both the training and test data, and these losses are recorded for subsequent plotting.
Finally, Matplotlib is used to plot the training and test loss as a function of epochs, visualizing the training process of the model.
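The workflow described above might look roughly like the following sketch. The hidden layer width of 16, the 10% test split, and the random_state values are assumptions that the description does not fix, and the plotting step is left out so the block only records the two loss histories.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic binary classification data: 1000 samples with 10 features each.
features, target = make_classification(
    n_samples=1000, n_features=10, n_classes=2, random_state=1
)
x_train_np, x_test_np, y_train_np, y_test_np = train_test_split(
    features, target, test_size=0.1, random_state=1  # split size is an assumption
)
x_train = torch.from_numpy(x_train_np).float()
y_train = torch.from_numpy(y_train_np).float().view(-1, 1)
x_test = torch.from_numpy(x_test_np).float()
y_test = torch.from_numpy(y_test_np).float().view(-1, 1)


class SimpleNeuralNet(nn.Module):
    """Three fully connected layers, ReLU between them, Sigmoid on the output."""

    def __init__(self):
        super().__init__()
        self.sequential = nn.Sequential(
            nn.Linear(10, 16), nn.ReLU(),   # hidden width 16 is an assumption
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.sequential(x)


network = SimpleNeuralNet()
criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(network.parameters())
train_loader = DataLoader(
    TensorDataset(x_train, y_train), batch_size=100, shuffle=True
)

train_losses, test_losses = [], []
for epoch in range(8):
    network.train()
    for data, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(network(data), labels)
        loss.backward()
        optimizer.step()

    # Record the loss on the full training and test sets after each epoch.
    network.eval()
    with torch.no_grad():
        train_losses.append(criterion(network(x_train), y_train).item())
        test_losses.append(criterion(network(x_test), y_test).item())
    print(epoch + 1, train_losses[-1], test_losses[-1])
```

The lists train_losses and test_losses can then be plotted against the epoch number with Matplotlib to look for the point where the two curves start to diverge.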
Discussion
When our neural network is new, it will have poor performance. As the neural network learns on the training data, the model’s error on both the training and test set will tend to decrease. However, at a certain point, a neural network can start “memorizing” the training data and overfit. When this starts happening, the training error may decrease while the test error starts increasing. Therefore, in many cases, there is a “sweet spot” where the test error (which is the error we mainly care about) is at its lowest point. This effect can be seen in the solution, where we visualize the training and test loss at each epoch. Note that the test error is lowest around epoch 6, after which the training loss plateaus while the test loss starts increasing. From this point onward, the model is overfitting.
21.9 Reducing Overfitting with Weight Regularization
This section will reduce overfitting by regularizing the weights of your network.
This code demonstrates how to create, train, and evaluate a simple neural network using PyTorch.
First, the necessary libraries are imported, including PyTorch, NumPy, and Sklearn, for handling data and constructing the neural network.
Then, a synthetic dataset for binary classification is generated using Sklearn's make_classification function. This dataset contains 1000 samples, each with 10 features. The dataset is then split into training and test sets, with the test set comprising 10% of the total data.
Next, random seeds are set to ensure reproducibility, and the data is converted to PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers. The final output layer uses the Sigmoid function, which is suitable for binary classification problems.
The neural network is then initialized, and the loss function (binary cross-entropy loss) and optimizer (Adam) are defined, with the learning rate and weight decay set.
The training dataset is packaged using TensorDataset and DataLoader, setting the batch size to 100 and shuffling the data at each iteration.
During the training process, multiple epochs (100 in this case) are conducted. In each epoch, batches of data are extracted from the training set, forward propagated to compute outputs, loss is calculated, and backpropagation updates the model parameters.
After training, the model is evaluated using the test data. With torch.no_grad() to disable gradient calculation, the forward pass computes the outputs, and the test loss is calculated. Additionally, test accuracy is computed by comparing the network outputs (rounded) with the true labels, and the accuracy is printed.
Finally, the code prints the test loss and test accuracy.
Discussion
One strategy to combat overfitting neural networks is by penalizing the parameters (i.e., weights) of the neural network such that they are driven to be small values, creating a simpler model less prone to overfit. This method is called weight regularization or weight decay. More specifically, in weight regularization a penalty is added to the loss function, such as the L2 norm.
In PyTorch, we can add weight regularization by including weight_decay=1e-5 in the optimizer where regularization happens. In this example, 1e-5 determines how much we penalize higher parameter values. Values greater than 0 indicate L2 regularization in PyTorch.
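A minimal sketch of this setup is shown below. To keep it short it uses randomly generated stand-in data instead of make_classification, and the learning rate of 1e-3 and hidden width of 16 are assumptions; only weight_decay=1e-5 comes from the text above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data (assumption): 1000 samples, 10 features, labels from a random linear rule.
features = torch.randn(1000, 10)
target = (features @ torch.randn(10, 1) > 0).float()
x_train, y_train = features[:900], target[:900]
x_test, y_test = features[900:], target[900:]

network = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)
criterion = nn.BCELoss()

# weight_decay > 0 adds weight_decay * w to every gradient, i.e. an L2 penalty.
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3, weight_decay=1e-5)

train_loader = DataLoader(
    TensorDataset(x_train, y_train), batch_size=100, shuffle=True
)

for epoch in range(100):
    for data, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(network(data), labels)
        loss.backward()
        optimizer.step()

# Evaluate on the held-out data without tracking gradients.
network.eval()
with torch.no_grad():
    output = network(x_test)
    test_loss = criterion(output, y_test).item()
    test_accuracy = (output.round() == y_test).float().mean().item()

print("Test Loss:", test_loss, "\tTest Accuracy:", test_accuracy)
```

Larger weight_decay values penalize the weights more strongly, so it is usually treated as a hyperparameter to tune.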
21.10 Reducing Overfitting with Early Stopping
This section will reduce overfitting by stopping training when your train and test scores diverge.
This code demonstrates how to use PyTorch Lightning to create, train, and apply early stopping to a simple neural network.
First, the necessary libraries are imported, including PyTorch, NumPy, Sklearn, and PyTorch Lightning.
Next, a synthetic dataset for binary classification is generated using Sklearn's make_classification function. This dataset contains 1000 samples, each with 10 features. The dataset is then split into training and test sets, with the test set comprising 10% of the total data.
Random seeds are set to ensure reproducibility, and the data is converted to PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers. The final output layer uses the Sigmoid function, which is suitable for binary classification problems.
Next, a LightningNetwork class is defined, which is a model class for PyTorch Lightning. This class includes a training_step method that defines the training step, calculates outputs and loss, and logs the training loss. It also includes a configure_optimizers method that sets up the Adam optimizer.
The training dataset is packaged using TensorDataset and DataLoader, setting the batch size to 100 and shuffling the data at each iteration.
The neural network is initialized and wrapped in the LightningNetwork.
A PyTorch Lightning Trainer is set up, configuring early stopping (monitoring val_loss and stopping training when the loss does not decrease) and training for up to 1000 epochs.
Finally, the model is trained using the trainer.fit method.
Discussion
As we discussed in Recipe 21.8, typically in the first several training epochs, both the training and test errors will decrease, but at some point the network will start “memorizing” the training data, causing the training error to continue to decrease even while the test error starts increasing. Because of this phenomenon, one of the most common and very effective methods to counter overfitting is to monitor the training process and stop training when the test error starts to increase. This strategy is called early stopping.
In PyTorch, we can implement early stopping as a callback function. Callbacks are functions that can be applied at certain stages of the training process, such as at the end of each epoch. However, PyTorch itself does not define an early stopping class for you, so here we use the popular library lightning (known as PyTorch Lightning) to use an out-of-the-box one. PyTorch Lightning is a high-level library for PyTorch that provides a lot of useful features. In our solution, we included PyTorch Lightning’s EarlyStopping(monitor="val_loss", mode="min", patience=3) to define that we wanted to monitor the test (validation) loss at each epoch, and if the test loss has not improved after three epochs (the default), training is interrupted.
If we did not include the EarlyStopping callback, the model would train for the full 1,000 max epochs without stopping on its own.
21.11 Reducing Overfitting with Dropout
This section will reduce overfitting by adding dropout to the network.
This code demonstrates how to train a simple neural network using PyTorch, and evaluate it using a training and test dataset.
First, the necessary libraries are imported, including PyTorch, NumPy, and Sklearn. Next, a synthetic binary classification dataset is generated using Sklearn's make_classification function. This dataset contains 1000 samples, each with 10 features. The dataset is then split into training and test sets, with the test set comprising 10% of the total data.
Random seeds are set to ensure reproducibility, and the data is converted to PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers. The final output layer uses the Sigmoid function, suitable for binary classification problems. A Dropout layer is added after the second hidden layer to randomly drop 10% of the neurons, helping to prevent overfitting.
Next, the neural network is initialized, and the loss function and optimizer are defined. Binary Cross-Entropy Loss (BCELoss) and the RMSprop optimizer are used here.
The training dataset is packaged using TensorDataset and DataLoader, setting the batch size to 100 and shuffling the data at each iteration.
The training loop is set up to train for 3 epochs. In each epoch, for each batch of data, forward propagation, loss calculation, backpropagation, and weight updates are performed. At the end of each epoch, the current loss is printed.
After training, the model is evaluated on the test data. Using torch.no_grad() to stop gradient computation, the test loss and accuracy are calculated and printed.
Discussion
Dropout is a fairly common method for regularizing smaller neural networks. In dropout, every time a batch of observations is created for training, a proportion of the units in one or more layers is multiplied by zero (i.e., dropped). In this setting, every batch is trained on the same network (e.g., the same parameters), but each batch is confronted by a slightly different version of that network’s architecture.
Dropout is thought to be effective because by constantly and randomly dropping units in each batch, it forces units to learn parameter values able to perform under a wide variety of network architectures. That is, they learn to be robust to disruptions (i.e., noise) in the other hidden units, and this prevents the network from simply memorizing the training data.
It is possible to add dropout to both the hidden and input layers. When an input layer is dropped, its feature value is not introduced into the network for that batch.
In PyTorch, we can implement dropout by adding an nn.Dropout layer into our network architecture. Each nn.Dropout layer will drop a user-defined proportion (a hyperparameter) of the units in the previous layer every batch.
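The sketch below shows one way to place the layer, assuming a hidden width of 16 (not stated in the description) and a dropout rate of 0.1 after the second hidden layer. It also shows why switching between train() and eval() matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class SimpleNeuralNet(nn.Module):
    def __init__(self, dropout_rate=0.1):
        super().__init__()
        self.sequential = nn.Sequential(
            nn.Linear(10, 16), nn.ReLU(),   # hidden width 16 is an assumption
            nn.Linear(16, 16), nn.ReLU(),
            nn.Dropout(dropout_rate),       # drops 10% of the second hidden layer's units
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.sequential(x)


network = SimpleNeuralNet()
x = torch.randn(4, 10)

# In training mode a fresh random mask is drawn on every forward pass,
# so two passes over the same input may give different outputs.
network.train()
train_out_a = network(x)
train_out_b = network(x)

# In evaluation mode dropout is disabled and the output is deterministic.
network.eval()
with torch.no_grad():
    eval_out_a = network(x)
    eval_out_b = network(x)

print("eval outputs identical:", torch.equal(eval_out_a, eval_out_b))
```

Forgetting to call network.eval() before testing leaves dropout active and makes the test loss and accuracy noisy.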
21.12 Saving Model Training Progress
This section addresses the case where a neural network takes a long time to train and you want to save your progress in case the training process is interrupted.
This code demonstrates how to train a simple neural network using PyTorch and save the model at the end of each epoch.
First, the necessary libraries are imported, including PyTorch, NumPy, and Sklearn. Next, a synthetic binary classification dataset is generated using Sklearn's make_classification function. This dataset contains 1000 samples, each with 10 features. The dataset is then split into training and test sets, with the test set comprising 10% of the total data.
Random seeds are set to ensure reproducibility, and the data is converted to PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers. The final output layer uses the Sigmoid function, suitable for binary classification problems. A Dropout layer is added after the second hidden layer to randomly drop 10% of the neurons, helping to prevent overfitting.
Next, the neural network is initialized, and the loss function and optimizer are defined. Binary Cross-Entropy Loss (BCELoss) and the RMSprop optimizer are used here.
The training dataset is packaged using TensorDataset and DataLoader, setting the batch size to 100 and shuffling the data at each iteration.
The training loop is set up to train for 5 epochs. In each epoch, for each batch of data, forward propagation, loss calculation, backpropagation, and weight updates are performed. At the end of each epoch, the current model state, including the epoch number, model parameters, optimizer parameters, and loss value, is saved. The current loss is printed at the end of each epoch.
Discussion
In the real world, it is common for neural networks to train for hours or even days. During that time a lot can go wrong: computers can lose power, servers can crash, or inconsiderate graduate students can close your laptop.
We can use torch.save to alleviate this problem by saving the model after every epoch. Specifically, after every epoch, we save a model to the location model.pt, the second argument to the torch.save function. If we include only a filename (e.g., model.pt), that file will be overwritten with the latest model every epoch.
As you can imagine, we can introduce additional logic to save the model every few epochs, only save a model if the loss goes down, etc. We could even combine this approach with the early stopping approach in PyTorch Lightning to ensure we save a model no matter at what epoch the training ends.
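A compact sketch of saving a checkpoint every epoch and loading it back is given below. The tiny network, the stand-in data, the dictionary key names, and the use of the system temporary directory for model.pt are all assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in data (assumption): 200 samples, 10 features.
features = torch.randn(200, 10)
target = (features.sum(dim=1, keepdim=True) > 0).float()

network = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()
)
criterion = nn.BCELoss()
optimizer = torch.optim.RMSprop(network.parameters())

checkpoint_path = os.path.join(tempfile.gettempdir(), "model.pt")

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(network(features), target)
    loss.backward()
    optimizer.step()

    # Overwrite the same file every epoch with the latest training state.
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": network.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "loss": loss.item(),
        },
        checkpoint_path,
    )
    print("Epoch:", epoch + 1, "\tLoss:", loss.item())

# After an interruption, the state can be restored and training resumed.
checkpoint = torch.load(checkpoint_path)
network.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
next_epoch = checkpoint["epoch"] + 1
print("resume from epoch", next_epoch)
```

Saving the optimizer state together with the model parameters matters for optimizers such as RMSprop, which keep running statistics between steps.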
21.13 Tuning Neural Networks
This section will automatically select the best hyperparameters for your neural network.
This code demonstrates how to use PyTorch and Ray Tune to train a simple neural network and perform hyperparameter tuning to optimize the model.
First, necessary libraries are imported, including PyTorch, NumPy, Sklearn, and Ray Tune. A synthetic binary classification dataset is generated using Sklearn's make_classification function, containing 1000 samples, each with 10 features. The dataset is then split into training and test sets, with the test set comprising 10% of the total data.
Random seeds are set to ensure reproducibility, and the data is converted to PyTorch tensors for subsequent model training.
A simple neural network, SimpleNeuralNet, is defined. This network consists of three fully connected layers, using ReLU as the activation function between layers. The final output layer uses the Sigmoid function, suitable for binary classification problems. The structure of this neural network allows adjusting the sizes of the two hidden layers.
Next, a configuration dictionary config is defined, which contains the ranges for the hidden layer sizes and learning rate hyperparameters. These ranges use Ray Tune's tune.sample_from and tune.loguniform methods to sample randomly.
An ASHAScheduler is defined, which is a hyperparameter tuning scheduler used to manage the training process. It monitors the model's loss and stops some trials early if they perform poorly. CLIReporter is used to display progress during training in the command line.
The train_model function is the core function for training the model. This function uses PyTorch to define the model, loss function, and optimizer, and reports the loss after each epoch. The function's config parameter contains the hyperparameters obtained from the Ray Tune configuration.
Finally, tune.run is used to execute the hyperparameter tuning process. This function takes the training function, resource configuration, hyperparameter configuration, number of samples, scheduler, and progress reporter as parameters. Upon completion, the best trial results are retrieved and printed, including the best configuration and final validation loss.
Discussion
In Recipes 12.1 and 12.2, we covered using scikit-learn’s model selection techniques to identify the best hyperparameters of a scikit-learn model. While in general the scikit-learn approach can also be applied to neural networks, the ray tuning library provides a sophisticated API that allows you to schedule experiments on both CPUs and GPUs.
The hyperparameters of a model are important and should be selected with care. However, running experiments to select hyperparameters can be both cost and time prohibitive. Therefore, automatic hyperparameter tuning of neural networks is not the silver bullet, but it is a useful tool to have in certain circumstances.
In our solution we conducted a search of different parameters for layer sizes and the learning rate of our optimizer. The best_trial.config shows the parameters in our ray tuning configuration that led to the lowest loss and best experiment outcome.
NOTE: If tuning fails with TuneError: ('Trials did not complete', [train_model_3c82f_00000]), replace
tune.report(loss=(loss.item())) with session.report({"loss": loss.item()}).
21.14 Visualizing Neural Networks
This section will quickly visualize a neural network’s architecture.
This code demonstrates how to build and train a simple neural network, and use torchviz to generate a network structure diagram.
Here's the explanation of the code:
First, it imports the necessary libraries, including PyTorch and some tools to handle data.
Then, it uses make_classification to create a binary classification dataset.
The dataset is split into training and testing sets, and random seeds are set to ensure reproducibility of results.
The data is converted into PyTorch tensors, and a simple neural network class SimpleNeuralNet is defined.
The network includes three fully connected layers, using ReLU activation functions and a final Sigmoid activation function for binary classification.
Next, the neural network is initialized, and the loss function and optimizer are defined.
A training data loader is created using TensorDataset and DataLoader, and the model is compiled with PyTorch 2.0's compilation feature (torch.compile).
The network training is performed, with a training loop running for three epochs.
For each batch, it performs forward propagation, calculates the loss, performs backpropagation, and updates the parameters.
Finally, the torchviz make_dot function is used to generate the network structure diagram and save it as a PNG file.
If we open the image that was saved to our machine, we can see the following figure.
Discussion
The torchviz library provides easy utility functions to quickly visualize our neural networks and write them out as images.
working environment :
OS: Windows 11 home
CPU : intel i9-13900k
GPU : Nvidia RTX 4090
Development environment: jupyter notebook.
Python Version : 3.11.0
PyTorch Version :
torch : 2.3.0+cu121
First, we introduce the Olivetti faces dataset.
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. The sklearn.datasets.fetch_olivetti_faces function is the data fetching / caching function that downloads the data archive from AT&T.
As described on the original website:
There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).
Data Set Characteristics:
Classes : 40
Samples total : 400
Dimensionality : 4096
Features : real, between 0 and 1
The image is quantized to 256 grey levels and stored as unsigned 8-bit integers; the loader will convert these to floating point values on the interval [0, 1], which are easier to work with for many algorithms.
The “target” for this database is an integer from 0 to 39 indicating the identity of the person pictured; however, with only 10 examples per class, this relatively small dataset is more interesting from an unsupervised or semi-supervised perspective.
The original dataset consisted of 92 x 112, while the version available here consists of 64x64 images.
When using these images, please give credit to AT&T Laboratories Cambridge. [12]
The figure below shows the faces of the 40 people in the dataset.
Next we plot the first 10 images for the first two people (face id=0 and face id=1) from the Olivetti Faces dataset.
Notebook cell 6 demonstrates building, training, and evaluating a simple neural network model using PyTorch to classify the Olivetti faces dataset.
Import necessary libraries, including matplotlib, fetch_olivetti_faces for loading the Olivetti faces dataset from Scikit-learn, and PyTorch modules such as torch, torch.nn, torch.optim, torch.utils.data including TensorDataset and DataLoader.
Check for the availability of GPU and set the device to CUDA (if GPU available) or CPU.
Load the Olivetti faces dataset and split it into training and testing sets using Scikit-learn's train_test_split function.
Convert the data into PyTorch tensors and move them to the defined device (GPU or CPU).
Create training and testing data loaders using TensorDataset and DataLoader to iterate over batches of data during training.
Define a simple neural network model OlivettiClassifier consisting of three fully connected layers (also known as dense layers) with ReLU activation function.
Instantiate the model and move it to the device.
Define the loss function (here using cross-entropy loss) and optimizer (using the Adam optimizer).
Train the model. In each training epoch, pass the training data through the model for forward propagation, compute the loss, perform backpropagation, and update the parameters while calculating and storing the loss value for each epoch.
Evaluate the model's performance on the testing set. Use torch.no_grad() context manager to disable gradient computation, then make predictions on the testing data through the model, and calculate the accuracy of the model.
Finally, plot the loss curve during training using Matplotlib.
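The steps above can be sketched as follows. To stay runnable without downloading anything, the sketch uses random stand-in tensors with the Olivetti shapes (4096 features, 40 classes) instead of fetch_olivetti_faces, so its loss and accuracy are meaningless. The 320/80 split, the hidden sizes, the batch size, and the three epochs are all assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data (assumption) with the Olivetti shapes: 4096 features, 40 classes.
x_train = torch.rand(320, 4096).to(device)
y_train = torch.randint(0, 40, (320,)).to(device)
x_test = torch.rand(80, 4096).to(device)
y_test = torch.randint(0, 40, (80,)).to(device)
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)


class OlivettiClassifier(nn.Module):
    def __init__(self, hidden1=512, hidden2=128):   # hidden sizes are assumptions
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4096, hidden1), nn.ReLU(),
            nn.Linear(hidden1, hidden2), nn.ReLU(),
            nn.Linear(hidden2, 40),   # raw logits; CrossEntropyLoss applies log-softmax
        )

    def forward(self, x):
        return self.layers(x)


model = OlivettiClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(3):
    model.train()
    running_loss = 0.0
    for data, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(train_loader))

# Evaluate: the predicted class is the index of the largest logit.
model.eval()
with torch.no_grad():
    logits = model(x_test)
    predicted = logits.argmax(dim=1)
    accuracy = (predicted == y_test).float().mean().item()

print("device:", device, "losses:", epoch_losses, "test accuracy:", accuracy)
```

Both the model and every tensor it sees must live on the same device, which is why the data is moved with .to(device) before the loaders are built.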
Note :
TP (True Positive) is the number of positive samples that are correctly predicted as positive.
TN (True Negative) is the number of negative samples that are correctly predicted as negative.
FP (False Positive) is the number of negative samples that are incorrectly predicted as positive.
FN (False Negative) is the number of positive samples that are incorrectly predicted as negative.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The output shows the loss for all 100 epochs; at epoch 100 the loss is 0.0097,
and the test accuracy is 0.9 (90%).
The figure below plots the loss over the whole training run; we can see that it converges.
Here we add what we did not yet do in P8.
First is the visualization from 21.8, which we have already done; as a bonus we also added the training loss. With the original settings there is no overfitting, so we force the model to overfit by increasing the number of epochs to 500. In the figure below, the training loss no longer decreases while the test loss increases, so we believe the model is overfitting.
Then we add weight decay in optim.Adam; the figures below show the values we tried.
weight_decay=0.01
weight_decay=0.001
weight_decay=0.001
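A minimal sketch of how weight decay is passed to optim.Adam is shown here. The small linear model and the learning rate are placeholders (assumptions), and only the weight_decay values come from the experiments above.

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model (assumption); any nn.Module works the same way.
model = nn.Linear(4096, 40)

# One optimizer per weight_decay value tried above.
optimizers = {
    wd: optim.Adam(model.parameters(), lr=1e-3, weight_decay=wd)
    for wd in (0.01, 0.001)
}

for wd, opt in optimizers.items():
    print(wd, opt.param_groups[0]["weight_decay"])
```

In optim.Adam the weight_decay argument acts as an L2 penalty added to the gradient; optim.AdamW applies the decay in a decoupled way instead, which can behave differently.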
The above is the model that was forced to overfit; next we try early stopping.
With EarlyStopping(monitor="val_loss", mode="min", patience=10), training stops at epoch 80, which means val_loss had not improved for ten consecutive epochs. When we change the patience to 100, training stops at epoch 187, so tuning the patience is important.
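To make the role of patience concrete, here is a hand-written sketch of the early stopping rule in plain Python. It is not the EarlyStopping class used above (that call looks like a library callback, which is an assumption on my part); the class name EarlyStopper, the min_delta parameter and the list of validation losses are all hypothetical.

```python
class EarlyStopper:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1      # no improvement this epoch
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

With patience=3 the loop stops at epoch 6, three epochs after the best loss of 0.7 at epoch 3. A larger patience would let training continue through the same plateau, which is why the stopping epoch moves when the patience changes.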
As for dropout, we already tried it in P8.
The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors such as CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor. [13]
Data Set Characteristics:
Classes : 20
Samples total : 18846
Dimensionality : 1
Features : text
Data Considerations
The Cleveland Indians is a major league baseball team based in Cleveland, Ohio, USA. In December 2020, it was reported that “After several months of discussion sparked by the death of George Floyd and a national reckoning over race and colonialism, the Cleveland Indians have decided to change their name.” Team owner Paul Dolan “did make it clear that the team will not make its informal nickname – the Tribe – its new team name.” “It’s not going to be a half-step away from the Indians,” Dolan said. “We will not have a Native American-themed name.”
The code in cell 105 implements a text classification task using the 20 Newsgroups dataset, with PyTorch for training and testing a neural network. The data preprocessing includes converting text data into numerical features, splitting the data into training and test sets, and converting the data into PyTorch tensors; the neural network is then trained and tested.
First, the code checks for the availability of a GPU and sets the appropriate device. Next, it loads the 20 Newsgroups dataset and uses CountVectorizer to convert the text data into numerical features. The number of features is limited to 4096 to simplify the model. The data is then split into training and test sets.
After converting the data into PyTorch tensors, the code creates PyTorch DataLoaders to facilitate batch loading of the data. Each dataset (training and testing) has its own DataLoader.
Next, a simple neural network NewsClassifier is defined. This network has three fully connected layers (fc1, fc2, fc3) with input size 4096 (matching the number of features), hidden layers of 512 and 256 units, and an output layer of 20 units (corresponding to the 20 newsgroup categories). During the forward pass, the data goes through these layers sequentially with ReLU activation functions applied.
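A minimal sketch of a network with the sizes described above might look like this. The layer names and sizes follow the description; leaving the last layer without ReLU (so it returns raw logits for the cross-entropy loss) is my assumption, and the random batch is only there to check the shapes.

```python
import torch
import torch.nn as nn


class NewsClassifier(nn.Module):
    """4096 -> 512 -> 256 -> 20 fully connected network."""

    def __init__(self, in_features=4096, n_classes=20):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, n_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # assumed: no activation on the output layer


model = NewsClassifier()
batch = torch.randn(8, 4096)   # random batch of 8 feature vectors (illustration only)
logits = model(batch)
print(logits.shape)
```

Each of the 8 rows in logits has 20 scores, one per newsgroup category, and argmax over dimension 1 gives the predicted class.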
During training, cross-entropy loss and the Adam optimizer are used. In each epoch, the model performs a forward pass through the training set, computes the loss, performs a backward pass, and updates the weights. The loss value for each epoch is recorded and printed every two epochs.
After training, the model is evaluated on the test set to calculate accuracy. Finally, the loss values are plotted as a curve to show the change in loss during the training process.
We add the test loss to the P8 code and see that the test loss increases. After we raise the number of epochs to 200, the test loss keeps increasing. At first I did not think this was overfitting; it may be another problem that we are still looking into. In any case, I still try the methods learned in this work to see what happens.
We add weight_decay=1e-5; the test loss seems to increase less quickly, but the accuracy does not change much, and it stays the same even with weight_decay=1e-4.
Next we add early stopping. Training did stop early, but the accuracy is still poor.
Dropout was already added in P8, so it is not discussed here.
In the end, I think this dataset may not be a good fit for this neural network; maybe we should try another one, which we will explore in the project.