F1 (F1 Score): The harmonic mean of precision and recall, used to evaluate the balance between false positives and false negatives in binary classification tasks. For \judge, we set the decision threshold to 0.5 for both the toxic judgment and the answering judgment, and the F1 score is calculated from the resulting predicted labels.
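For reference, the standard formula, writing $P$ for precision and $R$ for recall:
\[
\mathrm{F1} = \frac{2 \cdot P \cdot R}{P + R}
\]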
MAE (Mean Absolute Error): The average of the absolute differences between predicted values and actual values, used to measure the accuracy of continuous predictions.
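With $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of samples:
\[
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]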
MSE (Mean Squared Error): The average of the squared differences between predicted values and actual values, penalizing larger errors more heavily than MAE.
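Using the same notation, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of samples:
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]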