Design: consider initially classifying only the largest category, then introducing smaller categories as more annotated journal data becomes available
Data: annotate more data to leave fewer gaps in the category distribution; clustering could potentially assist with this
Data: further preprocess each article's raw text (for example, remove the journal's header/footer information and potentially author data)
Code: implement Stratified K-Fold and the other previously in-progress techniques/approaches to bolster performance
Code: test different parameters to identify what works best for this task
Code: add strong documentation
Code: remove redundant lines (some were left over from my different attempts)
Code: the other script modes (final-train mode and predict mode) need to be implemented:
Final-train mode: this would involve the usual training steps, but none of the data would be kept out of training (none reserved for testing); this mode would be used to prepare a version of the model to save, which would be used for predictions later
Predict mode: this would involve loading the previously trained model/its weights (using the joblib library), reading in an XLSX of newly presented journals (similar to the current setup, but without training on it), then predicting for those journals and saving the output in a prediction XLSX
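
The two modes described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the column names (text, category, predicted_category), the function names, and the TF-IDF + logistic regression pipeline are hypothetical stand-ins, not the project's actual model or spreadsheet layout. Reading and writing XLSX through pandas also assumes an Excel engine such as openpyxl is installed.

```python
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical column names; adjust to the real spreadsheet layout.
TEXT_COL = "text"
LABEL_COL = "category"
PRED_COL = "predicted_category"


def final_train(df, model_path):
    """Fit on every annotated row (nothing reserved for testing) and save the model."""
    # Stand-in pipeline; swap in whatever model the training mode settles on.
    model = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(df[TEXT_COL], df[LABEL_COL])
    joblib.dump(model, model_path)
    return model


def predict(df, model_path):
    """Load the saved model and return a copy of df with a prediction column."""
    model = joblib.load(model_path)
    out = df.copy()
    out[PRED_COL] = model.predict(out[TEXT_COL])
    return out


def run(mode, input_xlsx, model_path, output_xlsx=None):
    """Dispatch on the script mode; the XLSX paths are supplied by the caller."""
    df = pd.read_excel(input_xlsx)
    if mode == "final-train":
        final_train(df, model_path)
    elif mode == "predict":
        predict(df, model_path).to_excel(output_xlsx, index=False)
    else:
        raise ValueError(f"unknown mode: {mode}")
```

Keeping the DataFrame logic separate from the XLSX reading/writing means both modes can be tried on in-memory data before being wired to real files.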
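
For the Stratified K-Fold and parameter-testing items above, one possible shape is a grid search evaluated on stratified folds, so each fold keeps roughly the same category proportions as the full data set. Again, this is a sketch under assumptions: the pipeline, the parameter grid, the macro-F1 scoring choice, and the function name search_parameters are illustrative, not what the script currently does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline


def search_parameters(texts, labels, n_splits=3, seed=0):
    """Grid-search a stand-in text classifier using stratified folds."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Illustrative grid only; the useful ranges depend on the real data.
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    # Every category needs at least n_splits examples for stratification.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
    search.fit(texts, labels)
    return search
```

After fitting, search.best_params_ and search.best_score_ give the best combination found and its mean cross-validated score. Note that categories with fewer examples than n_splits cannot be stratified, which ties back to the need for more annotated data.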