Next Directions

To reach the next level of functionality and utility, the project needs:

- Design: possibly classifying only the largest category at first, then introducing smaller categories (especially once more journal data is annotated and available)
- Data: more annotated data to leave fewer gaps in the category distribution; clustering could potentially assist with this
- Data: further preprocessing of each article's raw text (for example, removing the journal's header/footer information and potentially author data)
- Code: implementation of Stratified K-Fold and the other previously in-progress techniques to bolster performance
- Code: testing with different parameters to identify what works best for this task
- Code: thorough documentation
- Code: removal of redundant lines (some were left over from my earlier attempts)
- Code: implementation of the other script modes (final-train mode and predict mode):
  - Final-train mode: this would involve the usual training steps, but with no data held out for testing; this mode would prepare a version of the model to save and use for predictions later
  - Predict mode: this would involve loading the previously trained model and its weights (using the joblib library), reading in an XLSX file of newly presented journals (similar to the current setup, but without training on it), then predicting for those journals and saving the output to a prediction XLSX file
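
As a rough illustration of how the Stratified K-Fold evaluation and the two unimplemented modes might fit together, here is a minimal sketch. It is not the project's actual code: the TF-IDF plus logistic regression pipeline is only a stand-in for whatever classifier the project uses, and the function names (`build_model`, `evaluate`, `final_train`, `predict`) are hypothetical. Only the use of joblib for saving and loading comes from the plan above.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


def build_model():
    # Assumption: a simple text pipeline standing in for the project's model.
    return make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))


def evaluate(texts, labels, n_splits=3):
    # Stratified K-Fold keeps roughly the same category proportions in each
    # fold, which matters when some categories have very few examples.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return cross_val_score(build_model(), texts, labels, cv=folds)


def final_train(texts, labels, model_path):
    # Final-train mode: fit on ALL annotated data (nothing held out),
    # then save the fitted model for later predictions.
    model = build_model()
    model.fit(texts, labels)
    joblib.dump(model, model_path)
    return model


def predict(texts, model_path):
    # Predict mode: load the saved model and label new articles
    # without any further training.
    model = joblib.load(model_path)
    return list(model.predict(texts))
```

In the real script, the `texts` and `labels` lists would presumably come from the XLSX input (for example via `pandas.read_excel`), and the predictions would be written back out with `DataFrame.to_excel`; those steps are left out here to keep the sketch short.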
