This project is written in Python. The most recent version of my code is visible here (Github): https://github.com/michaelinwords/ua-americanspeech
The following libraries / tools / imports are used (or anticipated for use) in the code so far; any that are not part of the standard library need to be installed. There is likely some redundancy in this list:
os: This is used for navigating and manipulating the file system within Python scripts.
re: Regular expressions library, mainly utilised for text preprocessing and normalisation tasks.
PyPDF2: A library for reading and interacting with PDF files.
pandas: A data manipulation and analysis tool for handling Excel files and converting them into dataframes.
numpy: A fundamental package for scientific computing with Python, often abbreviated as np.
TfidfVectorizer (from sklearn.feature_extraction.text): Employed for transforming text data into feature vectors that can be used in machine learning models.
train_test_split (from sklearn.model_selection): A function to easily split datasets into training and testing sets.
StratifiedKFold (from sklearn.model_selection): For creating stratified k-fold splits, a way of splitting data that ensures each fold is representative of the whole.
MultiLabelBinarizer (from sklearn.preprocessing): Used for encoding labels in a multi-label setting, converting category lists into a binary matrix format.
LogisticRegression (from sklearn.linear_model): Includes the logistic regression model for statistical analysis and predictive modeling.
OneVsRestClassifier (from sklearn.multiclass): A strategy for fitting one classifier per class in a multi-label classification problem.
classification_report, f1_score, accuracy_score (from sklearn.metrics): These are metrics for evaluating the performance of classification models.
dump, load (from joblib): Functions for saving and loading machine learning models, vectorisers, and label binarisers, useful for reusing models without retraining.
colored (from termcolor): An optional utility for printing colored text in the terminal, to facilitate reading different aspects of the output.
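To show how the scikit-learn pieces above fit together, here is a minimal, hypothetical sketch of the multi-label pipeline they support. The tiny inline corpus, the category names, and the temporary file path are all invented for illustration; the split utilities (train_test_split, StratifiedKFold) are omitted for brevity because the toy corpus is too small to split meaningfully.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from joblib import dump, load
import tempfile, os

# Toy article texts and their (multi-)label lists -- placeholders, not real data.
texts = [
    "vowel shift in southern dialects",
    "syntax of negation in appalachian english",
    "vowel mergers and phonological change",
    "negation and double modals in speech",
]
labels = [["phonology"], ["syntax"], ["phonology"], ["syntax"]]

# Convert category lists into a binary indicator matrix (one column per category).
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Turn raw text into TF-IDF feature vectors.
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(texts)

# Fit one logistic-regression classifier per category.
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, y)

# Score on the training data (only sensible here because the corpus is toy-sized).
pred = model.predict(X)
print("micro F1:", f1_score(y, pred, average="micro"))

# Persist and reload the fitted model, as the script does with
# vectoriser.joblib / model.joblib in the project root.
with tempfile.TemporaryDirectory() as tmp:
    dump(model, os.path.join(tmp, "model.joblib"))
    reloaded = load(os.path.join(tmp, "model.joblib"))
    assert (reloaded.predict(X) == pred).all()
```

In a real run, the texts would come from the PDFs-train folder and the labels from the XLSX metadata, and evaluation would use a held-out split rather than the training data.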
In its current state, the code assumes the following file/folder layout. In a future version, most of these locations would be stored in variables so that users could easily point the script at different folder structures:
The main script (ua-americanspeech-LING593-classifier-updated.py) should sit in the root
When joblib saves or loads a model, it will do so from the root, as well (vectoriser.joblib, model.joblib)
In the root, there are the following folders:
PDFs-train: PDFs you intend to use in training
PDFs-predict: once predict mode is implemented, these are the PDFs the script will assign categories to
XLSX: this would contain any Excel/Google Sheets spreadsheets containing article metadata, one article per row; you would need one file listing the PDFs to train on and one listing the PDFs whose categories should be predicted
The PDFs provide the actual article text, and the XLSX files help us find the proper metadata for each article, then connect the two (to construct the Document object)
performance-notes: this folder is just for keeping additional notes on the model's performance over time