This project is written in Python. The most recent version of my code is visible here (Github): https://github.com/michaelinwords/ua-americanspeech
The following libraries / tools / imports are used (or anticipated for use) in the code so far; any that are not part of the standard library need to be installed. There is likely some redundancy in this list:
os: This is used for navigating and manipulating the file system within Python scripts.
re: Regular expressions library, mainly utilised for text preprocessing and normalisation tasks.
PyPDF2: A library for reading and interacting with PDF files.
pandas: A data manipulation and analysis tool for handling Excel files and converting them into dataframes.
numpy: A fundamental package for scientific computing with Python, often abbreviated as np.
TfidfVectorizer (from sklearn.feature_extraction.text): Employed for transforming text data into feature vectors that can be used in machine learning models.
train_test_split (from sklearn.model_selection): A function to easily split datasets into training and testing sets.
StratifiedKFold (from sklearn.model_selection): For creating stratified k-fold splits, a way of splitting data that ensures each fold is representative of the whole.
MultiLabelBinarizer (from sklearn.preprocessing): Used for encoding labels in a multi-label setting, converting category lists into a binary matrix format.
LogisticRegression (from sklearn.linear_model): Includes the logistic regression model for statistical analysis and predictive modeling.
OneVsRestClassifier (from sklearn.multiclass): A strategy for fitting one classifier per class in a multi-label classification problem.
classification_report, f1_score, accuracy_score (from sklearn.metrics): These are metrics for evaluating the performance of classification models.
dump, load (from joblib): Functions for saving and loading machine learning models, vectorisers, and label binarisers, useful for reusing models without retraining.
colored (from termcolor): An optional utility for printing colored text in the terminal, to facilitate reading different aspects of the output.
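To show how the scikit-learn pieces above fit together, here is a minimal, hypothetical sketch of the multi-label pipeline they support. The tiny inline corpus, the category names, and the temporary file path are all invented for illustration; the split utilities (train_test_split, StratifiedKFold) are omitted for brevity because the toy corpus is too small to split meaningfully.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from joblib import dump, load
import tempfile, os

# Toy article texts and their (multi-)label lists -- placeholders, not real data.
texts = [
    "vowel shift in southern dialects",
    "syntax of negation in appalachian english",
    "vowel mergers and phonological change",
    "negation and double modals in speech",
]
labels = [["phonology"], ["syntax"], ["phonology"], ["syntax"]]

# Convert category lists into a binary indicator matrix (one column per category).
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Turn raw text into TF-IDF feature vectors.
vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(texts)

# Fit one logistic-regression classifier per category.
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, y)

# Score on the training data (only sensible here because the corpus is toy-sized).
pred = model.predict(X)
print("micro F1:", f1_score(y, pred, average="micro"))

# Persist and reload the fitted model, as the script does with
# vectoriser.joblib / model.joblib in the project root.
with tempfile.TemporaryDirectory() as tmp:
    dump(model, os.path.join(tmp, "model.joblib"))
    reloaded = load(os.path.join(tmp, "model.joblib"))
    assert (reloaded.predict(X) == pred).all()
```

In a real run, the texts would come from the PDFs-train folder and the labels from the XLSX metadata, and evaluation would use a held-out split rather than the training data.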
In its current state, the code assumes the following file/folder layout. In a future version, most of these locations would be stored in variables so that users could easily point the script at different folder structures:
The main script (ua-americanspeech-LING593-classifier-updated.py) should sit in the root
When joblib saves or loads a model, it will do so from the root, as well (vectoriser.joblib, model.joblib)
In the root, there are the following folders:
PDFs-train: PDFs you intend to use in training
PDFs-predict: once predict mode is implemented, these are the PDFs the script will assign categories to
XLSX: this would contain any Excel/Google Sheets spreadsheets containing article metadata, one article per row; you would need one file listing the PDFs to train on and one listing the PDFs whose categories should be predicted
The PDFs provide the actual article text, and the XLSX files help us find the proper metadata for each article, then connect the two (to construct the Document object)
performance-notes: this folder is just for keeping additional notes on the model's performance over time