This notebook introduces web scraping in Python with Selenium, using the collection of course descriptions from online course catalogs as a running example. Note that the notebook will not run in Google Colab because of how Colab interacts with Selenium. You can view the content in Colab, or download the file as an IPython notebook and run it in Jupyter.
This notebook provides a short vignette demonstrating how to use fine-tuned large language models in HuggingFace pipelines for text classification. I use a RoBERTa model that I fine-tuned to classify courses into the National Center for Education Statistics' College Course Map. Here I apply it to efficiently classify a large number of course records that I collected from the Texas Higher Education Coordinating Board's Common Course Numbering system database.
More details about the data and model performance are available in the Classifying Courses at Scale working paper.
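As a rough sketch of the workflow this vignette covers, the snippet below wraps a HuggingFace `text-classification` pipeline in a small helper that labels a list of course descriptions in batches. It assumes the `transformers` library is installed. The model ID `your-username/your-finetuned-roberta` and the helper names `build_classifier` and `classify_courses` are hypothetical placeholders for illustration, not the actual checkpoint or code used in the notebook.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical placeholder: swap in the Hub ID or local path of a fine-tuned checkpoint.
MODEL_NAME = "your-username/your-finetuned-roberta"


def build_classifier(model_name: str = MODEL_NAME):
    """Load a text-classification pipeline for the given checkpoint."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_name)


def classify_courses(
    texts: Sequence[str],
    classifier: Callable[..., List[Dict]],
    batch_size: int = 32,
) -> List[Dict]:
    """Return one record per course with its top predicted label and score."""
    # truncation=True clips descriptions that exceed the model's maximum length.
    predictions = classifier(list(texts), batch_size=batch_size, truncation=True)
    return [
        {"text": text, "label": pred["label"], "score": pred["score"]}
        for text, pred in zip(texts, predictions)
    ]


# Example usage (downloads the model, so it is left commented out):
# classifier = build_classifier()
# results = classify_courses(["Introduction to cell biology and genetics."], classifier)
```

Passing the classifier in as an argument keeps the batching logic separate from model loading, so the same helper works with any callable that returns a list of `{"label", "score"}` dictionaries.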