The Life History of SA English
Abstract:
This project aims to provide the Dictionary Unit of South African English (DSAE) with a toolset that scrapes targeted web pages for instances of key SA English words. Each occurrence is packaged with a contextualizing sentence that shows its usage, together with any available metadata, to form a complete citation. Such citations were originally found and documented manually, but the DSAE now seeks to improve its workflow through automation. The project uses RSS feeds from several local South African news sites to supply appropriate URLs for scraping. To find search words in the news articles, several algorithms were investigated and implemented, including the Damerau–Levenshtein distance, the Soundex algorithm, and the Metaphone algorithm. Because many of the desired words are historical and seldom appear in modern-day news articles, the project struggled to produce a significant number of citations. Additionally, the English-based phonetic algorithms mentioned above were insufficient to capture SA words because of differences in lexical structure.
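As an illustration of the edit-distance matching mentioned above, the following is a minimal sketch and not the project's actual implementation. It assumes the restricted (optimal string alignment) variant of the Damerau–Levenshtein distance, and the function names `osa_distance` and `find_candidates`, the threshold `max_distance`, and the sample headword are hypothetical choices made for this example.

```python
import re


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance, a restricted form of
    Damerau-Levenshtein: insertions, deletions, substitutions and
    transpositions of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ia" <-> "ai".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_candidates(sentence, headwords, max_distance=1):
    """Return (token, headword) pairs whose distance is within the threshold.
    The threshold of 1 is an assumed default, not a value from the project."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    hits = []
    for token in tokens:
        for word in headwords:
            if osa_distance(token, word.lower()) <= max_distance:
                hits.append((token, word))
    return hits


# Illustrative usage with a made-up sentence and a single headword.
print(find_candidates("We had a braia on Sunday", ["braai"]))
```

A low threshold keeps false positives down for short words, but it also means that spelling variants further than one edit from the headword would be missed under this sketch.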