Text Cleaning and Normalizing

Link to Our GitHub

Text Cleaning

Like last year’s group, we began cleaning the text with a Bag-of-Words (BOW) model. The Bag-of-Words model ignores sentence structure, viewing each text as simply a bag of words. This means that all punctuation was removed from the text. Removing it makes our job easier because punctuation during this period differs significantly from current English grammar rules. Lowercasing all of the text also simplifies the document, making it easier to process with the spelling normalization software.
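The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not our exact project code: the function names `clean_text` and `bag_of_words` are hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption made here so that words joined by punctuation do not run together.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase the text and strip punctuation, leaving space-separated words."""
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Underscores count as word characters in \w, so handle them separately.
    text = text.replace("_", " ")
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def bag_of_words(text):
    """Count word occurrences, ignoring word order and sentence structure."""
    return Counter(clean_text(text).split())


if __name__ == "__main__":
    sample = "Hee, being Corporall; fled!"
    print(clean_text(sample))
    print(bag_of_words("The cat and the dog"))
```

Note that this approach also splits contractions and possessives at the apostrophe, which may or may not be desirable depending on the texts.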

Text Normalizing

Unlike last year’s group, we did not rely on the brute-force method alone for spelling normalization. We were able to gain access to VARD, which combines several of the most popular normalization techniques in one tool. It replaces known variant spellings (the brute-force method), uses phonetic matching to match variants to known words, and calculates the Levenshtein distance (the minimum number of single-character insertions, deletions, or substitutions needed to turn one spelling into another) for suggested replacements. The program then chooses the best suggestion from among these candidates.
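To make the Levenshtein distance concrete, here is a small Python sketch of the standard dynamic-programming calculation, followed by a helper that ranks candidate replacements by distance alone. It is not VARD's implementation: VARD combines several methods when scoring suggestions, and the names `levenshtein` and `rank_candidates`, as well as the candidate list, are hypothetical and chosen only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(variant, candidates):
    """Sort candidate modern spellings by edit distance from the variant."""
    return sorted(candidates, key=lambda word: levenshtein(variant, word))


if __name__ == "__main__":
    print(levenshtein("corporall", "corporal"))
    print(rank_candidates("corporall", ["temporal", "corporeal", "corporal"]))
```

Ranking by edit distance alone treats every edit as equally likely, which is one reason a tool such as VARD also draws on known-variant lists and phonetic matching.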

Link to VARD Website

VARD Decision Making

In this screenshot, you can see how VARD weighs the different replacement options for the variant spelling "corporall."

List of VARD's changes

When using VARD's GUI, you can see all the words that VARD normalizes.

Issues with VARD

However, VARD has a few drawbacks. While viewing our text in VARD, we noticed that unidentifiable text was turned into a series of question marks. When we viewed the text directly in our Python code, we found that VARD had converted vertical lines in the text to question marks. Our text also contained other unusual characters that VARD cannot handle. We had to go through the text manually to find these characters, and we are not entirely sure that we caught them all. Once we found them, we used our text cleaning code to replace each one with the appropriate letter or space.

Additionally, while VARD usually normalized well during our test runs, once we started feeding large chunks of our data into it we realized that we had to lower its confidence threshold to obtain a reasonable level of accuracy. We decided on a 40% confidence threshold, but in hindsight, we probably could have gone lower.
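The search for unsupported characters can be partly automated. The sketch below is one possible approach rather than the code we actually ran: it counts every character that is not an ASCII letter, digit, or whitespace so that the results can be reviewed by hand, then applies a replacement table. The function names and the `REPLACEMENTS` mapping are hypothetical; in particular, mapping the vertical line to a space is an assumption, since the appropriate replacement depends on the source text.

```python
from collections import Counter

# Hypothetical mapping, for illustration only. The right replacement
# for each character has to be decided by inspecting the texts.
REPLACEMENTS = {"|": " "}


def find_unusual_characters(text):
    """Count every character that is not an ASCII letter, digit, or whitespace."""
    return Counter(
        ch for ch in text
        if not (ch.isascii() and (ch.isalnum() or ch.isspace()))
    )


def replace_characters(text, replacements=REPLACEMENTS):
    """Replace each character listed in the mapping with its substitute."""
    return text.translate(str.maketrans(replacements))


if __name__ == "__main__":
    sample = "the corporall | of the gard"
    # Review this report by hand before adding entries to REPLACEMENTS.
    print(find_unusual_characters(sample).most_common())
    print(replace_characters(sample))
```

Running the report over the whole corpus before passing it to VARD gives a complete inventory of characters to review, which reduces the risk of missing some of them in a manual search.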

