Many applications attempt to fix grammar and spelling mistakes; however, most of them miss common errors and work mainly for English. If an application could find errors effectively and work with multiple languages, people all over the world learning another language would have a free tool to proofread and correct their texts, helping them improve their knowledge of that language.
The purpose of this project is to create an unsupervised machine learning system that is trained on grammatically correct texts and then used to identify and correct the grammar and spelling mistakes in various sentences.
This project was coded in Python and has three main components. The first is a neural network built with scikit-learn. The second is word2vec, which converts words to vector representations, making them easier to work with. The third is beam search, which applies the trained neural network to new sentences to identify and correct mistakes. It does this by computing a score for each sentence that represents the sentence's likelihood. If the score is low, the sentence is likely to be incorrect, so the beam search tries to repair it. If a word replacement improves the score, the new word is used; otherwise, the sentence remains unchanged. To prevent beam search from changing sentences unnecessarily, Levenshtein distance is used to measure how many letters must be changed to improve the score; if too many are required, the sentence is not changed.
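The way beam search and the Levenshtein limit work together can be illustrated with a minimal Python sketch. This is not the project's actual code: the names `repair`, `toy_score`, `beam_width`, and `max_edits` are hypothetical, the scorer is a toy stand-in for the trained neural network, and applying the edit limit per word is only one possible reading of the description. The check that skips sentences whose score is already high is omitted for brevity.

```python
from typing import Callable, List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def repair(words: Sequence[str],
           vocabulary: Sequence[str],
           score: Callable[[List[str]], float],
           beam_width: int = 3,      # assumed value, not from the project
           max_edits: int = 2) -> List[str]:  # assumed value, not from the project
    """Beam search over word replacements, keeping the highest-scoring sentences."""
    beam: List[Tuple[float, List[str]]] = [(score(list(words)), list(words))]
    for position in range(len(words)):
        # Existing hypotheses stay in the running, so "no change" is always an option.
        candidates = list(beam)
        for _, hypothesis in beam:
            for candidate in vocabulary:
                if candidate == hypothesis[position]:
                    continue
                # Reject replacements that are too far from the original word.
                if levenshtein(words[position], candidate) > max_edits:
                    continue
                new = hypothesis[:position] + [candidate] + hypothesis[position + 1:]
                candidates.append((score(new), new))
        # Stable sort: on ties, earlier (less edited) hypotheses stay ahead.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1]


# Toy scorer (hypothetical): counts adjacent word pairs seen in "training" text.
KNOWN_PAIRS = {("the", "cat"), ("cat", "sat")}


def toy_score(sentence: List[str]) -> float:
    return float(sum((a, b) in KNOWN_PAIRS for a, b in zip(sentence, sentence[1:])))


VOCABULARY = ["the", "cat", "sat", "category"]
fixed = repair(["the", "cst", "sat"], VOCABULARY, toy_score)      # ['the', 'cat', 'sat']
unchanged = repair(["the", "cat", "sat"], VOCABULARY, toy_score)  # ['the', 'cat', 'sat']
```

In this toy run, "cst" is replaced by "cat" because the change raises the score and needs only one letter edit, while the already correct sentence is returned unchanged because no replacement scores higher.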
In the experiments, the program is trained and then given a set of erroneous sentences. The test result is the percentage of sentences it processed correctly, meaning that the error is fixed and the meaning of the sentence is preserved. In my main experiment, the program was trained on a large collection of texts and tested on unrelated erroneous sentences. It corrected about 70% of the 270 sentences successfully. In the remaining sentences, it sometimes corrects the error but also unnecessarily changes words that were correct, and sometimes corrects the sentence but changes its original meaning. I also trained my neural network on Russian novels and tested it on Russian sentences with errors. The accuracy for this test was 28%; however, the training set was much smaller. One major result of my project was that my program performed on par with well-known grammar checkers such as Google Drive and Grammarly. I compared the accuracy of each application by reading through all test sentences and manually checking correctness.
My results are far from perfect, and I believe that the simple neural network approach to this problem has reached its limits. Improving the accuracy further would take a more complex, or perhaps a completely different, approach, because the main factor my program relies on is probability, which I believe is insufficient for accurate language processing.