In this task you will "correct" Romanian text that is missing its diacritics (ș, ț, ă, î and â) by restoring them.
You'll be given a dataset of free text that is guaranteed to contain diacritics. You can train your model on this dataset as well as on any other resource you wish. We will evaluate submissions on a hidden test set.
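Since the provided text already contains diacritics, one way to obtain training pairs is to strip them and use the original as the target. The snippet below is a minimal sketch of that idea, not the challenge's official preprocessing; the names `STRIP_MAP`, `strip_diacritics` and `make_pair` are hypothetical, and handling the legacy cedilla forms (ş, ţ) is an assumption about what the data may contain.

```python
import unicodedata

# Romanian diacritics mapped to their base letters.
# The cedilla variants (ş, ţ) are included as an assumption: they are
# legacy encodings of ș and ț that often show up in real-world text.
STRIP_MAP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ș": "s", "ț": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ș": "S", "Ț": "T",
    "ş": "s", "ţ": "t", "Ş": "S", "Ţ": "T",
})


def strip_diacritics(text: str) -> str:
    """Return text with Romanian diacritics replaced by base letters."""
    # NFC composes "letter + combining mark" sequences into single
    # characters, so the one-to-one mapping above can match them.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(STRIP_MAP)


def make_pair(text: str) -> tuple[str, str]:
    """Build one (model input, expected output) training example."""
    return strip_diacritics(text), text


if __name__ == "__main__":
    source, target = make_pair("știință în România")
    print(source)
    print(target)
```

Because each diacritic maps to exactly one base letter, the stripped input and the target have the same length for NFC-normalized text, which makes it possible to treat the task as per-character labelling if you choose that design.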
There are many tools online that correct diacritics, but let's create an open-source package that anybody can plug into their own system and have state-of-the-art accuracy!
Check out the technical details for the Diac Challenge in this Colab notebook!
You will be judged on the performance of the model, the ingenuity of its design, its speed of operation (which covers model size, not only run time), and, last but not least, the readability of your code.
All the code you submit for this hackathon will be made public on the LiRo GitHub repo. After the hackathon ends, the winner's model will be listed in LiRo and the test set will be made public.