Persian Pre-processor: PrePer

PrePer (Seraji, 2015, Chapter 4, pp. 82-88) is a software program developed in Ruby for editing and cleaning up texts in Persian. The program uses the existing Virastar module for some formatting tasks (Bargi, 2011). The current PrePer handles miscellaneous cases and normalizes texts into a computationally standard script. Via Virastar, PrePer also takes care of occurrences of mixed character encodings. During preprocessing, all letters written in Arabic style with Arabic Unicode characters are converted to Persian style by mapping them to the Persian Unicode encoding. In addition, Arabic and Western digits are all converted to Persian digits. PrePer also converts white space to ZWNJ (zero-width non-joiner) between:

Download

The program is developed by Mojgan Seraji (mojgan.seraji96@gmail.com) and licensed under the GNU General Public License. You need to install RubyGems for Ruby before running PrePer. PrePer can be downloaded below:

Running PrePer

You can run PrePer by typing the following at the command line prompt:

prompt> ruby pre_per.rb input_file.txt > output_file.txt

References

1. Bargi, A. A. 2011. Virastar. https://github.com/aziz/virastar.
2. Seraji, Mojgan. 2013. PrePer: A Pre-processor for Persian. Presented at the Fifth International Conference on Iranian Linguistics (ICIL5). Bamberg, Germany.
3. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.
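
The letter and digit normalization described in the overview can be illustrated with a minimal Ruby sketch. This is not the actual code of PrePer or Virastar: the method name normalize_persian and the constants are hypothetical, and the choice of only two letter pairs (Arabic kaf and Arabic yeh) is an assumption made to keep the example short. The ZWNJ handling is not covered.

```ruby
# Hypothetical sketch, NOT PrePer's or Virastar's real implementation.
# Arabic-style letters and their Persian counterparts, in matching order.
# Assumption: only kaf and yeh are mapped here for brevity.
ARABIC_LETTERS  = "\u0643\u064A" # Arabic kaf, Arabic yeh
PERSIAN_LETTERS = "\u06A9\u06CC" # Persian keheh, Persian yeh

# Digit sets built from their Unicode code point ranges.
WESTERN_DIGITS = "0123456789"
ARABIC_DIGITS  = (0x0660..0x0669).to_a.pack("U*") # Arabic-Indic digits
PERSIAN_DIGITS = (0x06F0..0x06F9).to_a.pack("U*") # Extended Arabic-Indic (Persian) digits

# Maps Arabic-style letters and Arabic/Western digits to Persian characters.
# String#tr replaces each character in the first set with the character
# at the same position in the second set.
def normalize_persian(text)
  text.tr(ARABIC_LETTERS, PERSIAN_LETTERS)
      .tr(WESTERN_DIGITS, PERSIAN_DIGITS)
      .tr(ARABIC_DIGITS, PERSIAN_DIGITS)
end

# Example: Western digits become Persian digits.
puts normalize_persian("2015")
```

A positional character mapping such as this keeps the sketch simple; a fuller normalizer would need a larger mapping table and context-sensitive rules.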