4 Steps & 4 Considerations
In the course of ERADing, we follow four steps and take four considerations into account:
Step 1: Selecting raw sources of naturally occurring language
Step 2: Filtering and transcribing
Step 3: Normalizing and pre-processing
Step 4: Human translation into English (for parallel corpora only)
Consideration 1: Opting for a high variety and avoiding sub-dialectal or closely related varieties
Consideration 2: Minimal inclusion of French lexical items
Consideration 3: Choosing MSA orthography rules for the written form
Consideration 4: Verifying and extending the scale of the datasets
Flowchart of corpora compilation procedure
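As a rough illustration of how Step 3 and Consideration 2 might fit together in code, the sketch below normalizes already transcribed segments and drops those with a high share of Latin-script tokens. All function names, the specific normalization rules (stripping diacritics and tatweel), the Latin-script proxy for French content, and the 0.2 threshold are assumptions made for this example; the source does not specify them.

```python
import re
import unicodedata

# Assumed rules for illustration only; the actual normalization rules are not given in the source.
TATWEEL = "\u0640"                                 # Arabic elongation character
DIACRITICS = re.compile(r"[\u064B-\u0652]")        # Arabic short-vowel and related marks
LATIN_LETTER = re.compile(r"[A-Za-z\u00C0-\u00FF]")  # crude proxy for French tokens


def normalize(text):
    """Step 3 (assumed rules): strip diacritics and tatweel, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def latin_ratio(text):
    """Share of whitespace-separated tokens that contain a Latin letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if LATIN_LETTER.search(t)) / len(tokens)


def compile_corpus(segments, max_latin_ratio=0.2):
    """Normalize each segment and keep those at or below the assumed Latin-token threshold."""
    corpus = []
    for segment in segments:
        clean = normalize(segment)
        if clean and latin_ratio(clean) <= max_latin_ratio:
            corpus.append(clean)
    return corpus
```

Note that a Latin-script check would miss French words written in Arabic script, so a real filter would likely need a lexicon-based check as well.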