Mozilla Common Voice tasks support the creation of open, high-quality voice datasets used to train and evaluate speech recognition systems. These datasets are published by the Mozilla Foundation and used by researchers, developers, and organizations around the world.
On Effect Alpha, Common Voice tasks focus on natural human speech, clarity, and correctness. The goal is not perfection, but realism. Contributors help create datasets that reflect how real people speak in everyday situations, including different accents, dialects, and speaking styles.
You may encounter three main types of tasks:
Sentence creation
Audio recording
Sentence or audio validation
Each workflow has slightly different expectations, but all follow the same core principles.
Depending on the assignment, you may:
Write natural sentences that will later be recorded.
Record yourself reading provided sentences aloud.
Validate sentences or recordings created by other contributors.
Your role is to help create clear, natural, and usable voice data that accurately represents human speech.
General requirements:
Work in a quiet environment when recording.
Use a functioning microphone.
Be attentive and follow instructions carefully.
Use natural language and realistic speech patterns.
Avoid rushing through tasks. Accuracy and clarity are more important than speed.
Content should reflect how people naturally speak. Avoid overly complex phrasing, unnatural wording, or machine-like language.
Read or write sentences exactly as required.
Do not add, remove, or alter words unless instructed.
Avoid paraphrasing.
Common Voice is designed to include diverse accents and speaking styles.
Accents are not errors.
Dialects are valid.
Natural variations in speech are expected.
When writing sentences:
Use complete, grammatically correct sentences.
Write in a natural conversational style.
Keep phrasing easy to read aloud.
Good examples:
The sun was already setting behind the hills.
She forgot her keys on the kitchen table.
Poor examples:
XJ92 device initialization complete.
!!! This is so cool !!!
Do not include:
Full names of private individuals
Phone numbers, addresses, or emails
URLs or hashtags
Profanity or offensive language
References to violence, hate, or illegal activity
Sentences should remain neutral and appropriate for a global audience.
Before recording:
Find a quiet environment.
Ensure your microphone works correctly.
Speak at a natural pace.
During recording:
Read the sentence exactly as written.
Do not paraphrase or correct grammar.
Speak naturally, not robotically.
If you make a mistake, re-record the clip instead of submitting it.
Avoid:
Background noise.
Whispering or exaggerated pronunciation.
Speaking too quickly or too slowly.
Cutting off the start or end of the recording.
Natural clarity is more important than perfection.
Validation helps maintain dataset quality. A handy "Validator Cheat Sheet" can be found here.
Accept when:
The recording matches the text exactly.
The audio is clear and understandable.
The sentence follows the content guidelines.
Reject when:
Words are missing or changed.
The speaker clearly misreads the sentence.
Audio is distorted, incomplete, or unintelligible.
The sentence violates the content guidelines.
Common Voice encourages linguistic diversity.
Do not reject for:
Accent differences
Regional pronunciation
Natural speaking variation
Only reject when these differences make the recording unintelligible or no longer match the text.
Common mistakes to avoid:
Rejecting recordings because of accent or tone.
Writing overly complex or unnatural sentences.
Speaking too slowly or over-enunciating.
Accepting recordings that clearly misread the text.
Common Voice datasets are used globally to improve speech recognition systems. Following these guidelines helps ensure that:
Voice models become more inclusive.
Data reflects real human speech.
Contributors receive fair rewards.
Open datasets remain reliable and trustworthy.
If something feels unclear:
Reread the task instructions.
Use your best judgment.
Ask for clarification on Discord.
It is always better to ask than to guess!