SVTS: Scalable Video-To-Speech Synthesis