Learning Visual-Audio Representations for Voice-Controlled Robots