Robot Sound Interpretation

People interpret the sounds they hear and interact with the world according to their interpretations. When a human maps a sound to its meaning, the brain forms a conceptual representation of that sound. Can we mimic how humans interpret sound with AI-based methods? We explore whether a robot can interpret a sound command directly for vision-based decision making, rather than first transcribing it into text and symbols. State-of-the-art approaches use automatic speech recognition (ASR) to translate sound into text, and then use language models to process the textual information. In contrast, our model actively learns and builds its own numerical interpretation of sounds. We call this process Robot Sound Interpretation (RSI).
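To make the contrast with the ASR pipeline concrete, the sketch below shows one way a sound command could bypass text entirely: an audio encoder maps a spectrogram to a learned numerical embedding, which a vision-based policy consumes directly. This is a minimal illustration under assumed PyTorch conventions, not the paper's architecture; all module names, layer sizes, and input shapes are hypothetical.

```python
# Hypothetical sketch of RSI-style processing: sound -> learned embedding,
# fused with visual features for decision making (no ASR/text stage).
import torch
import torch.nn as nn

class SoundEncoder(nn.Module):
    """Maps a log-mel spectrogram to a fixed-size numerical interpretation."""
    def __init__(self, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, spec):           # spec: (B, 1, n_mels, time)
        return self.net(spec)          # (B, embed_dim)

class SoundConditionedPolicy(nn.Module):
    """Fuses the sound embedding with visual features to choose an action."""
    def __init__(self, embed_dim=32, n_actions=4):
        super().__init__()
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32 + embed_dim, n_actions)

    def forward(self, image, sound_embed):  # image: (B, 3, H, W)
        fused = torch.cat([self.vision(image), sound_embed], dim=-1)
        return self.head(fused)              # action logits

encoder, policy = SoundEncoder(), SoundConditionedPolicy()
spec = torch.randn(1, 1, 64, 100)   # dummy spectrogram of a spoken command
image = torch.randn(1, 3, 84, 84)   # dummy robot camera frame
action_logits = policy(image, encoder(spec))
```

In this setup the sound embedding is simply another learned input to the policy, so the mapping from sound to behavior can be trained end to end from task reward or supervision, rather than fixed by a transcription step.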