Learning Audio-Visual Dynamics Using Scene Graphs