Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA