Vision Language Models for Instance Segmentation and Tracking