Video Question Answering with Iterative Video-Text Co-Tokenization