Throughout the course, we will have 7 paper quiz bowls. These games will involve teams of students competing to answer questions about two papers covered in class that day.
The format of the paper quiz bowls will be as follows:
Two groups of students will each give a brief 20-minute overview of their assigned paper (2 papers in total).
Afterward, all students in the class will be split into teams of 3-4 for a paper quiz bowl.
The top-scoring teams will receive small prizes.
Paper quiz bowls will be played with a buzzer system between teams of 3-4 players each. A moderator (me) will read questions to the players, who will try to score points for their team by buzzing first and responding with the correct answer.
The paper quiz bowl will consist of up to 16 questions: 12 base questions of increasing difficulty (4 easy, 4 medium, and 4 hard, with 2 of each per paper), plus up to 4 bonus questions. Each correctly answered hard question will earn a bonus question for the team that answered it.
Each question will be given one minute. If no team answers it within the minute, we will move on to the next question.
Each team can buzz at any point while the question is being read (i.e., potentially interrupting the moderator). Once a team buzzes, it must give an answer within 5 seconds. If that answer is wrong, the other teams can confer for the remainder of the minute and provide their final answers at the end of it.
Conferrals among team members are allowed.
The scoring will be done as follows:
Easy Questions: 2 points.
Medium Questions: 4 points.
Hard Questions: 6 points.
Bonus Questions: A bonus question will be granted for every correctly answered hard question. A correctly answered bonus question doubles the value of the hard question, so the team earns 12 points in total: 6 for the hard question plus 6 for the bonus.
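To make the arithmetic concrete, here is a small illustrative sketch of the scoring (a hypothetical helper, not an official tool; the point values follow the rules above):

```python
# Hypothetical scoring sketch for the paper quiz bowl rules above.
POINTS = {"easy": 2, "medium": 4, "hard": 6}

def team_score(correct):
    """correct maps a difficulty level -> (base questions right, bonuses right).

    Bonuses only exist for hard questions; a correct bonus adds another
    6 points, doubling that hard question's value from 6 to 12.
    """
    total = 0
    for level, (right, bonus_right) in correct.items():
        total += right * POINTS[level]
        if level == "hard":
            total += bonus_right * POINTS["hard"]
    return total

# A perfect run: all 12 base questions and all 4 bonuses correct.
print(team_score({"easy": (4, 0), "medium": (4, 0), "hard": (4, 4)}))  # 72
```

So the maximum a single team can score in one quiz bowl is 72 points.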
For each paper quiz bowl, you will need to submit 2 questions for each of the two papers (i.e., 4 questions in total). You will also need to submit answers to those questions. Questions and answers will be due by 10:00 AM on the day of the class. Late submissions will not be accepted. You should upload your submission through the "Assignments" section on Canvas. Below are some guidelines for coming up with good questions.
The questions must have simple, non-ambiguous, and non-binary answers (i.e., no Yes or No questions, no open-ended questions).
Don't forget to include answers to your submitted questions in your submission!
If your question has multiple correct answers (e.g., List one of the reasons why ...), please provide all possible correct answers in your submission. If it's not possible to provide all correct answers, then choose a different question.
The questions should be specific to the discussed paper and cover the most important themes/topics in the paper.
While some questions might be somewhat generic (e.g., "What datasets did the authors use for pretraining?"), ideally, questions should not be reused from paper to paper without adaptation.
The best questions will be unique and have simple answers that require a deep understanding of the paper.
Try to come up with questions that test a higher-level understanding of the paper rather than questions that require simple memorization of facts (unless you think those facts are important). For example, avoid questions such as "What was the accuracy of this model on Dataset X?", "What was the second best-performing method?", or "How many parameters does model Y have?"
The questions should cover a wide range of categories/topics. Below are some examples that I curated for the TimeSformer paper:
The motivation:
Question #1: What is the main bottleneck of applying Transformers to video?
Answer #1: Quadratic complexity of standard self-attention, which makes it prohibitively costly to apply self-attention on long video sequences.
Question #2: List one reason why Transformers might be advantageous over 2D/3D CNNs in the video domain.
Answer #2: (1) Less restrictive inductive biases, (2) long-range modeling ability, (3) faster training and inference compared to 3D CNNs.
Prior work:
Question #3: Which paper inspired the authors to decompose the video into a sequence of frame-level patches?
Answer #3: The Vision Transformer (ViT) paper, published at ICLR 2021.
Question #4: What family of models dominated video classification before Transformers?
Answer #4: 3D Convolutional Neural Networks.
Technical approach:
Question #5: How is the positional information in time and space encoded in the model?
Answer #5: Using learnable spatiotemporal positional embeddings that are added to the embeddings of each patch.
Question #6: What 3 axes does the proposed "Axial" attention variant use to decompose attention computation?
Answer #6: Time, width, and height.
Implementation details:
Question #7: What spatial resolution does the TimeSformer-HR variant operate on?
Answer #7: 448x448 spatial resolution.
Question #8: How many and what type of GPUs do the authors use to train their model?
Answer #8: 32 V100 GPUs.
Experimental results:
Question #9: Why does the proposed TimeSformer model require large-scale ImageNet pretraining?
Answer #9: Due to optimization difficulties of training a model with a large number of parameters from scratch.
Question #10: Why does the divided space-time attention variant outperform the joint space-time attention variant?
Answer #10: Due to a significantly larger number of learnable parameters (121.4M vs 85.9M).
Other:
Question #11: List one of the main limitations of the proposed model.
Answer #11: (1) requires image-level pretraining, (2) struggles with temporally heavy datasets such as SSv2.
To submit your questions and answers, please use the following LaTeX template. Don't forget to compile the .tex file into a PDF and submit the PDF in your final submission on Canvas.