Emergence and Effectiveness of Task Vectors in In-Context Learning: An Encoder-Decoder Perspective
Figure A: TD vs. ICL Performance on POS Tagging. Overall, the TD score is correlated with downstream ICL performance, and these results illustrate that our encoder-decoder framework for understanding ICL extends beyond transformers.
Figure B: TD vs. ICL Performance on Bitwise Arithmetic. As in the POS tagging results on the left, the TD score is correlated with downstream ICL performance, although the relationship is less clear for the bitwise arithmetic task.
Figure C: UMAP Visualization of Mamba-2 8B Representations Across Layers for POS Tagging. Starting from layer 33, the representations become separable, and the TD score peaks at this layer. We therefore analyze this layer's representations in the experiments of Figures A and B.
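For reference, here is a minimal sketch of how such a per-layer visualization can be produced. It assumes the last-token representation of each ICL prompt has already been extracted into an array `hidden_states` of shape (num_layers, num_prompts, hidden_dim) together with per-prompt task labels; the variable names and the chosen layers are illustrative, not taken from our codebase.

```python
import umap
import matplotlib.pyplot as plt

def plot_layer_umap(hidden_states, task_labels, layers=(16, 33, 48)):
    """Project each listed layer's representations to 2-D with UMAP and scatter-plot by task.

    hidden_states: array of shape (num_layers, num_prompts, hidden_dim) -- assumed precomputed.
    task_labels:   array of shape (num_prompts,) with integer task ids.
    """
    fig, axes = plt.subplots(1, len(layers), figsize=(4 * len(layers), 4))
    for ax, layer in zip(axes, layers):
        # Fit a fresh 2-D UMAP embedding on this layer's (num_prompts, hidden_dim) matrix.
        emb = umap.UMAP(n_components=2, random_state=0).fit_transform(hidden_states[layer])
        ax.scatter(emb[:, 0], emb[:, 1], c=task_labels, cmap="tab10", s=8)
        ax.set_title(f"layer {layer}")
    fig.tight_layout()
    return fig
```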
Figure D: Test MSE over training. We replicated the training experiments for a mixture of regression tasks in Kim et al. (https://arxiv.org/pdf/2410.05448) and explored whether our findings hold. We trained the small transformer on a mixture of tasks consisting of linear regression, quadratic regression, sparse linear regression with base=4, and leaky ReLU regression in equal fractions (1/4 each). The transformer architecture and other hyperparameters are kept the same as in the Figure 2 experiment.
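For concreteness, below is a hedged sketch of how such a mixture of regression tasks can be sampled. The dimensions, the leaky ReLU slope of 0.1, and the sparse support (our reading of "base") are illustrative assumptions, not necessarily the exact values used in the experiment.

```python
import numpy as np

# Illustrative "base" for the sparse task: a fixed subset of active coordinates.
# (Our reading of the setup; the exact sparsity pattern in the experiment may differ.)
SPARSE_SUPPORT = np.arange(4)

def sample_task(d=8, rng=None):
    """Draw one regression function from the four families with equal probability."""
    rng = rng if rng is not None else np.random.default_rng()
    family = rng.choice(["linear", "quadratic", "sparse_linear", "leaky_relu"])
    w = rng.standard_normal(d)
    if family == "sparse_linear":
        mask = np.zeros(d)
        mask[SPARSE_SUPPORT] = 1.0
        w = w * mask  # zero out all coordinates outside the sparse support
    def f(x):
        z = x @ w
        if family == "quadratic":
            return z ** 2
        if family == "leaky_relu":
            return np.where(z > 0, z, 0.1 * z)
        return z  # linear and sparse linear
    return family, f

def sample_icl_sequence(n_points=32, d=8, rng=None):
    """Build one in-context sequence (x_1, y_1, ..., x_n, y_n) for a freshly sampled task."""
    rng = rng if rng is not None else np.random.default_rng()
    family, f = sample_task(d, rng)
    xs = rng.standard_normal((n_points, d))
    return family, xs, f(xs)
```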
Figure E: TD score at layer 6 over training. Over training, the TD score barely changes, indicating that these regression families may share a common underlying algorithm and that the model does not learn separate task vectors.
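For readers unfamiliar with the metric, the probe below illustrates the general idea behind task decodability: how accurately the task identity can be decoded from a layer's representations. This is a hedged proxy, not necessarily the exact TD definition used in the main manuscript.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def task_decodability_proxy(layer_reps, task_labels, folds=5):
    """Cross-validated accuracy of a linear probe predicting the task id from representations.

    layer_reps:  array of shape (num_prompts, hidden_dim) for one layer.
    task_labels: array of shape (num_prompts,) with integer task ids.
    """
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, layer_reps, task_labels, cv=folds).mean()
```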
Figure F: Number of ICL examples vs. Test MSE at Epochs 10, 100, and 290. The model learns linear, sparse linear, and leaky ReLU regression but fails at quadratic regression. We interpret these results as evidence that the model does not develop distinct task encodings and decodings for the selected regression tasks. These tasks can, in principle, be solved by similar algorithms; for instance, leaky ReLU regression can be implemented by renormalizing the outputs by their signs and then running standard linear regression (see the sketch below), so the model does not necessarily encode different regression algorithms. This aligns with observations by Kim et al. (Figure 6 and Section 4.2), who showed that learning a common structure across regression tasks transferably helps the learning of other regression tasks.
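The sketch below illustrates this reduction under illustrative assumptions (noise-free targets, leaky ReLU slope 0.1): since the leaky ReLU is invertible given the sign of the output, the nonlinear targets can be mapped back to linear ones and solved by ordinary least squares.

```python
import numpy as np

def leaky_relu(z, slope=0.1):
    return np.where(z > 0, z, slope * z)

rng = np.random.default_rng(0)
d, n = 8, 256
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = leaky_relu(X @ w_true)

# "Renormalize by signs": invert the leaky ReLU using the sign of each output.
z_recovered = np.where(y > 0, y, y / 0.1)

# Standard linear regression on the de-activated targets recovers w exactly (noise-free case).
w_hat, *_ = np.linalg.lstsq(X, z_recovered, rcond=None)
print(np.allclose(w_hat, w_true, atol=1e-6))  # True
```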
Figure G: TD score over layers at Epochs 10, 100, and 290. This plot, together with Figure E and the UMAP visualization in Figure H, demonstrates the lack of separation in the middle layer, suggesting that distinct task vectors are not formed over the course of training.
Figure H: UMAP Visualization of Representations Across Layers for Mixture of Regression Tasks
Figure I: Attention head pruning experiment on sparse linear regression tasks (the synthetic tasks described in Section 2.3 of our main manuscript). We pruned each attention head and measured the change in mean squared error (MSE) across 100 random sparse linear regression ICL sequences for each basis (same as in Figure 2 of our main manuscript). AIE stands for Attention Importance Estimation, defined as the change in MSE between the head-pruned model and the original model. Notably, at the fifth layer (labeled "layer 4" in the figure since indexing begins at zero), different attention heads correspond distinctly to different bases; for example, head 5 of layer 4 (l4h5) corresponds to base 0, while head 3 of the same layer (l4h3) corresponds to base 2. We argue that these experiments provide direct evidence that, in sparse linear regression tasks, the model implements structurally distinct linear regression algorithms for the four bases.
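As a sketch of the pruning protocol behind the AIE numbers, the function below computes the AIE matrix. The helper `evaluate_mse` is hypothetical: it is assumed to run the ICL sequences through the model with a boolean head mask (False meaning that head's output is zeroed out) and return the test MSE; the actual masking mechanism in our experiments may differ.

```python
import numpy as np

def attention_importance(model, sequences, n_layers, n_heads, evaluate_mse):
    """AIE[l, h] = MSE with head (l, h) pruned minus MSE of the intact model."""
    full_mask = np.ones((n_layers, n_heads), dtype=bool)
    base_mse = evaluate_mse(model, sequences, full_mask)  # intact model baseline
    aie = np.zeros((n_layers, n_heads))
    for l in range(n_layers):
        for h in range(n_heads):
            mask = full_mask.copy()
            mask[l, h] = False  # prune this single head
            aie[l, h] = evaluate_mse(model, sequences, mask) - base_mse
    return aie
```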
Figure J: Attention head pruning experiment on the mixture of regression families (see above). We pruned each attention head and measured the change in mean squared error (MSE) across 100 randomly generated ICL sequences for each regression type (same as in Figures D, E, and F on this website). AIE again denotes Attention Importance Estimation, defined as the change in MSE between the head-pruned model and the original model. Notably, except for the quadratic regression task (for which the model fails to learn an effective algorithm), we observed that attention heads are consistently shared across the linear regression, leaky ReLU, and sparse linear regression tasks. This suggests that the algorithms for these three regression tasks share structural similarities, aligning with the "common structure" observed by Kim et al. We argue that these experiments provide direct evidence that these three variants of linear regression are structurally indistinguishable.