Position info encoded in CNNs

Notes from reading an ICLR 2020 paper: How Much Position Information Do Convolutional Neural Networks Encode?

CNNs have been applied to learning the absolute coordinates of objects in an image, and they do it surprisingly well. The question is whether, and how, a CNN encodes position information.

The paper's hypothesis is that a CNN (e.g. VGG / ResNet) pre-trained on ImageNet has already encoded the position of every pixel of the input in some way.

To test this, the paper adds an extra convolution layer on top of the pre-trained CNN and uses it to convert the already encoded position information into certain spatial patterns. Assuming those spatial patterns can only be derived from position information, being able to produce them proves the position info has to be encoded there. The spatial patterns used include a vertical gradient (colour value goes from high at the top to low at the bottom), a 2D Gaussian (high value in the middle, dropping off gradually towards the edges), and so on.
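Here is a rough sketch of what such target patterns could look like (my own NumPy code, not the paper's; the function names and the 224x224 size are just for illustration):

```python
import numpy as np

def vertical_gradient(h, w):
    """Values go from 1.0 on the top row down to 0.0 on the bottom row."""
    return np.tile(np.linspace(1.0, 0.0, h)[:, None], (1, w))

def gaussian_2d(h, w, sigma=0.3):
    """High value in the centre, dropping off gradually towards the edges."""
    ys = np.linspace(-1.0, 1.0, h)[:, None]
    xs = np.linspace(-1.0, 1.0, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))

# 224x224 targets to match a typical ImageNet-sized input.
grad_target = vertical_gradient(224, 224)
gauss_target = gaussian_2d(224, 224)
```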

The pre-trained CNN (VGG / ResNet) was trained to classify objects in ImageNet and, on the face of it, has nothing to do with position information. It is used only for extracting features from images; the assumption is that the extracted features already contain position info.

To learn the conversion from position information to spatial patterns, the paper trains the extra convolution layer on some existing image dataset (any will do) as input X, with an arbitrarily selected spatial pattern applied as the target y for all X. Because the target is the same regardless of the image, this training teaches the extra layer to ignore the content of the image and use only the position information to output the spatial pattern. The extra layer is the only part trained with X and y in this paper.
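A rough PyTorch sketch of how I picture this setup (the paper's actual readout architecture differs; the single 3x3 conv, the MSE loss, and downsampling the target to the feature resolution are my simplifications, and `target_pattern` would be one of the patterns generated above):

```python
import torch
import torch.nn as nn
import torchvision

# Frozen pre-trained backbone, used purely as a feature extractor.
backbone = torchvision.models.vgg16(pretrained=True).features.eval()
for p in backbone.parameters():
    p.requires_grad = False

# The extra convolution layer (readout) is the only thing that gets trained.
# 512 = number of channels coming out of VGG16's conv features.
readout = nn.Conv2d(512, 1, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(readout.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def train_step(images, target_pattern):
    """images: (B, 3, 224, 224); target_pattern: (1, 1, H, W), same for every image."""
    with torch.no_grad():
        feats = backbone(images)              # (B, 512, 7, 7) for 224x224 inputs
    pred = readout(feats)                     # predicted spatial pattern
    target = nn.functional.interpolate(       # resize target to the feature resolution
        target_pattern, size=pred.shape[-2:])
    loss = loss_fn(pred, target.expand_as(pred))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```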

So, assuming the pre-trained CNN has encoded position information somewhere, the extra convolution layer outputs the spatial pattern based on that info, i.e. for the vertical gradient pattern it needs to learn to output a high value for a pixel at the top. The important thing is that it needs to know a pixel is at the top in the first place. If the extra convolution layer can output values according to where a pixel is (its position), that proves the position info is there.
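How closely the predicted map matches the ground-truth pattern can be measured with a rank correlation plus a mean absolute error (the paper reports metrics along these lines, if I remember correctly; the code below is my own sketch):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pred_map, target_map):
    """Compare a predicted spatial map against the ground-truth pattern."""
    pred, target = pred_map.ravel(), target_map.ravel()
    spc, _ = spearmanr(pred, target)       # rank correlation: is the ordering right?
    mae = np.mean(np.abs(pred - target))   # absolute error: are the values right?
    return spc, mae
```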

What if the position information is learned by the extra convolution layer?

The paper gets rid of the pre-trained model and trains only the extra layer directly on the image, and it performs much worse. So it concludes the position information is encoded in the pre-trained model.
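In code terms the ablation just swaps the input of the readout (again my sketch, not the paper's exact setup):

```python
import torch.nn as nn

# Ablation: no backbone. The same kind of readout is trained directly on the
# raw image (3 input channels instead of the backbone's 512 feature channels).
# If this alone could produce the spatial patterns, the backbone would be
# irrelevant; in practice it does much worse.
readout_only = nn.Conv2d(3, 1, kernel_size=3, padding=1)
```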

What if the spatial patterns are rigidly memorised by the extra layer, instead of being derived from position information?

Someone asked exactly the same question, but the authors don't seem to answer it well.

How does a pre-trained model encode position information?

Zero padding. The readout works well when the backbone uses zero padding, but not without padding or with other padding schemes (e.g. circular padding). Zero padding surrounds the image with a constant value, so the convolutions near the border see something that marks where the edge is, and absolute position can be inferred from that.
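A toy illustration of why the padding matters (my own example, not from the paper): run an all-ones 3x3 convolution over a constant image. With zero padding, pixels near the border see the padded zeros and get smaller outputs, so the output alone distinguishes border from interior; with circular padding every location looks the same.

```python
import torch
import torch.nn as nn

x = torch.ones(1, 1, 5, 5)  # a constant image: no content-based cues at all

conv_zero = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
nn.init.ones_(conv_zero.weight)  # all-ones kernel, just sums the 3x3 neighbourhood

conv_circ = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                      padding_mode='circular', bias=False)
nn.init.ones_(conv_circ.weight)

with torch.no_grad():
    print(conv_zero(x)[0, 0])  # corners=4, edges=6, interior=9 -> position is visible
    print(conv_circ(x)[0, 0])  # all 9s -> no way to tell where a pixel is
```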