Employing Scale & Attention

Goal

As a follow-up to this publication, we need an algorithm that achieves both high recall and high precision.

Intro

Previous work on segmenting brain regions from images of Nissl-stained rat brain tissue has shown that histology images offer different benefits at different scales. The convolutional segmentation network U-Net reaches greater recall when trained with large-scale images and greater precision when trained with small-scale images.

Here we turn to the literature and explore models designed with scale in mind. Specifically, we consider two models:

  • Multi-Scale U-Net (MSU-Net) Paper | Code

    • Uses multiple kernel sizes (3x3, 7x7)

    • Uses multiple convolutional sequences

  • Multi-Scale Attention Net (MA-Net) Paper | Code

    • Uses Position-wise Attention Block and Multi-scale Fusion Attention Block

The plan is to compare the performance of the models trained with images of the original resolution.
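
As a rough sketch of how this comparison could be set up, here is one possible instantiation of the three models. The use of the segmentation_models_pytorch package for U-Net and MA-Net is an assumption for illustration; the actual implementations are linked above, and MSU-Net would come from its own repository.

```python
# Hypothetical setup for the comparison; the library choice is an assumption,
# not necessarily what was used for the experiments below.
import segmentation_models_pytorch as smp

models = {
    "U-Net": smp.Unet(encoder_name="resnet34", in_channels=3, classes=1),
    "MA-Net": smp.MAnet(encoder_name="resnet34", in_channels=3, classes=1),
    # "MSU-Net": MSUNet(in_channels=3, classes=1),  # hypothetical import from the linked MSU-Net code
}
```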

Training Data

We will be using the same data as in the previous publication but with the following partition for training, validating, and testing:

  • Train: 'lvl23.png','lvl30.png','lvl21_1.png','lvl32.png'

  • Valid: 'lvl31.png','lvl28.png','lvl22.png','lvl24.png'

  • Test: 'lvl21.png','lvl25.png'

As in the previous publication, we sample 512x512-pixel image patches from the original images. However, this time we focus only on the fornix, and we use a slight variation in the sampling: patch centers are drawn from a multivariate normal distribution centered on the fornix with covariance matrix [[height**2, 0], [0, width**2]], instead of being sampled uniformly within a circle of diameter (height + width)/2.
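
A minimal sketch of this patch-center sampling is shown below; the variable names (fornix center, height, and width) and the clipping of centers to the image bounds are our own assumptions.

```python
import numpy as np

def sample_patch_centers(cx, cy, height, width, n_patches, img_h, img_w,
                         patch_size=512, rng=None):
    """Sample patch centers from a multivariate normal centered on the fornix.

    (cx, cy): center of the fornix; height, width: its extent in pixels.
    The covariance [[height**2, 0], [0, width**2]] spreads samples according
    to the region's shape instead of a fixed circular boundary.
    """
    rng = np.random.default_rng() if rng is None else rng
    cov = [[height ** 2, 0.0], [0.0, width ** 2]]
    # First coordinate is the row (y) axis, second is the column (x) axis.
    centers = rng.multivariate_normal(mean=[cy, cx], cov=cov, size=n_patches)
    # Keep each patch fully inside the image (an assumption on our part).
    half = patch_size // 2
    centers[:, 0] = np.clip(centers[:, 0], half, img_h - half)
    centers[:, 1] = np.clip(centers[:, 1], half, img_w - half)
    return centers.astype(int)
```

The 1024x1024 and 2048x2048 patches used later in this post would only require changing patch_size.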

Swapping Networks

Here we show the average intersection over union on the validation set during training for the three networks. Overall, MA-Net seems to outperform the other networks by reaching a larger IoU sooner while remaining consistent throughout training.

Now, if we evaluate the three networks on one entire test image (effectively testing the models’ performance on a larger portion of negative samples), we recreate the problem of false positives shown in our previous experiments.

  • U-Net: IoU = 0.02, Recall = 0.84, Precision = 0.02

  • MSU-Net: IoU = 0.01, Recall = 0.99, Precision = 0.01

  • MA-Net: IoU = 0.02, Recall = 0.79, Precision = 0.02
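
For reference, here is a minimal sketch of how these pixel-wise metrics can be computed by tiling a full image into patches. The non-overlapping tiling, the sigmoid output, and the handling of image borders are assumptions about the evaluation procedure rather than the exact code used.

```python
import torch

@torch.no_grad()
def evaluate_full_image(model, image, mask, patch=512, thresh=0.5, device="cuda"):
    """Tile a full image into non-overlapping patches, predict each patch,
    and accumulate pixel-wise IoU / recall / precision for a single class.
    Border remainders that do not fit a full patch are skipped for simplicity."""
    model.eval()
    tp = fp = fn = 0
    _, H, W = image.shape  # image: (C, H, W) tensor; mask: (H, W) binary tensor
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            tile = image[:, y:y + patch, x:x + patch].unsqueeze(0).to(device)
            prob = torch.sigmoid(model(tile))[0, 0].cpu()
            pred = prob > thresh
            true = mask[y:y + patch, x:x + patch].bool()
            tp += (pred & true).sum().item()
            fp += (pred & ~true).sum().item()
            fn += (~pred & true).sum().item()
    iou = tp / (tp + fp + fn + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    return iou, recall, precision
```

The thresh argument corresponds to the 0.5 cutoff used here; the MSU-Net results at 1024x1024 below use 0.4 instead.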

We hypothesize that all networks still suffer from a high false-positive rate on entire images because, even though these networks consider multiple scales, their input spans a physical distance in the tissue that is too small to provide spatial context. In our previous experiments, more spatial context reduced the false-positive rate, so it makes sense that a network built to consider multiple scales will still produce many false positives if it only sees a limited view of the tissue at a time.

Spatial Context

The spatial context of histology images appears to be critical in reducing false-positive Fornix predictions. In our previous experiments, we increased spatial context by downscaling the images in order to fit more physical space in the input of the model (512x512 pixels). However, downscaling the data resulted in a decrease in the recall metric; since the networks no longer had detailed information, they were not able to fully segment the fornix. Here we increase spatial context by modifying the network input dimensions instead of downscaling the data. This will ensure the network has access to detail and spatial context at the same time.

We trained the same three networks with two different data sets. Specifically, we used the same sampling scheme as above to create one dataset with image patches of 1024x1024 pixels and another with image patches of 2048x2048 pixels.
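
As a sketch, only the patch size changes in the hypothetical sampler above; the batch sizes listed here follow those reported for MA-Net in the Memory Space section and are an assumption for the other two networks.

```python
# Patch sizes for the three datasets and the per-batch image counts used later.
configs = [
    {"patch_size": 512,  "batch_size": 16},
    {"patch_size": 1024, "batch_size": 4},
    {"patch_size": 2048, "batch_size": 1},
]
```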

In both cases, MA-Net outperforms the other two models. We can now evaluate the networks on entire images as before to observe any effects on the false-positive rate.

Inputs of 1024x1024

  • U-Net: IoU = 0.0, Recall = 0.0, Precision = 0.0

  • MSU-Net: IoU = 0.02*, Recall = 0.95*, Precision = 0.02*

  • MA-Net: IoU = 0.1, Recall = 0.93, Precision = 0.1

*Thresholded at 0.4 instead of the 0.5 used for all other evaluations.

Inputs of 2048x2048

  • U-Net: IoU = 0.0, Recall = 0.0, Precision = 0.0

  • MSU-Net: IoU = 0.02, Recall = 0.09, Precision = 0.03

  • MA-Net: IoU = 0.49, Recall = 0.97, Precision = 0.49

It is evident MA-Net effectively makes use of the larger input size. As a result, the predictions contain fewer false-positive pixels while still producing a good recall score.

MA-Net Discussion

MA-Net trains consistently well across the three input sizes we tried. A possible explanation is that the model's attention blocks allow it to consider information from the entire input image. U-Net and MSU-Net, in contrast, have receptive fields limited by their convolutional blocks. This means that even with multiple kernel sizes, at least the ones proposed in MSU-Net, the convolutional configurations of these two models do not capture the scale necessary to segment brain regions from histology images. However, increasing the input size for MA-Net comes at a cost: the memory required by its attention blocks.
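
To illustrate why attention over spatial positions gives a global receptive field but grows in memory with input size, here is a minimal sketch of a position-wise (spatial self-) attention block in the spirit of MA-Net's Position-wise Attention Block. The class name, layer names, and channel-reduction factor are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Minimal spatial self-attention: every position attends to every other
    position, so the attention map has (H*W) x (H*W) entries. Doubling the
    input side length quadruples H*W and grows this map sixteen-fold, which
    is where much of the extra memory goes."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C/r)
        k = self.key(x).flatten(2)                          # (B, C/r, HW)
        attn = torch.softmax(q @ k, dim=-1)                 # (B, HW, HW)
        v = self.value(x).flatten(2)                        # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)   # aggregate over all positions
        return self.gamma * out + x                         # residual connection
```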

Memory Space

For reference, the U-Net model trained with a batch size of 16 images, each with dimensions 512x512 pixels, occupies 7.5 GB of memory. We used an NVIDIA GPU with 12 GB of memory.

The memory occupancy of MA-Net is as follows:

  • 6GB when trained with 16 images of 512^2 pixels per batch

  • 7GB when trained with 4 images of 1024^2 pixels per batch

  • 10GB when trained with 1 image of 2048^2 pixels per batch

Since the total number of pixels per batch is the same in the three settings, the increase in memory can be attributed to the growth of the model's intermediate activations, most notably the attention maps, with input size.
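
A quick sanity check of the equal pixel budget per batch:

```python
# 16 x 512^2, 4 x 1024^2, and 1 x 2048^2 all amount to the same number of
# pixels per batch, so the memory differences come from the model, not the data.
for n_images, side in [(16, 512), (4, 1024), (1, 2048)]:
    print(f"{n_images} x {side}^2 = {n_images * side ** 2:,} pixels per batch")
```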

Lastly, it may be possible to further increase the input size for even more spatial context, and it would only be a matter of keeping it all in memory.

Conclusion

We found that MA-Net achieves both high recall and high precision. The model is able to do this by using the information in large-scale images while also considering information across a larger physical distance through its attention mechanisms.

We did not experiment with input sizes larger than 2048^2 pixels. It may be possible to input larger images for even more context. Considering that MA-Net trains well across the three input sizes we tried, we cannot confirm we have reached a ceiling on input size. Allowing larger inputs could be useful for images of entire tissue sections; right now, one full-sized (original-resolution) image in our dataset spans less than half of an entire tissue section.