Afsayh Saquib

Exploring Capsule Networks

Problems with Convolutional Networks

Convolutional networks are still at the forefront of image classification because they work so well on so many different problems - their capacity for what is essentially translated feature detection, without much regard for orientation or relative spacing in an image, lets them classify images whose layout might be very different from the ones seen in training. To achieve this, these networks build high-level features, the highest being the actual classification, from lower-level features like edge detection or image entropy. This is done most effectively using a technique called "max pooling", which downsamples each layer by keeping only the strongest activation in each local region, allowing successive convolutional layers to detect higher- and higher-order features. However, there is a catch: by pooling these layers we lose information about exactly where in the image a feature was detected, and we also turn away from one of our original aims for neural networks - to replicate human patterns of learning. By shedding the orientational and spatial information in an image for quick gains in feature detection, we give up the human ability to recognize the same object regardless of orientation, and in some cases lighting, after having been exposed to only a single orientation.
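
As a toy illustration (not from the paper, just a NumPy sketch), 2x2 max pooling maps two feature maps whose only difference is the position of an activation to exactly the same output - the "where" is discarded:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the max of each 2x2 block."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two maps with the same strong response in different positions...
a = np.zeros((4, 4)); a[0, 0] = 9.0
b = np.zeros((4, 4)); b[1, 1] = 9.0

# ...pool to identical outputs: the exact location within each block is lost.
print(max_pool_2x2(a))  # [[9. 0.] [0. 0.]]
print(max_pool_2x2(b))  # [[9. 0.] [0. 0.]]
```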

For example, would you be able to tell that these two images of a cake are actually the same cake? A CNN would have some difficulty.

Overview of Capsule Networks

In a paper (https://arxiv.org/pdf/1710.09829.pdf) from late 2017, Geoffrey Hinton and his collaborators demonstrated their idea for a new method of training on images, the so-called Capsule Network, where a capsule is "a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part". In other words, a group of neurons acts as a feature detector for object parts, not dissimilar to our previous CNN model. However, through a process known as dynamic routing, capsules at a lower level iteratively make predictions about the activity vectors of the capsules at the next layer, and when predictions from multiple capsules agree on the feature detected by a higher-level capsule, that capsule "activates" and begins the same process, making the contributing capsules its children. In the paper's own words, this creates a parse-tree-like structure; the lack of pooling allows the network to retain the spatial information of a feature identified by a capsule, while keeping all but the final layer of capsules convolutional in order to replicate the learning across the image.
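
To make the routing procedure concrete, here is a minimal NumPy sketch of the squash non-linearity and routing-by-agreement as I understand them from the paper; the capsule counts and dimensions below are just illustrative, and a real implementation would run this inside the training graph:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps a vector's orientation but shrinks its
    length into [0, 1) so it can behave like a probability."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Routing-by-agreement for one example.
    u_hat: predictions from lower capsules for each upper capsule,
           shape (num_lower_caps, num_upper_caps, upper_dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                       # routing logits
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients
        s = (c[:, :, None] * u_hat).sum(axis=0)                # weighted sum per upper capsule
        v = squash(s)                                          # upper-capsule outputs
        b += (u_hat * v[None, :, :]).sum(axis=-1)              # agreement updates the logits
    return v

# e.g. 1152 primary capsules predicting 10 sixteen-dimensional digit capsules
u_hat = np.random.randn(1152, 10, 16)
print(dynamic_routing(u_hat).shape)  # (10, 16)
```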

Issues with the paper

The paper states that while there are many ways to implement these capsule nets, its aim is only to show one "straightforward" implementation, which they test on MNIST, the handwritten digit dataset. While this dataset is often used to benchmark ideas and algorithms in the machine learning community, it is quite simplistic in terms of image content, and ideas that are shown to mesh well with the dataset are not guaranteed to work with anything else. Nonetheless, their structure is as follows:


Here the structure on the left is as previously described, and the DigitCaps layer is the final capsule layer, where the length of each class's activity vector represents the probability that an instance of that class is present; the vectors themselves can be used for digit reconstruction, as shown on the right. While this implementation is not slow on MNIST, speed becomes an issue when the dataset is more complex.
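
Since classification comes down to the lengths of the DigitCaps vectors, the paper trains them with a margin loss on those lengths. Below is a small NumPy sketch of that loss; the m+ = 0.9, m- = 0.1, and lambda = 0.5 values are the ones given in the paper, while the batching details are my own simplification:

```python
import numpy as np

def margin_loss(v, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss computed on capsule lengths.
    v: DigitCaps outputs, shape (batch, num_classes, capsule_dim)
    targets: one-hot labels, shape (batch, num_classes)."""
    lengths = np.linalg.norm(v, axis=-1)                            # ||v_k|| per class
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2      # penalize short vectors for the true class
    absent = lam * (1 - targets) * np.maximum(0.0, lengths - m_minus) ** 2  # penalize long vectors for other classes
    return (present + absent).sum(axis=1).mean()
```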

Testing an implementation

I initially wanted to test the paper's results for myself and also wanted to try out PyTorch, as I had no experience with it - however, the most prevalent implementation of Capsule Networks in PyTorch (https://github.com/gram-ai/capsule-networks) is a complete pain to install on Windows and caused me much consternation. I ended up using a TensorFlow implementation (https://github.com/naturomics/CapsNet-Tensorflow) which was set up to run on MNIST; all of my training was run on an Nvidia GTX 1080 Ti. The MNIST results were as stated in the paper, but I was surprised to see how few iterations it actually took for the network to converge, which furthered my suspicion that the combination of hyperparameter optimization and network structure with the simplicity of MNIST trivialized the problem.

In order to try a more feature-rich dataset, I attempted to use smallNORB (https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/), which has fewer classes than MNIST but incorporates rotation and contains objects more complicated than single digits. The training time for this dataset, however, was immensely long. I'm not sure if this was due to my graphics card, the TensorFlow implementation, or just the operations required by a Capsule Net - I was unable to train for as many iterations as on MNIST, but surprisingly the loss still converged almost as fast. Testing after these few iterations, I was able to get ~98% accuracy.

smallNORB dataset examples

This was not the stopping point that I had envisioned when proposing this project; however, when I attempted to move on from this dataset and train on something more concrete - head CT scans, to classify intracranial hemorrhages or cranial fractures (http://headctstudy.qure.ai/) - the amount of time it took to train kept me from getting any meaningful results.

Conclusion

While I wasn't able to do exactly what I'd set out to do in my project proposal, I was able to see first-hand the reasons why I'd heard that Capsule Networks are slow - the iterative routing mechanism blows up the number of calculations that need to be done at each layer, and if the images being trained on are large and feature-rich, the feature detection at each capsule becomes intensive as well. If I continue my investigation of these networks, I would set aside more time to train on the head CT data, as that would really show whether, alongside being slow, Capsule Nets are also hard to train outside of the specific datasets they have been tuned for.