Our project aims to achieve higher accuracy than the current state-of-the-art architecture for Cholecystectomy surgery. EndoNet is a neural network architecture proposed for recognition tasks on laparoscopic videos and has shown significant results; it uses an extended version of the AlexNet architecture. However, since the inception of EndoNet, many novel object-detection architectures have been introduced that offer better accuracy, so we believe the accuracy for the Cholecystectomy surgery phase-detection problem can be improved.
Below is a theoretical comparison of three of the currently best-performing architectures: AlexNet, VGGNet, and ResNet.
From the table above, we can see that AlexNet and ResNet152 use almost the same number of parameters, yet ResNet152 achieves roughly 10% higher accuracy than AlexNet. We can also see that VGGNet and ResNet152 achieve similar accuracy (slightly lower for VGGNet), but VGGNet uses nearly twice as many parameters as ResNet152; with VGGNet we would therefore get a model with higher training time and lower accuracy.
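As a quick sanity check on the parameter counts discussed above, the following minimal sketch (assuming PyTorch and torchvision are installed; VGG-16 is used as the representative VGGNet variant) counts the trainable parameters of each architecture:

```python
import torchvision.models as models

# Rough parameter-count comparison of the three architectures discussed above.
# Models are instantiated without pretrained weights; only the architecture matters here.
for name, ctor in [("AlexNet", models.alexnet),
                   ("VGG-16", models.vgg16),
                   ("ResNet-152", models.resnet152)]:
    model = ctor()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```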
So, of the three networks above, ResNet seems the most promising. Although ResNet152 requires more computation time and energy than AlexNet, our problem statement gives accuracy much more weight than time and power consumption. Hence, we decided to use the ResNet152 architecture for tool detection in Cholecystectomy surgery videos.
Since ResNet consumes more energy and takes longer to train, we decided to experiment with ResNet variants of varying depth, such as ResNet34, where 34 is the number of layers in the architecture. Architectures with fewer layers take less time to train. We therefore looked at the reported error rates for the different ResNet depths and found a significant difference in error between consecutive architectures. So, to obtain results with higher accuracy, we decided to use the architecture with 152 layers, i.e. ResNet152.
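As an illustration of how such a backbone can be adapted to our task, the sketch below (a minimal example assuming PyTorch/torchvision; the output size of 7 assumes the seven surgical tools annotated in Cholec80) replaces the final ImageNet classifier of a ResNet with a task-specific head, so that switching between depths such as ResNet34 and ResNet152 is a one-line change:

```python
import torch.nn as nn
import torchvision.models as models

def build_backbone(depth: int = 152, num_outputs: int = 7) -> nn.Module:
    """Build a ResNet of the given depth with a task-specific output layer.

    num_outputs=7 assumes the seven tools annotated in Cholec80; adjust it
    for phase recognition or other label sets.
    """
    constructors = {34: models.resnet34, 50: models.resnet50, 152: models.resnet152}
    model = constructors[depth]()
    # Replace the 1000-way ImageNet classifier with our own head.
    model.fc = nn.Linear(model.fc.in_features, num_outputs)
    return model

# Deeper networks trade training time and energy for accuracy.
resnet34 = build_backbone(depth=34)
resnet152 = build_backbone(depth=152)
```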
We also validated these theoretical results on the Cholec80 dataset; the outcomes are reported in the Results section.
ResNets, or residual networks, form an important building block for many computer vision problems. They learn residual representation functions instead of learning the signal representation directly, and they support much deeper architectures of up to 152 layers (ResNet152). Skip connections, also known as shortcut connections, pass the input of a layer unchanged to a later layer; it is these skip connections that make such deep networks possible. ResNet also won ILSVRC 2015 in image classification, detection, and localization, which strengthened our confidence in the architecture.
Despite having deep architectures, some with 152 layers, ResNets effectively address the vanishing/exploding gradient problem. The figure above shows how a skip connection adds the input x to the output of a few weight layers, so that the computed output becomes H(x) = F(x) + x. The weight layers thus learn a residual mapping F(x), and even if the gradient through them vanishes, the identity path still carries the signal back, so the gradients can be recovered and the vanishing gradient problem is avoided.
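To make the shortcut concrete, here is a minimal residual block sketch in PyTorch (simplified from ResNet's basic block; it assumes the input and output have the same number of channels, so no downsampling path is needed), showing how the input x is added back to the output of the weight layers:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                           # carried unchanged by the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                   # H(x) = F(x) + x
        return self.relu(out)
```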
Surgical phases evolve over time, so it is natural that the current phase depends on neighboring phases. To capture this temporal information, we focused on a recurrent neural network approach [9]. We experimented with three different architectures: RNN, LSTM, and GRU. GRU performed best among the three on our dataset.
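The comparison itself only requires swapping the recurrent layer on top of the per-frame CNN features. A minimal sketch of this idea follows (the feature dimension of 2048, the hidden size of 128, and the seven phases of Cholec80 are illustrative assumptions, not tuned values):

```python
import torch.nn as nn

class TemporalPhaseModel(nn.Module):
    """Recurrent head over per-frame CNN features; rnn_type selects RNN, LSTM, or GRU."""

    def __init__(self, rnn_type: str = "GRU", feat_dim: int = 2048,
                 hidden_dim: int = 128, num_phases: int = 7):
        super().__init__()
        rnn_cls = {"RNN": nn.RNN, "LSTM": nn.LSTM, "GRU": nn.GRU}[rnn_type]
        self.rnn = rnn_cls(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_phases)

    def forward(self, features):           # features: (batch, time, feat_dim)
        outputs, _ = self.rnn(features)    # one hidden state per frame
        return self.classifier(outputs)    # per-frame phase logits
```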
A Gated Recurrent Unit (GRU), as its name suggests, is a variant of the RNN architecture that uses gating mechanisms to control and manage the flow of information between cells in the network. GRUs were introduced only in 2014 by Cho et al. and can be considered a relatively new architecture, especially compared to the widely adopted LSTM, which was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber.
The structure of the GRU allows it to adaptively capture dependencies in long sequences of data without discarding information from earlier parts of the sequence. This is achieved through its gating units, similar to the ones in LSTMs, which address the vanishing/exploding gradient problem of traditional RNNs. These gates regulate the information to be kept or discarded at each time step. Apart from its internal gating mechanism, the GRU functions like an RNN: sequential input data is consumed by the GRU cell at each time step along with the memory, also known as the hidden state. The hidden state is then re-fed into the cell together with the next input in the sequence, and this process continues like a relay system, producing the desired output. [10] [11]
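The relay of the hidden state described above can be written out explicitly with a single GRU cell. The short sketch below illustrates the mechanism (the sequence length, feature dimension, and hidden size are hypothetical, and random tensors stand in for the per-frame features):

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim, seq_len = 2048, 128, 16
cell = nn.GRUCell(feat_dim, hidden_dim)

frames = torch.randn(seq_len, 1, feat_dim)   # one feature vector per frame, batch size 1
h = torch.zeros(1, hidden_dim)               # initial hidden state (the "memory")

for x_t in frames:
    h = cell(x_t, h)   # the hidden state is re-fed together with the next input
# h now summarizes the whole sequence and can be passed to a phase classifier.
```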
Gated Recurrent Unit, fully gated version
By Jeblad - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=66225938