The whole process is divided into two steps: Tool Recognition and Phase Recognition.
Tool usage signals can provide more discriminative features for phase recognition. The aim of tool recognition is to capture information about the tools (e.g., their motion and orientation) together with visual cues (e.g., lighting and color) that can improve phase recognition. Beyond aiding phase detection, tool detection is an interesting problem in its own right, with applications such as automatically indexing a surgical video database by labeling the tools present in each video. Here we use the ResNet-152 architecture for multi-label classification over 7 tool classes.
Various feature types have been used for phase recognition, such as binary tool usage signals. However, these signals are typically obtained through a cumbersome manual annotation process. Another commonly used feature type is handcrafted visual features, such as pixel values or combinations of color, texture, and shape descriptors. Because such features are designed by hand, potentially significant characteristics of the images are lost.
We plan to learn visual features directly from surgical videos using ResNet-152, since deep residual networks have dramatically improved results on a variety of image recognition tasks in recent years. Learning features automatically is also preferable for laparoscopic videos, where visual challenges such as a non-static camera make it difficult to design suitable features by hand.
In the feature extraction step, the model trained for tool detection is used to produce image features that are fed to the phase recognition task for multi-class phase classification. Since the data is temporal, we pursue a recurrent neural network approach, as in [9], and evaluate plain RNN, GRU, and LSTM architectures. Of these three, the GRU produced the best results.
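The temporal model above can be sketched as follows (assuming PyTorch; the hidden size, number of phases, and sequence length are illustrative assumptions, as the text does not specify them). Each frame is represented by the 2048-dimensional feature vector from the penultimate layer of the tool detection ResNet-152, a GRU processes the sequence, and a linear head emits per-frame phase logits for multi-class classification:

```python
import torch
import torch.nn as nn

FEAT_DIM = 2048    # ResNet-152 penultimate-layer feature size
NUM_PHASES = 7     # assumption: phase count not stated in the text
HIDDEN = 128       # assumption: illustrative hidden size

class PhaseGRU(nn.Module):
    """GRU over per-frame ResNet features, with a phase classifier head."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(FEAT_DIM, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_PHASES)

    def forward(self, feats):           # feats: (batch, time, FEAT_DIM)
        out, _ = self.gru(feats)        # hidden state at every time step
        return self.head(out)           # per-frame phase logits

model = PhaseGRU()
feats = torch.randn(2, 16, FEAT_DIM)    # 2 clips of 16 frames each
logits = model(feats)                   # shape: (2, 16, NUM_PHASES)
phases = logits.argmax(dim=-1)          # predicted phase per frame
```

Swapping `nn.GRU` for `nn.RNN` or `nn.LSTM` reproduces the other two variants compared in the text (the LSTM additionally returns a cell state in its hidden tuple).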