We used an Anaconda environment for package management. The model was trained on the Cholec80 dataset using a server with 16 CPUs, 80 GB of memory, four Tesla T4 GPUs (16 GB VRAM each), and a 720 GB SSD. CUDA/cuDNN, PyTorch, scikit-learn, tmux, and other relevant libraries were installed. A single GPU was used for training and validation.
We divided the 80 videos of the Cholec80 dataset as follows:
Train dataset: 40 videos (01-40)
Validation dataset: 40 videos (01-40; validation was run on the training videos)
Test dataset: 40 videos (41-80)
Each extracted frame was passed through several transformations: resizing to 224x224 pixels, random resized cropping to 224x224 pixels, random horizontal flipping, and image normalization.
OpenCV (Open Source Computer Vision) is a computer vision and machine learning library used in most modern image-processing applications, and it plays a significant role in accelerating machine-learning-based computer vision projects. The videos in the Cholec80 dataset were opened with OpenCV's VideoCapture() function, and the individual frames were extracted and saved with the imwrite() function. We extracted frames at 1 fps; the dataset's videos are recorded at 25 fps.
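The extraction step can be sketched as follows. The helper names, output filename pattern, and JPEG format are assumptions for illustration; the 25 fps to 1 fps downsampling (keep every 25th frame) follows the text.

```python
import os

def sampled_frame_indices(total_frames, src_fps=25, target_fps=1):
    """Indices of the frames to keep when downsampling src_fps video to target_fps."""
    step = round(src_fps / target_fps)   # 25 fps -> 1 fps keeps every 25th frame
    return list(range(0, total_frames, step))

def extract_frames(video_path, out_dir, src_fps=25, target_fps=1):
    """Read a video with OpenCV and save the sampled frames as JPEGs."""
    import cv2  # imported here so the pure helper above stays OpenCV-free
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(sampled_frame_indices(total, src_fps, target_fps))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```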
The model was trained using PyTorch.
The figure below shows our training pipeline for tool recognition at a high level.
Number of epochs: 3
SGD learning rate: 0.001
Binary Cross Entropy Loss (BCE) function
Transfer learning (pretrained weights) was used for ResNet-152 so that the weights converge in relatively less time.
The model outputs a probability for each of the seven tool classes. To determine whether all possible labels are predicted correctly, we used the following rule:
If the actual label is 1 and the predicted probability for that tool is greater than 0.5, the prediction is counted as correct.
If the actual label is 0 and the predicted probability for that tool is less than 0.5, the prediction is counted as correct.
In all other cases the prediction is counted as incorrect.
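The rule above translates directly into code; the function name is hypothetical.

```python
def prediction_correct(prob, label, threshold=0.5):
    """Per-tool correctness rule from the text: a positive label needs
    probability above the threshold, a negative label needs probability
    below it; everything else (including exactly 0.5) is incorrect."""
    if label == 1:
        return prob > threshold
    if label == 0:
        return prob < threshold
    return False
```

A frame counts as fully correct only when this holds for all seven tools.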
The ResNet-152 model trained for tool recognition is then used to extract the tool-specific features it learned. This is done by taking the activations of the last hidden layer before the softmax layer. These features are then used for phase recognition. The aim of training on tool features was to capture information (e.g. motion and orientation of the tools) and visual cues (e.g. lighting and color) that could potentially enhance phase recognition. [9]
High-level flow depicting how features were extracted based on tool recognition results.
For phase recognition, we trained three recurrent architectures - RNN, GRU, and LSTM - on the tool features extracted during feature extraction.
SGD learning rate: 0.001
Number of epochs: 50
Cross Entropy loss function
Input sequence lengths of 100 and 200 were tested, corresponding to roughly 1.7 and 3.3 minutes of video at the 1 fps extraction rate. This is a reasonable choice, since most phases span a few minutes.
The model produces a probability for each phase class for every frame it considers. The class with the maximum probability is taken as the final phase prediction for that frame.
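The LSTM variant of this stage can be sketched as follows. The class name and hidden size (128) are assumptions for illustration; the 2048-d input matches the ResNet-152 features, and the 7 outputs match the Cholec80 phases.

```python
import torch
import torch.nn as nn

class PhaseLSTM(nn.Module):
    """Per-frame phase classifier over a sequence of tool features."""
    def __init__(self, feat_dim=2048, hidden=128, num_phases=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_phases)

    def forward(self, x):                 # x: (batch, seq_len, feat_dim)
        out, _ = self.lstm(x)
        return self.fc(out)               # per-frame phase logits

model = PhaseLSTM()
feats = torch.randn(1, 100, 2048)         # a 100-frame sequence of tool features
logits = model(feats)                     # (batch, seq_len, 7)
phase_preds = logits.argmax(dim=-1)       # max-probability phase per frame
```

The GRU and plain RNN variants differ only in swapping nn.LSTM for nn.GRU or nn.RNN.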