NEURAL NETWORK MODEL

BASIC DESIGN

In machine learning, the embedding vector is an important concept and tool in which knowledge is embedded. For example, word-embedding vectors reflect the vocabulary relationships learnt by Word2Vec, and face-embedding vectors capture the facial features learnt by a Siamese network. When an embedding vector is placed in the middle of a neural network, an encoder/decoder architecture is formed, which is capable of handling various tasks that involve understanding information (e.g. language translation with RNNs).

In computer vision, this model structure is particularly popular for reconstruction tasks (e.g. in physiognomy and paleontology). After studying various papers on reconstruction tasks, we based our model on the one proposed by Henzler et al. (2018) for the reconstruction of mammalian crania.

A single 2D input image is encoded as an embedding tensor through a series of convolution layers. The embedding tensor represents the knowledge extracted from the 2D input image, and it is passed through a series of deconvolution layers in the decoder to be reconstructed as a 3D volume.

MODEL DETAILS

Our model has 114 layers in the encoder and 25 layers in the decoder. The encoder reduces the spatial resolution of the input and increases the number of channels, whereas the decoder increases the spatial resolution.

As in other CNN models, a convolution layer, a batch-normalization layer, and a ReLU layer form a basic block. Three basic blocks form one residual block. Residual blocks have been widely used in deep learning since the introduction of ResNet in 2015, as the skip-connection design effectively shortens training time without sacrificing performance.
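
To make this concrete, the sketch below shows one way these blocks could be written in Keras. It is a minimal illustration, not our exact implementation: the 3 x 3 kernel size and the 1 x 1 projection on the shortcut are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    def basic_block(x, filters):
        # Basic block: convolution, batch normalization, ReLU.
        x = layers.Conv2D(filters, kernel_size=3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

    def residual_block(x, filters):
        # Residual block: three basic blocks plus a skip connection.
        shortcut = x
        y = basic_block(x, filters)
        y = basic_block(y, filters)
        y = basic_block(y, filters)
        if shortcut.shape[-1] != filters:
            # 1x1 projection so the shortcut channel count matches (assumed).
            shortcut = layers.Conv2D(filters, kernel_size=1)(shortcut)
        return layers.Add()([shortcut, y])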

Every three residual blocks form a down block in our model. A down block halves the spatial resolution, so five down blocks (256 → 128 → 64 → 32 → 16 → 8) squeeze the original 256 x 256 input image to an 8 x 8 matrix (with 256 channels) that serves as the embedding tensor.
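
Continuing the sketch above, a down block and the encoder could look as follows. The per-stage channel counts and the use of a strided convolution for downsampling are assumptions, and this sketch does not reproduce the exact 114-layer count of our encoder.

    def down_block(x, filters):
        # Down block: three residual blocks, then halve the spatial resolution.
        for _ in range(3):
            x = residual_block(x, filters)
        return layers.Conv2D(filters, kernel_size=3, strides=2, padding="same")(x)

    def build_encoder():
        # Encoder sketch: five halvings take 256 -> 128 -> 64 -> 32 -> 16 -> 8.
        inputs = layers.Input(shape=(256, 256, 1))  # single-channel input image
        x = inputs
        for filters in (16, 32, 64, 128, 256):
            x = down_block(x, filters)
        return tf.keras.Model(inputs, x)  # embedding tensor: (8, 8, 256)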

In the decoder, deconvolution layers are key. Each deconvolution layer doubles the spatial resolution of its input, so after four deconvolution stages the 8 x 8 embedding is magnified to a 128 x 128 x 128 output volume. It is worth noting that while the spatial resolution refers to the x-y plane, the number of channels in the decoder refers to the z-axis. Therefore, in our model the x- and y-axes and the z-axis are not treated symmetrically, as the paradigm of 3D image formation along these axes is different.
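
A corresponding decoder sketch, again with assumed channel counts and kernel sizes, continuing the code above:

    def build_decoder():
        # Decoder sketch: each transposed convolution doubles the spatial
        # resolution, 8 -> 16 -> 32 -> 64 -> 128; the final 128 channels
        # are read as slices along the z-axis.
        embedding = layers.Input(shape=(8, 8, 256))
        x = embedding
        for filters in (256, 128, 128, 128):
            x = layers.Conv2DTranspose(filters, kernel_size=3, strides=2,
                                       padding="same", activation="relu")(x)
        volume = layers.Conv2D(128, kernel_size=3, padding="same")(x)
        return tf.keras.Model(embedding, volume)  # output: (128, 128, 128)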

Skip connections are also constructed between the encoder and the decoder; they share details between corresponding layers of the encoder and decoder while keeping training efficient, as sketched below.
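
The sketch below fuses encoder and decoder features in U-Net style, i.e. by concatenation at matching resolutions; both the concatenation and the pairing of stages are our assumptions.

    def build_model():
        # End-to-end sketch: encoder, decoder, and skip connections that
        # concatenate encoder features into the decoder at equal resolutions.
        inputs = layers.Input(shape=(256, 256, 1))
        x, skips = inputs, []
        for filters in (16, 32, 64, 128, 256):   # encoder: 256 -> 8
            x = down_block(x, filters)
            skips.append(x)
        for filters, skip in zip((256, 128, 128, 128), reversed(skips[:-1])):
            x = layers.Conv2DTranspose(filters, kernel_size=3, strides=2,
                                       padding="same", activation="relu")(x)
            x = layers.Concatenate()([x, skip])  # share encoder details
        volume = layers.Conv2D(128, kernel_size=3, padding="same")(x)
        return tf.keras.Model(inputs, volume)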

The model is built with the Keras/TensorFlow framework.

TRAINING PROCESS

The learning rate starts at 0.015 and approaches 0.01 asymptotically through learning-rate decay. Stochastic gradient descent is used without mini-batching, i.e. with a batch size of one.
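
The exact decay law is not documented here; the sketch below assumes an exponential approach from 0.015 towards a floor of 0.01, with plain SGD.

    import tensorflow as tf

    class AsymptoticDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
        # Start at `initial` and decay asymptotically towards `floor`
        # (the decay_rate value is an assumption).
        def __init__(self, initial=0.015, floor=0.01, decay_rate=0.9995):
            self.initial, self.floor, self.decay_rate = initial, floor, decay_rate

        def __call__(self, step):
            step = tf.cast(step, tf.float32)
            return self.floor + (self.initial - self.floor) * tf.pow(self.decay_rate, step)

    optimizer = tf.keras.optimizers.SGD(learning_rate=AsymptoticDecay())
    # "Without mini batch" corresponds to feeding one example per step,
    # e.g. a tf.data pipeline batched to size 1 (see the reading sketch below).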

We use a single X-ray image and one CT scan per patient in the training process. The patients are split 90/10 into training and test sets. The 2D images and 3D scans are serialized as TFExample protocol buffers and stored in a TFRecord dataset for efficient training.
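
A minimal sketch of the serialization step follows; the feature names ("xray", "ct") and the tensor-serialization scheme are our assumptions, and training_pairs stands in for whatever iterable yields one (X-ray, CT) pair per patient.

    import tensorflow as tf

    def make_example(xray_2d, ct_3d):
        # Serialize one patient's X-ray/CT pair as a TFExample.
        def bytes_feature(tensor):
            value = tf.io.serialize_tensor(tensor).numpy()
            return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
        features = tf.train.Features(feature={
            "xray": bytes_feature(xray_2d),  # 2D image tensor
            "ct": bytes_feature(ct_3d),      # 3D volume tensor
        })
        return tf.train.Example(features=features)

    with tf.io.TFRecordWriter("train.tfrecord") as writer:
        for xray, ct in training_pairs:  # hypothetical iterable of pairs
            writer.write(make_example(xray, ct).SerializeToString())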

Training was run on Google Cloud AI-Platform using an n1-standard-1 worker and took approximately two days to complete 20 epochs.

Although the final results were obtained from the model trained on Google Cloud AI-Platform, initial training was performed on the Euler CPU/GPU supercomputer, where it ran for around 50 hours. Training on Euler gave us insight into the strengths and weaknesses of the model.

After the model started producing promising outputs, we adapted the Euler code to run on Google Cloud AI-Platform, where training was more efficient thanks to the use of TFRecords.
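
For reading, a tf.data pipeline like the following would recover the tensors; it mirrors the assumed feature names from the writing sketch above.

    def parse_example(record):
        # Parse one serialized TFExample back into (x-ray, CT) tensors.
        spec = {
            "xray": tf.io.FixedLenFeature([], tf.string),
            "ct": tf.io.FixedLenFeature([], tf.string),
        }
        parsed = tf.io.parse_single_example(record, spec)
        xray = tf.io.parse_tensor(parsed["xray"], out_type=tf.float32)
        ct = tf.io.parse_tensor(parsed["ct"], out_type=tf.float32)
        return xray, ct

    train_dataset = (tf.data.TFRecordDataset("train.tfrecord")
                     .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
                     .batch(1)  # SGD without mini-batching
                     .prefetch(tf.data.AUTOTUNE))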

WHY IS A 2D INPUT SUFFICIENT TO RECONSTRUCT A 3D VOLUME WITH DETAILS?

Before neural networks gained popularity in computer vision, there were various 3D-reconstruction approaches using techniques such as photometric stereo. Most of these approaches, however, attempt to reconstruct only a two-dimensional surface of a three-dimensional volume (e.g. the surface of a statue), rather than a volume with three-dimensional details (e.g. the entire statue). Without a priori knowledge, reconstructing a volume with three-dimensional details from two-dimensional input is, in some sense, "infeasible", since the information in two dimensions is only sufficient to reconstruct a two-dimensional surface.

If a priori knowledge is used, however, reconstructing the extra third dimension becomes feasible. The encoder of the model learns how to embed the 2D image into the embedding tensor, which requires the model to understand the anatomy of the human chest. Without this knowledge, the model cannot generate an embedding tensor from which the 3D volume can be reconstructed. Therefore, the reconstruction of the 3D volume makes use of such a priori anatomical knowledge.

Even though the model was first used to reconstruct mammalian crania (Henzler et al. 2018), crania are closer to a 2D surface (or a binary 3D image, since every voxel is either bone or air), and the volume is much simpler. In contrast, the human chest contains extremely complex three-dimensional details, so this project is a novel attempt to reconstruct an object with substantially more complicated three-dimensional details. To the best of our knowledge, a 3D reconstruction at this level of detail using this model has never been attempted before.

If a meaningful 3D volume of the human chest can be reconstructed, the model has likely "learned" normal human anatomy. The embedding tensor itself contains the information of the original input embedded within this anatomical framework, so the tensor can be reused in other tasks via another decoder, whether it is precomputed or not.
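
As a sketch of this reuse, assuming a trained encoder and a hypothetical task-specific new_decoder:

    def reuse_embedding(trained_encoder, new_decoder, xray_batch):
        # Freeze the trained encoder so the learned anatomy knowledge is
        # kept fixed, then feed its embedding to a different decoder head.
        trained_encoder.trainable = False
        embedding = trained_encoder(xray_batch, training=False)
        return new_decoder(embedding)  # e.g. a segmentation or diagnosis head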

Model graphic altered from Henzler et al. (2018).