This algorithm seeks to find a path that allows a robot to draw a portrait in a single stroke.
Our image processing pipeline:
Face Detection (A face crop with size 256x256 is produced)
Edge Extraction (Extract all the edge points in the image)
Minimum Spanning Tree (Treat all edge points as vertices in a complete graph and find minimum spanning tree)
Depth-first Search (Use depth-first search to traverse all the tree nodes)
Ramer-Douglas-Peucker Downsampling (Use the RDP downsampling method to simplify the path while maintaining the details of the portrait)
<LikeUnet CNN for Human Matting>
Our network design and training are based on the three papers below:
U-Net: Convolutional Networks for Biomedical Image Segmentation
Fully Convolutional Networks for Semantic Segmentation
ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks
With tools as powerful as Unet, image segmentation is a relatively easy job. Since we don't want a long training time or a huge model, LikeUnet is, well, like Unet, but with fewer convolution layers and kernels, and slightly deeper. Plus, it looks good.
The reference FCN-8s and Unet implementations are based on Caffe; FCN-8s has a model size above 500MB, and Unet around 180MB. Models that big are inefficient for such a simple task. By contrast, LikeUnet is implemented in Keras, and its model size is just over 7MB; the newest version is merely 1MB. This is achieved by cutting unnecessary convolution layers, which speeds up training and shrinks the model significantly. One limitation of Unet is that its input size is not flexible, whereas FCN does not require a specific input size as long as the height and width of the image are multiples of 32. On the other hand, Unet has better binary segmentation accuracy than FCN. So it is reasonable to base our network on Unet and adjust it with the strengths of FCN. Judging by the training and validation errors and mIoUs, this design takes the best of both networks, and LikeUnet is so far a pretty good network for human matting.
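For illustration, here is a minimal Keras sketch of a scaled-down U-Net in the spirit described above. The depth and filter counts are assumptions for the example, not LikeUnet's actual configuration (see the architecture figure below):

```python
from tensorflow import keras
from tensorflow.keras import layers

def tiny_unet(filters=(8, 16, 32, 64)):
    # Flexible input size, FCN-style; height and width must be
    # multiples of 2**(number of poolings) = 8 with these settings.
    inp = keras.Input(shape=(None, None, 3))
    x, skips = inp, []
    for f in filters[:-1]:                                        # encoder
        x = layers.Conv2D(f, 3, padding='same', activation='relu')(x)
        skips.append(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters[-1], 3, padding='same', activation='relu')(x)
    for f, skip in zip(reversed(filters[:-1]), reversed(skips)):  # decoder
        x = layers.Conv2DTranspose(f, 2, strides=2, padding='same')(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(f, 3, padding='same', activation='relu')(x)
    out = layers.Conv2D(1, 1, activation='sigmoid')(x)            # binary matte
    return keras.Model(inp, out)
```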
Apart from the network design, there are some training techniques worth mentioning. We borrow an idea from bootstrap techniques: since the network does not require a specific image size, we sample our data and resize it to different aspect ratios. We also randomly crop boxes out of an image, in the hope that the network learns to segment incomplete figures as well. Every batch has a single size, but over one epoch the network sees many different sizes. Although resampling the dataset increases both the training and validation errors a bit, the network performs better in real-world deployment. The network also adopts a method introduced by ACNet: replacing each 3x3 conv with parallel 3x3 + 1x3 + 3x1 convs during the training phase can improve model performance, and it did in our tests. The fantastic part comes afterwards. Because convolution is a linear operator, the weights of the three parallel conv layers can be merged by hand into a single conv layer before deployment, and the new layer is exactly equivalent to the original three. Performance improves, yet the runtime stays the same. Last but not least, at the end of every epoch the data is shuffled to reduce the impact of potential correlations in the data.
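Here is a minimal single-channel numpy sketch of the branch fusion (ignoring batch-norm folding, which a real ACNet deployment would also merge in): pad the 1x3 and 3x1 kernels into the center row and column of the 3x3 kernel and add them up. Linearity of convolution guarantees the fused kernel gives exactly the same output as the sum of the three branches:

```python
import numpy as np
from scipy.signal import correlate2d

k3x3 = np.random.randn(3, 3)
k1x3 = np.random.randn(1, 3)
k3x1 = np.random.randn(3, 1)

fused = k3x3.copy()
fused[1, :] += k1x3[0]        # pad the 1x3 kernel into the center row
fused[:, 1] += k3x1[:, 0]     # pad the 3x1 kernel into the center column

x = np.random.randn(16, 16)
three_branches = (correlate2d(x, k3x3, mode='same')
                  + correlate2d(x, k1x3, mode='same')
                  + correlate2d(x, k3x1, mode='same'))
one_layer = correlate2d(x, fused, mode='same')
assert np.allclose(three_branches, one_layer)  # identical outputs
```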
LikeUnet Architecture
Result
<Haar Cascade Face Detection>
Using the Haar cascade frontal-face CascadeClassifier provided by opencv-python, we can find a smaller rectangle that contains the most significant face in the whole image. The reason we do human matting before this step is to make it easier for a traditional model like the Haar cascade to detect human faces.
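A minimal sketch of this step with opencv-python's bundled cascade file; the input file name and detection parameters are placeholder values, not the project's exact settings:

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

img = cv2.imread('matted.png')                    # hypothetical matted image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

if len(faces):
    # Keep the largest detection as the "most significant" face.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    face_crop = cv2.resize(img[y:y + h, x:x + w], (256, 256))
```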
<Canny Edge Detector>
The Canny detector is provided by the Canny function in opencv-python.
The Canny edge detection algorithm can be broken down into five steps (a code sketch follows the list):
Apply a Gaussian filter to smooth the image and remove noise. In our implementation, we additionally apply a bilateral filter beforehand, which removes noise while preserving edges as much as possible.
Find the intensity gradients of the image: specifically, the gradient magnitude and direction computed from the horizontal and vertical derivatives.
Apply non-maximum suppression to get rid of spurious responses to edge detection. It is like max pooling without the pooling: keeping only the local maxima stabilizes the result.
Apply an empirically determined double threshold to find potential edges. If an edge pixel's gradient value is higher than the high threshold, it is marked as a strong edge pixel. If it lies between the high and low thresholds, it is marked as a weak edge pixel. If it is below the low threshold, it is suppressed.
Track edges by hysteresis: finalize the detection by suppressing all weak edges that are not connected to strong edges. A weak edge that is isolated and far from any strong edge is suppressed; a weak edge that is near or connected to a strong edge is kept as an edge.
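A minimal sketch of the edge-extraction step, assuming a hypothetical input file; the fixed thresholds here are placeholders for the Otsu-derived values described in the next section:

```python
import cv2

gray = cv2.imread('face_crop.png', cv2.IMREAD_GRAYSCALE)
smooth = cv2.bilateralFilter(gray, 9, 75, 75)   # denoise, keep edges sharp
edges = cv2.Canny(smooth, 50, 150)              # placeholder thresholds
```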
<Otsu's Method for Image Thresholding>
Separate the pixels into two groups by their intensity values and use the means of the two groups as the Canny thresholds.
One downside of the Canny detector is that the two thresholds must be set by hand. They can be hard to find, and the best values vary from image to image. So we use Otsu's method for image thresholding to separate the pixel intensities into two groups. The idea behind the method is to exhaustively search every candidate intensity threshold for the split that maximizes the between-group variance (equivalently, minimizes the within-group variance). Once the pixel intensities are divided into two groups, we take the mean of each group as the low and high thresholds for the Canny detector. Experiments show that this method consistently generates high-quality edges.
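A sketch of one way to implement this with opencv-python, following the description above (split at the Otsu threshold, then take each group's mean); it mirrors the description rather than the project's exact code:

```python
import cv2

gray = cv2.imread('face_crop.png', cv2.IMREAD_GRAYSCALE)
smooth = cv2.bilateralFilter(gray, 9, 75, 75)

# Otsu's split point over the intensity histogram.
otsu_t, _ = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

low = gray[gray <= otsu_t].mean()     # mean of the darker group
high = gray[gray > otsu_t].mean()     # mean of the brighter group
edges = cv2.Canny(smooth, low, high)
```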
Otsu's method visualization
<Prim's Algorithm for Minimum Spanning Tree>
After we extract all the edge points, we don't yet have a graph, and we don't know how to go through all the edges in one path so that the drawing looks like a real portrait. Intuitively, if two edge points are close, they should be connected; if they are far away from each other, they should not be connected in the drawing. The minimum spanning tree provides exactly what we ask for: if we treat all edge points as vertices of a complete graph weighted by the Euclidean distance between points, the minimum spanning tree connects edge points that are close to each other and discards the lines between far-away points.
We choose Prim's algorithm, a greedy algorithm, to generate the minimum spanning tree, for its excellent performance on dense graphs. Prim's algorithm starts at an arbitrary initial vertex, repeatedly adds the cheapest edge connecting the current tree to a vertex outside it, and continues expanding the tree until all vertices are included. Though we could implement the algorithm ourselves, Python is a relatively slow language for computationally expensive tasks. What's more, we find that scipy's minimum_spanning_tree implementation is pretty fast (it runs compiled code), so we didn't bother writing our own implementation of Prim's algorithm.
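A minimal sketch of this step with scipy under the formulation above; the random points stand in for the (row, col) coordinates of Canny edge pixels:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

points = np.random.rand(200, 2)        # stand-in for edge-pixel coordinates
dists = squareform(pdist(points))      # dense N x N Euclidean distance matrix
mst = minimum_spanning_tree(dists)     # sparse matrix of the tree's edges
```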
Prim's algorithm visualization
<Depth-first Search>
Depth-first search is an algorithm for traversing or searching tree or graph data structures. It starts at the root node (for a minimum spanning tree, which has no root node, an arbitrary node is selected as the root) and explores as far as possible along each branch before backtracking. DFS is how we generate one path that goes through every vertex and edge.
(Note: the path generated by this method may traverse most of the edges twice. That is a drawback, since it takes longer to draw, but it also adds redundancy to the system.)
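A minimal sketch of the traversal: appending a vertex again whenever the search backtracks through it produces one continuous path that covers every tree edge, most of them twice, as the note above says:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def dfs_path(adj, root=0):
    visited = [False] * adj.shape[0]
    path = []

    def visit(u):
        visited[u] = True
        path.append(u)
        for v in np.flatnonzero(adj[u]):   # neighbors in the tree
            if not visited[v]:
                visit(int(v))
                path.append(u)             # backtrack through u
    visit(root)
    return path

points = np.random.rand(50, 2)             # stand-in for edge pixels
mst = minimum_spanning_tree(squareform(pdist(points))).toarray()
order = dfs_path(mst + mst.T)              # symmetrize: undirected tree
```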
<Ramer-Douglas-Peucker Algorithm for Downsampling>
The input parameters are the starting curve (an ordered set of points or lines) and the maximum tolerance ε > 0.
The algorithm recursively divides the line. Initially, it is given all the points between the first and last point, and it marks the first and last point as kept. It then finds the point farthest from the line segment whose endpoints are the first and last points, by computing the error vector of the projection: project all the points onto the vector defined by the first and last point; the error vector is the difference between the original points and their projections. Computing distances this way speeds up the computation and keeps the code easy to understand. If the farthest point is closer than ε to the line segment, then any points not currently marked as kept can be discarded without making the simplified curve worse than ε. Otherwise, that farthest point must be kept, and the algorithm recursively calls itself on the points from the first to the farthest point and on those from the farthest to the last point.
When the recursion is completed, a new output curve can be generated consisting of all and only those points that have been marked as kept.
This algorithm simplifies straight stretches of the path while maintaining all the details in the curves.
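A minimal numpy sketch of the algorithm as described, using the vectorized projection-error trick (not necessarily the project's exact implementation):

```python
import numpy as np

def rdp(points, eps):
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    chord = points[-1] - points[0]          # first-to-last segment
    rel = points - points[0]
    denom = chord @ chord
    if denom == 0:                          # degenerate: closed loop
        d = np.linalg.norm(rel, axis=1)
    else:
        t = rel @ chord / denom             # projection coefficients
        d = np.linalg.norm(rel - np.outer(t, chord), axis=1)  # error vectors
    i = int(np.argmax(d))
    if d[i] <= eps:                         # everything within tolerance
        return points[[0, -1]]
    left = rdp(points[:i + 1], eps)         # first .. farthest
    right = rdp(points[i:], eps)            # farthest .. last
    return np.vstack([left[:-1], right])

path = np.cumsum(np.random.randn(500, 2), axis=0)  # a random wiggly path
simplified = rdp(path, eps=2.0)
```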
(Personally, I consider the implementation of this algorithm the most ingenious one in the entire image pipeline :)
Ramer-Douglas-Peucker algorithm visualization
Edge detection
Original path generated by minimum spanning tree and depth-first search
RDP downsampled path