RBOX

When training a model for text recognition, an important step is labeling the training data.

The labels must tell the model where the text is in an image, that is, the bounding box of each text region. RBOX is one way to encode the bounding box.

https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxoZWxsb2JlbmNoZW58Z3g6MTU2MDY1ZmRlMDAzNzg2NA

IOU loss and Angle Loss

The RBOX encoding has 5 numbers (d1, d2, d3, d4, angle). d1 to d4 are the distances from a pixel to the top, bottom, left and right edges of the rectangle, and angle is the rotation angle of the rectangle.
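To make the encoding concrete, here is a minimal sketch for the simplest case: an axis-aligned rectangle (angle = 0) and a pixel inside it. The function name `encode_rbox`, its parameters, and the image-coordinate convention (y grows downward) are assumptions for illustration, not part of any particular library.

```python
def encode_rbox(px, py, left, top, right, bottom, angle=0.0):
    """Return (d1, d2, d3, d4, angle) for a pixel inside an axis-aligned box.

    d1..d4 follow the order used in this note: top, bottom, left, right.
    Assumes image coordinates, with y growing downward.
    """
    if not (left <= px <= right and top <= py <= bottom):
        raise ValueError("pixel must lie inside the rectangle")
    d1 = py - top      # distance to the top edge
    d2 = bottom - py   # distance to the bottom edge
    d3 = px - left     # distance to the left edge
    d4 = right - px    # distance to the right edge
    return (d1, d2, d3, d4, angle)


# Pixel (30, 20) inside a box spanning x in [10, 110], y in [5, 45].
print(encode_rbox(30, 20, left=10, top=5, right=110, bottom=45))
# (15, 25, 20, 80, 0.0)
```

Note that d1 + d2 gives the rectangle height and d3 + d4 gives its width, which is what the area formulas in this note rely on.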

The loss is calculated with respect to both the rotation angle and the bounding rectangle.

Let the ground truth distances and rotation angle be d1_gt, d2_gt, d3_gt, d4_gt and angle_gt.

Let the prediction be d1_pred, d2_pred, d3_pred, d4_pred and angle_pred.

The loss function for the rotation angle is

    angle_loss = 1 - cos(angle_pred - angle_gt)

So when the angle difference is 0 the loss is 0, and when the difference is near +/-90 degrees the loss is near 1. The angle ranges from -90 to +90 degrees.
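The angle loss can be sketched in NumPy as follows. The helper name `angle_loss` and the sample angles are made up for illustration; the one real assumption is that angles are converted to radians before calling `np.cos`.

```python
import numpy as np


def angle_loss(angle_pred, angle_gt):
    """Element-wise 1 - cos(angle_pred - angle_gt); angles are in radians."""
    diff = np.asarray(angle_pred, dtype=float) - np.asarray(angle_gt, dtype=float)
    return 1.0 - np.cos(diff)


# Differences of 0, 5 and 90 degrees.
gt = np.deg2rad([0.0, 10.0, -30.0])
pred = np.deg2rad([0.0, 15.0, 60.0])
print(angle_loss(pred, gt))
```

The loss stays very small for small differences (about 0.004 at 5 degrees) and reaches 1 at 90 degrees.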

Assuming the ground truth rectangle and the prediction rectangle have the same rotation angle and are measured from the same pixel, the overlapped area is

    intersect_area = (min(d3_gt, d3_pred) + min(d4_gt, d4_pred)) * (min(d1_gt, d1_pred) + min(d2_gt, d2_pred))

The ground truth area and the prediction area are

    gt_area = (d1_gt + d2_gt) * (d3_gt + d4_gt)

    pred_area = (d1_pred + d2_pred) * (d3_pred + d4_pred)

The union of the ground truth area and the prediction area is

    union_area = gt_area + pred_area - intersect_area

The IOU loss is

    iou_loss = -log((intersect_area + 1)/(union_area + 1))

The + 1 in the numerator and denominator keeps the ratio defined when the areas are 0.
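Putting the formulas above together, a minimal NumPy sketch of the IOU loss might look like this. The function name `iou_loss` and the array layout (last axis holds d1..d4 in the order top, bottom, left, right) are assumptions for illustration.

```python
import numpy as np


def iou_loss(d_gt, d_pred):
    """-log of the smoothed IoU; the last axis of each input holds (d1, d2, d3, d4).

    Assumes both rectangles share the same rotation angle and are measured
    from the same pixel, so the overlap follows from element-wise minimums.
    """
    d1_gt, d2_gt, d3_gt, d4_gt = np.moveaxis(np.asarray(d_gt, dtype=float), -1, 0)
    d1_pr, d2_pr, d3_pr, d4_pr = np.moveaxis(np.asarray(d_pred, dtype=float), -1, 0)

    height = np.minimum(d1_gt, d1_pr) + np.minimum(d2_gt, d2_pr)
    width = np.minimum(d3_gt, d3_pr) + np.minimum(d4_gt, d4_pr)
    intersect_area = width * height

    gt_area = (d1_gt + d2_gt) * (d3_gt + d4_gt)
    pred_area = (d1_pr + d2_pr) * (d3_pr + d4_pr)
    union_area = gt_area + pred_area - intersect_area

    return -np.log((intersect_area + 1.0) / (union_area + 1.0))


gt = [10, 10, 20, 20]                 # 20 x 40 rectangle, area 800
print(iou_loss(gt, gt))               # identical boxes: the ratio is 1
print(iou_loss(gt, [5, 5, 20, 20]))   # half the height: -log(401 / 801)
```

A perfect prediction gives a loss of 0, and the loss grows as the overlap shrinks relative to the union.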


Dice loss

The ground truth is also translated into a score map, which is a matrix with the rectangle area (actually a shrunk version of the rectangle) set to 1 and the rest set to 0.

The model also predicts a score map. Denote the two maps as gt_score and pred_score.
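As an illustration of what a ground truth score map looks like, here is a small sketch for an axis-aligned rectangle. The function `make_score_map` and its `shrink` parameter are hypothetical: this note does not say how much the rectangle is shrunk, so the rule used here (move each edge inward by a fraction of the shorter side) is only an assumption.

```python
import numpy as np


def make_score_map(height, width, left, top, right, bottom, shrink=0.3):
    """Binary score map: 1 inside the shrunk rectangle, 0 elsewhere.

    `shrink` is a hypothetical parameter: each edge moves inward by
    shrink * (shorter side length). Bounds are inclusive pixel indices.
    """
    margin = int(round(shrink * min(right - left, bottom - top)))
    score = np.zeros((height, width), dtype=np.float32)
    score[top + margin:bottom - margin + 1, left + margin:right - margin + 1] = 1.0
    return score


gt_score = make_score_map(8, 12, left=1, top=1, right=10, bottom=6, shrink=0.2)
print(gt_score)
```

In this example the margin is one pixel, so the 1-valued region covers rows 2 to 5 and columns 2 to 9.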

    intersect = sum(gt_score * pred_score)

    union = sum(gt_score) + sum(pred_score)

The intersect is the element-wise product of the two score matrices, summed over all elements.

The union is simply the sum of all elements of the two matrices.

The Dice loss is

    dice_loss = 1 - 2 * intersect / union

When gt_score and pred_score fully overlap, the intersect is half of the union, so the factor 2 makes the ratio equal to 1.

In this way, 2 * intersect / union ranges from 0 to 1, and the loss ranges from 1 to 0 as the overlap grows.

A small constant such as 1e-5 is usually added to the union to avoid a division-by-zero error.
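A minimal NumPy sketch of the Dice loss, including the small constant in the denominator. The function name `dice_loss` and the toy score maps are illustrative only.

```python
import numpy as np


def dice_loss(gt_score, pred_score, eps=1e-5):
    """1 - 2 * intersect / union, with eps added to the union to avoid 0 / 0."""
    gt_score = np.asarray(gt_score, dtype=float)
    pred_score = np.asarray(pred_score, dtype=float)
    intersect = np.sum(gt_score * pred_score)
    union = np.sum(gt_score) + np.sum(pred_score) + eps
    return 1.0 - 2.0 * intersect / union


gt_score = np.array([[0, 1, 1],
                     [0, 1, 1]])
print(dice_loss(gt_score, gt_score))       # full overlap: close to 0
print(dice_loss(gt_score, 1 - gt_score))   # no overlap: 1.0
```

With eps in place, two all-zero maps give a loss of 1 instead of raising a division error.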

Combine IOU loss, Angle Loss and Dice Loss

All three losses need to be combined into a single loss function in order to train a model.

For the IOU and angle losses, only pixels within the rectangles (text areas) are included, so

    iou_loss = sum(iou_loss * gt_score) / sum(gt_score)

This aggregates the element-wise loss into a scalar. Dividing by sum(gt_score) normalizes the loss, so that larger text areas do not have a bigger impact on the loss.

Similarly,

    angle_loss = sum(angle_loss * gt_score) / sum(gt_score)

The geometry related loss is

    geo_loss = weight * angle_loss + iou_loss

The weight adjusts the importance of the angle loss. The default weight appears to be 10.

Finally,

    total_loss = geo_loss + dice_loss
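The whole combination can be sketched in one function. This is a rough sketch under stated assumptions: the name `total_loss`, the argument names, and the default `weight=10.0` (the tentative value mentioned in this note) are illustrative, and the per-pixel IOU and angle loss maps are assumed to be computed beforehand. It also assumes gt_score contains at least one text pixel.

```python
import numpy as np


def total_loss(gt_score, pred_score, iou_loss_map, angle_loss_map,
               weight=10.0, eps=1e-5):
    """Combine Dice, IOU and angle losses into one scalar.

    All inputs are arrays of the same shape (one value per pixel).
    Assumes sum(gt_score) > 0, i.e. the image contains some text.
    """
    gt_score = np.asarray(gt_score, dtype=float)
    pred_score = np.asarray(pred_score, dtype=float)

    # Dice loss on the score maps.
    intersect = np.sum(gt_score * pred_score)
    union = np.sum(gt_score) + np.sum(pred_score) + eps
    dice = 1.0 - 2.0 * intersect / union

    # Average the geometry losses over text pixels only.
    n_text = np.sum(gt_score)
    iou = np.sum(np.asarray(iou_loss_map) * gt_score) / n_text
    angle = np.sum(np.asarray(angle_loss_map) * gt_score) / n_text

    geo = weight * angle + iou
    return geo + dice


gt_score = np.array([[1.0, 1.0], [0.0, 0.0]])
iou_map = np.array([[0.2, 0.4], [5.0, 5.0]])       # bottom row is masked out
angle_map = np.array([[0.01, 0.03], [9.0, 9.0]])   # bottom row is masked out
print(total_loss(gt_score, gt_score, iou_map, angle_map))
```

In this example the large values in the bottom row do not affect the result, because gt_score is 0 there; the total is roughly 10 * 0.02 + 0.3 plus a near-zero Dice term.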